CN107172485B - method and device for generating short video and input equipment - Google Patents


Info

Publication number
CN107172485B
Authority
CN
China
Prior art keywords
picture
voice
information
pictures
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710278060.3A
Other languages
Chinese (zh)
Other versions
CN107172485A (en)
Inventor
门文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710278060.3A priority Critical patent/CN107172485B/en
Publication of CN107172485A publication Critical patent/CN107172485A/en
Application granted granted Critical
Publication of CN107172485B publication Critical patent/CN107172485B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233Character input methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4318Generation of visual interfaces for content selection or interaction; Content or additional data rendering by altering the content in the rendering process, e.g. blanking, blurring or masking an image region
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention aims to provide a method and a device for generating a short video. One or more pictures and one or more pieces of voice information for those pictures are acquired from a user; subtitle information corresponding to the voice information is determined according to the content of the voice information; a display special effect corresponding to the picture and/or the subtitle information is determined according to the voice features and/or the semantic features of the voice information; and a short video is generated from the pictures and the voice information according to the subtitle information and the display special effect.

Description

Method and device for generating a short video, and input device
Technical Field
The invention relates to the technical field of computers, and in particular to a technology for generating short videos.
Background
In the prior art, input methods mainly provide expressions in the form of pictures such as emoji, characters, emoticons, and GIFs; in user-generated-content scenarios such as forums and microblogs, expressions likewise take the form of various static or dynamic pictures. These expressions are fixed single pictures or GIF pictures, which a user directly selects when sending an expression.
However, since the content of an emoticon is fixed, if the user wants to supplement the emoticon or to express the corresponding emotion more vividly, the user must separately input text, voice, or the like to reinforce the emotional expression. Alternatively, the user may create a short video as a new emoticon, but typical short-video creation requires the user to shoot a short video and convert it into an emoticon, or to combine multiple pictures, videos, audio tracks, or subtitles using image-processing software, which is too complicated for the user and more costly than directly sending an emoticon. Therefore, users remain accustomed to expressing content with separate emoticons, voice, text, and the like, resulting in low expression efficiency and a monotonous form of expression.
Disclosure of Invention
The invention aims to provide a method and a device for generating a short video.
According to one aspect of the invention, a method for generating a short video is provided, wherein the method comprises the steps of:
a, acquiring, from a user, one or more pictures and one or more pieces of voice information for the one or more pictures;
b, determining subtitle information corresponding to the voice information according to the content of the voice information;
c, determining a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information;
and d, generating the short video from the picture and the voice information according to the subtitle information and the display special effect.
Optionally, the step c includes:
- determining a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information and in combination with the picture features of the picture.
Optionally, the step c includes:
determining a display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information and by combining the voice length of the voice information.
Optionally, the method further comprises:
x, determining one or more related pictures associated with the picture according to the voice information and the picture;
wherein, the method also comprises:
- generating one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
Optionally, the step x includes:
x1, determining the number of related pictures associated with the picture;
x2, determining one or more related pictures associated with the picture according to the voice information, the picture, and the number of related pictures.
Optionally, the step x1 includes at least any one of the following:
-determining a number of relevant pictures associated with the picture in dependence of a speech length of the speech information;
-determining a number of relevant pictures associated with the picture in dependence on a speech feature of the speech information;
-determining a number of relevant pictures associated with the picture in dependence on semantic features of the speech information.
Optionally, the method further comprises:
- obtaining one or more pieces of historical voice information of the user, and determining a user voice feature library corresponding to the user;
wherein the step c comprises:
-determining speech features corresponding to the speech information from the user speech feature library;
-determining a presentation special effect corresponding to the picture and/or the subtitle information according to the voice feature and/or the semantic feature of the voice information.
Optionally, the method further comprises:
- converting the short video into one or more application-usable formats according to the configuration information of the application corresponding to the short video;
- adding the short video in the application-usable format to the application.
Optionally, the display special effect includes one or more dynamic effects.
According to another aspect of the present invention, there is also provided a generation apparatus for generating a short video, wherein the generation apparatus includes:
an acquiring means for acquiring, from the user, one or more pictures and one or more pieces of voice information for the one or more pictures;
the caption determining device is used for determining caption information corresponding to the voice information according to the content of the voice information;
the special effect determining device is used for determining a display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information;
and the video generating device is used for generating the short video from the picture and the voice information according to the subtitle information and the display special effect.
Optionally, the special effect determining device is configured to:
-determining a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information and according to the picture features of the picture.
Optionally, the special effect determining device is configured to:
determining a display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information and by combining the voice length of the voice information.
Optionally, the generating device further includes:
a related picture determining means for determining one or more related pictures associated with the picture according to the voice information and the picture;
wherein the generating means further comprises:
and a related video generating means for generating one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
Optionally, the related picture determining apparatus includes:
a number determination unit for determining a number of related pictures associated with the picture;
and an association determining unit for determining one or more related pictures associated with the picture according to the voice information, the picture, and the number of related pictures.
Optionally, the number determining unit is used for at least any one of the following:
-determining a number of relevant pictures associated with the picture in dependence of a speech length of the speech information;
-determining a number of relevant pictures associated with the picture in dependence on a speech feature of the speech information;
-determining a number of relevant pictures associated with the picture in dependence on semantic features of the speech information.
Optionally, the generating device further includes:
a history obtaining means for obtaining one or more pieces of historical voice information of the user and determining a user voice feature library corresponding to the user;
wherein the special effect determining means is configured to:
-determining speech features corresponding to the speech information from the user speech feature library;
-determining a presentation special effect corresponding to the picture and/or the subtitle information according to the voice feature and/or the semantic feature of the voice information.
Optionally, the generating device further includes:
a converting means for converting the short video into one or more application-usable formats according to the configuration information of the application corresponding to the short video;
an adding means for adding the short video in the application-usable format to the application.
Optionally, the display special effect includes one or more dynamic effects.
According to a further aspect of the invention, there is also provided an input device, including the generation apparatus as described in any of the above.
Compared with the prior art, the method and the device acquire one or more pictures and one or more pieces of voice information for those pictures from a user, determine subtitle information corresponding to the voice information according to the content of the voice information, determine a display special effect corresponding to the pictures and/or the subtitle information according to the voice features and/or the semantic features of the voice information, and generate a short video from the pictures and the voice information according to the subtitle information and the display special effect.
Moreover, the method and the device can also determine the display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information and by combining with the picture characteristics of the picture, or determine the display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information and by combining with the voice length of the voice information.
Moreover, the invention can also determine one or more related pictures associated with the picture according to the voice information and the picture, and generate one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect. The invention can thus generate a variety of related short videos for users, reduce the effort of searching for pictures, improve the efficiency of acquiring information, and provide more choices, further improving attractiveness and user experience.
Moreover, the method can also obtain one or more pieces of historical voice information of the user, determine a user voice feature library corresponding to the user, determine the voice features corresponding to the voice information according to the user voice feature library, and determine the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information.
The short video can also be converted into one or more application-usable formats according to the configuration information of the application corresponding to the short video, and added to the application in the application-usable format, thereby enriching the forms of information expression within the application and making the user's expression richer and more attractive.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 shows a schematic diagram of a generation apparatus for generating a short video according to one aspect of the present invention;
FIG. 2 shows a schematic diagram of a generation apparatus for generating a short video according to a preferred embodiment of the present invention;
FIG. 3 shows a flowchart of a method for generating a short video according to another aspect of the present invention;
FIG. 4 shows a flowchart of a method for generating a short video according to a preferred embodiment of the present invention.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in greater detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts.
The "generating device" is a "computer device", also referred to as a "computer": an intelligent electronic device that can execute predetermined processes such as numerical and/or logical calculation by running predetermined programs or instructions. It may include a processor and a memory, where the processor executes pre-stored instructions in the memory to carry out the predetermined process, or carries out the predetermined process by hardware such as an ASIC, an FPGA, a DSP, or a combination thereof.
The computer device includes user equipment and/or network equipment. The user equipment includes, but is not limited to, a computer, a smart phone, a PDA, and the like; the network equipment includes, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The computer device can operate independently to implement the invention, or can access a network and implement the invention through interaction with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN, and the like.
Those skilled in the art should understand that the "generating device" described in the present invention may be only a user equipment, i.e. the user equipment performs corresponding operations; or the user equipment and the network equipment or the server are integrated to form the system, namely the user equipment and the network equipment are matched to execute corresponding operations.
It should be noted that the user equipment, the network device, the network, etc. are only examples, and other existing or future computer devices or networks may also be included in the scope of the present invention, and are included by reference.
It should be noted that the "generating apparatus" according to the present invention may be included in various devices (e.g., an input device), various applications (e.g., an input method), or an apparatus including various applications (e.g., an apparatus included in an input method). The generating device of the present invention may be pre-installed in the computer device by a manufacturer or a vendor of the computer device, or may be loaded from a server to the computer device by the computer device. It should be understood by those skilled in the art that any means which can be used to implement the functions of the present invention, whether loaded into a computer device or not, is within the scope of the present invention.
Here, it should be understood by those skilled in the art that the present invention can be applied to both mobile and non-mobile terminals; for example, whether a user uses a mobile phone or a PC, the method or apparatus of the present invention can be used.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present invention. The present invention may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It is to be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms.
As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that in alternative implementations, the functions/acts noted may occur out of the order noted in the figures; for example, two operations shown in succession may in fact be executed substantially concurrently, or the acts may sometimes be executed in the reverse order, depending upon the functions/acts involved.
The present invention is described in further detail with reference to the attached figures.
Fig. 1 shows a schematic diagram of a generation apparatus for generating a short video according to one aspect of the invention, wherein the generation apparatus comprises an acquiring device 1, a subtitle determining device 2, a special effect determining device 3, and a video generating device 4.
Specifically, the acquiring device 1 acquires one or more pictures and one or more pieces of voice information for the one or more pictures from a user; the subtitle determining device 2 determines subtitle information corresponding to the voice information according to the content of the voice information; the special effect determining device 3 determines a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information; and the video generating device 4 generates a short video from the picture and the voice information according to the subtitle information and the display special effect.
The acquiring device 1 acquires one or more pictures and one or more pieces of voice information for the one or more pictures from the user.
Specifically, the acquiring device 1 can acquire the one or more pictures by calling a built-in default picture, obtaining a picture that the user uploads or shoots, searching for a picture on the network, downloading a picture, and the like.
The acquiring device 1 acquires the one or more pieces of voice information for the one or more pictures from the user by real-time recording, calling a historical recording, or the like. Those skilled in the art will understand that one picture can correspond to one or more pieces of voice information, and one piece of voice information can also correspond to one or more pictures.
For example, the user selects one built-in picture and presses a recording key to record a piece of voice information, so that the voice information corresponds to the picture; the user then records another piece of voice information, so that both pieces of voice information correspond to the picture.
Or, for example, the user downloads two pictures from the web, selects them both at the same time, and associates them with another piece of voice information, which is then associated with both pictures simultaneously.
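The many-to-many association between pictures and voice information described in the two examples above can be sketched as follows. This is a minimal illustration; the identifiers and the pair-list representation are assumptions, not structures specified by the patent.

```python
# Each association pairs one picture with one piece of voice information;
# selecting several pictures and one voice clip (or vice versa) produces
# every cross pair, as in the examples above.
associations = []

def associate(picture_ids, voice_ids):
    """Associate every selected picture with every selected voice clip."""
    for p in picture_ids:
        for v in voice_ids:
            associations.append((p, v))

associate(["pic1"], ["voice1", "voice2"])  # one picture, two voice clips
associate(["pic2", "pic3"], ["voice3"])    # two pictures, one voice clip
print(associations)
```

The flat pair list keeps the association symmetric: neither pictures nor voice clips are privileged as the "owner" of the other.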
The caption determining means 2 determines caption information corresponding to the voice information based on the content of the voice information.
Specifically, the caption determining means 2 recognizes the voice content of the voice information by voice recognition, and then determines the text corresponding to the recognized voice content as the caption information corresponding to the voice information.
Preferably, the caption determining device 2 may further determine whether to divide the caption information into lines, etc., in combination with the length of the voice information; the caption determining device 2 may determine the content of punctuation, lines, etc. in the caption information according to the voice content of the voice information and by combining the voice characteristics, such as tone, rhythm, etc., in the voice information; the subtitle determining apparatus 2 may further interact with the user to provide a proofreading input function for the user, so that the user can proofread the subtitle information.
Preferably, the caption determining device 2 may perform the analysis of voice features, voice length, and the like by itself; or it may interact with the special effect determining device 3 and iteratively adjust the line breaks, punctuation, and other content of the subtitle information according to the latter's feedback on voice features, semantic features, display special effects, and the like.
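The line-division behavior described above can be sketched as follows: subtitle lines are broken wherever the recognized speech contains a sufficiently long pause. The word-timestamp format and the 0.5-second threshold are illustrative assumptions, not values from the patent; real timestamps would come from a speech recognizer.

```python
def split_subtitle_lines(words, pause_threshold=0.5):
    """Group (word, start, end) tuples into subtitle lines, breaking
    whenever the silence between two words exceeds pause_threshold (s)."""
    lines, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold:
            lines.append(" ".join(current))
            current = []
        current.append(word)
        prev_end = end
    if current:
        lines.append(" ".join(current))
    return lines

# Hypothetical recognizer output: a 0.9 s pause separates the two phrases.
recognized = [("I", 0.0, 0.2), ("am", 0.25, 0.4), ("happy", 0.45, 0.9),
              ("really", 1.8, 2.1), ("happy", 2.15, 2.6)]
print(split_subtitle_lines(recognized))  # ['I am happy', 'really happy']
```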
The special effect determining device 3 determines the display special effect corresponding to the picture and/or the subtitle information according to the voice feature and/or the semantic feature of the voice information.
Specifically, the special effect determination device 3 analyzes the voice information to determine the voice feature and/or semantic feature of the voice information.
The voice features include, but are not limited to, pitch, rhythm, timbre, and the like. For example, waveform analysis of the voice information reveals the rise and fall of the voice and/or its rhythm; analysis of the frequency spectrum and/or spectrogram of the voice information reveals its timbre, such as rough, fine, deep, childlike ("baby voice"), crisp, and the like. Since the pitch, rhythm, and timbre of the voice information change constantly, changes in the user's delivery, such as a sudden increase or decrease in volume, can also be determined from these changes.
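A minimal sketch of the waveform analysis mentioned above, detecting a sudden volume increase from per-frame RMS energy. The frame size and the 2x jump ratio are assumed values for illustration; a real system would work on decoded audio samples at the recording's sample rate.

```python
import math

def frame_rms(samples, frame_size):
    """RMS energy of each fixed-size frame of a mono sample sequence."""
    return [math.sqrt(sum(s * s for s in samples[i:i + frame_size]) / frame_size)
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def sudden_volume_jumps(samples, frame_size=4, ratio=2.0):
    """Indices of frames whose RMS is at least `ratio` times the previous
    frame's, i.e. points where the user suddenly raises their volume."""
    rms = frame_rms(samples, frame_size)
    return [i for i in range(1, len(rms))
            if rms[i - 1] > 0 and rms[i] / rms[i - 1] >= ratio]

quiet = [0.1, -0.1, 0.1, -0.1]
loud = [0.8, -0.8, 0.8, -0.8]
print(sudden_volume_jumps(quiet + quiet + loud))  # [2]
```

The same per-frame statistics could feed pitch or rhythm estimation; only the volume case is shown here.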
The semantic features are the meaning of what the user says. For example, "I am happy" expresses a positive emotion, while "this is not good" expresses a negative emotion, and so on.
Then, the special effect determining device 3 selects one or more of the preset special effects as the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information, or obtains the display special effect corresponding to the picture and/or the subtitle information by interacting with a server or another third-party device.
The display special effect includes a special effect acting on the picture, a special effect acting on the subtitle information, or a special effect acting on both the picture and the subtitle information simultaneously. Display special effects include, but are not limited to, static effects and/or dynamic effects. Static effects acting on the subtitle information include font, color, and the like; static effects acting on the picture include adding a decorative picture, adding decorative text, adding a picture texture, changing the picture's color, and the like. Dynamic effects include, but are not limited to, fading, floating, blinking, and the like.
For example, if the voice features indicate that the user's voice grows louder and softer, the displayed subtitles grow larger and smaller accordingly; if the voice features indicate that the user speaks in a baby voice, cute "wakame doll" style subtitles are displayed.
For example, the semantic features of the user's speech are analyzed: if the user says "I love you", flashing hearts appear on the picture or the subtitles; if the user says "good night", a mask layer over the emoticon gradually darkens to black to create a lights-off effect.
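The keyword-to-effect selection in the examples above can be sketched as a small rule table. The trigger phrases and effect names below are illustrative assumptions; a real system might instead use a trained sentiment or intent classifier.

```python
# Ordered rules: the first rule whose trigger phrase occurs in the
# transcript wins. Phrases and effect names are hypothetical.
EFFECT_RULES = [
    ({"love you", "love"}, "flashing_hearts"),
    ({"good night"}, "fade_to_black_mask"),
    ({"happy", "great"}, "confetti"),
]

def pick_semantic_effect(transcript):
    """Return the first display effect whose trigger phrase occurs in the
    recognized text, or None if no rule matches."""
    text = transcript.lower()
    for phrases, effect in EFFECT_RULES:
        if any(p in text for p in phrases):
            return effect
    return None

print(pick_semantic_effect("Good night, see you tomorrow"))  # fade_to_black_mask
print(pick_semantic_effect("I love you"))                    # flashing_hearts
```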
Preferably, the special effect determining device 3 determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information and in combination with the picture features of the picture.
In particular, the special effect determination apparatus 3 may further analyze the picture to determine picture characteristics of the picture, wherein the picture characteristics include, but are not limited to, picture name, picture description, picture color, dynamic information (such as gif dynamic picture), picture content, and the like.
Then, the special effect determining device 3 determines the display special effect corresponding to the picture and/or the subtitle information by comprehensively considering the above factors according to the voice feature and/or the semantic feature of the voice information and according to the picture feature of the picture.
For example, if the picture feature is that the picture's colors are dark and the user's voice feature is a light, brisk tone, the determined display special effect may be: the subtitles are shown in a bouncing form and given a light-colored outline, and the like.
For another example, if the picture features indicate that the picture already contains a heart-shaped pattern, then when the user says "love you", "flashing hearts appearing on the picture or subtitles" is no longer used as the display special effect; instead, a rose is added to the picture.
Preferably, the special effect determining device 3 determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information, in combination with the voice length of the voice information.
Specifically, the voice length is the duration of the voice information. The special effect determining device 3 may determine a display special effect that fits the voice features and/or semantic features while taking the voice length into account.
For example, if the voice length is 3 seconds while one loop of a dynamic display special effect requires 5 seconds, that display special effect is not used; whereas if one loop of the dynamic display special effect requires 3 seconds, the voice features and/or semantic features may be considered to decide whether to select that display special effect.
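The length check just described amounts to a simple eligibility filter over candidate effects. A minimal sketch, assuming each candidate effect carries a known loop duration in seconds (static effects can be modeled with a loop duration of 0):

```python
def filter_effects_by_length(voice_length_s, candidates):
    """Keep only effects whose single loop fits within the voice duration.

    candidates: list of (effect_name, loop_duration_s) pairs.
    """
    return [name for name, loop_s in candidates if loop_s <= voice_length_s]
```

With a 3-second voice, an effect that loops in 5 seconds is dropped, while one that loops in 3 seconds remains a candidate for the feature-based selection, matching the example above.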
The video generating device 4 then generates the short video from the picture and the voice information according to the subtitle information and the display special effect.
Specifically, the video generating device 4 generates a short video containing voice and images from the picture and the voice information, and adds the subtitle information and the display special effect to the short video. For example, if the picture is a moving picture, a short video containing the subtitle information and the display special effect can be generated, with the playing of the picture kept in step with the progress of the voice; if the picture is a still picture, a short video containing the subtitle information and the display special effect can be generated in which the picture serves as the background, the subtitle information and the display special effect are the dynamic parts, and the dynamic parts advance along with the progress of the voice.
The short video may be saved, collected, transmitted, etc.
Preferably, the generating device further comprises a history obtaining device (not shown). The history obtaining device obtains one or more pieces of historical voice information of the user and determines a user voice feature library corresponding to the user; the special effect determining device 3 then determines the voice features corresponding to the voice information according to the user voice feature library, and determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information.
Specifically, the history obtaining device obtains one or more pieces of historical voice information of the user by interacting directly with the user, or by interacting with other devices capable of providing the user's historical voice information.
Then, the history obtaining device builds a user voice feature library corresponding to the user from the historical voice information; for example, by analyzing a plurality of pieces of historical voice information statistically, the user's common tones, uncommon tones, rhythm, etc. are obtained, so as to build the user voice feature library corresponding to the user.
Then, according to the user voice feature library, the special effect determining device 3 may determine the voice features corresponding to the user's current voice information by matching or comparing the user's current voice against the user voice feature library.
Then, the special effect determining device 3 determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information.
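One way to picture the user voice feature library is as per-user statistics against which the current voice is compared, so that "loud" is judged relative to this user's own history. The sketch below uses pitch only and illustrative 20% thresholds; the library described in the text would also cover tone, timbre, and rhythm.

```python
from statistics import mean

def build_voice_library(history_pitches):
    """Condense historical voice information into per-user statistics."""
    return {"mean_pitch": mean(history_pitches),
            "max_pitch": max(history_pitches)}

def classify_current_voice(library, current_pitch):
    """Judge the current voice relative to the user's own library.
    The +/-20% thresholds are illustrative assumptions."""
    if current_pitch > 1.2 * library["mean_pitch"]:
        return "raised"
    if current_pitch < 0.8 * library["mean_pitch"]:
        return "lowered"
    return "normal"
```

A voice feature such as "suddenly raising the volume" thus becomes meaningful per user, rather than against a global threshold.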
Preferably, the generating device further comprises a conversion device (not shown) and an adding device (not shown). The conversion device converts the short video into one or more application-usable formats according to the relevant configuration information of the application corresponding to the short video, and the adding device adds the short video to the application in an application-usable format.
Specifically, the conversion device can determine the one or more application-usable formats required by the application according to the relevant configuration information of the application corresponding to the short video, and convert the short video into those formats. For example, if the application is an input method, the short video can be converted into a dynamic picture format to serve as a dynamic picture expression; if the application is a microblog, the short video can be converted into one or more usable video formats and sent as a short video.
The adding device then adds the short video in the application-usable format for subsequent invocation by the user.
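The configuration lookup can be as simple as a table keyed by target application. The application names and formats below are assumptions for illustration; the patent only requires that each application's configuration determine its usable formats.

```python
# Hypothetical per-application configuration.
APP_CONFIG = {
    "input_method": {"usable_formats": ["gif"]},       # dynamic picture expression
    "microblog": {"usable_formats": ["mp4", "webm"]},  # sendable short video
}

def target_formats(app_name):
    """Return the application-usable formats the short video must be
    converted into before it is added to the application."""
    return APP_CONFIG[app_name]["usable_formats"]
```

The actual transcoding into each format would then be delegated to a media pipeline; only the format decision is shown here.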
Fig. 2 shows a schematic diagram of a generation device for generating short videos according to a preferred embodiment of the present invention, wherein the generation device includes an acquisition device 1', a subtitle determination device 2', a special effect determination device 3', a video generation device 4', a related picture determination device 5', and a related video generation device 6'.
Specifically, the obtaining device 1' obtains one or more pictures and one or more pieces of voice information for the one or more pictures from a user; the subtitle determining device 2' determines subtitle information corresponding to the voice information according to the content of the voice information; the special effect determining device 3' determines a display special effect corresponding to the pictures and/or the subtitle information according to the voice features and/or semantic features of the voice information; the video generating device 4' generates a short video from the pictures and the voice information according to the subtitle information and the display special effect; the related picture determining device 5' determines one or more related pictures associated with the pictures according to the voice information and the pictures; and the related video generating device 6' generates one or more related short videos from the pictures, the related pictures, and the voice information according to the subtitle information and the display special effect.
The related picture determining device 5' determines one or more related pictures associated with the picture according to the voice information and the picture.
Specifically, the related picture determining device 5' determines one or more related pictures that are associated with the voice information and the picture in terms of content or features, according to the voice features and/or semantic features of the voice information, in combination with the picture features of the picture.
Wherein the voice features include, but are not limited to, tone, rhythm, timbre, etc.; the semantic features are meanings of the speech of the user. The picture features include, but are not limited to, picture name, picture description, picture color, motion information (e.g., gif motion picture), picture content, and the like.
The related picture may be associated with the voice information/the picture in content or in theme; alternatively, the related picture may be related to the picture in color tone, and so on.
For example, if the voice information is "Awesome!" and the picture is "applause", other pictures with the theme "awesome" may be recommended, such as "thumbs up", "cheering", etc.; alternatively, applause pictures featuring different characters, or otherwise similar pictures, may be selected, such as "tasky applause", "tasky rabbit", "AC funny", etc.
Preferably, the related picture determining device 5' may further select preferred related pictures from among the determined related pictures. In the above example, only the related pictures featuring the same character theme, such as "bosch drum palm" and "bosch zhuang zhan", may be selected as the preferred related pictures; alternatively, the related pictures with a similar color tone, such as pictures with the same background color or theme color, may be selected as the preferred related pictures.
The related video generating device 6' generates one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
Specifically, the related video generating device 6' may generate one or more related short videos directly from the picture, the related pictures, and the voice information, using the subtitle information and the display special effect already determined by the special effect determining device 3' for the picture. Alternatively, the related video generating device 6' may resend the related pictures, the voice information, and the picture to the subtitle determining device 2' (as shown in fig. 2) so that the display special effect is redetermined for this content; the method for determining the display special effect is the same as or similar to that of the corresponding device in fig. 1, and is therefore not repeated here.
Then, the related video generating device 6' generates one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
The related short video may correspond to "the subtitle information, the display special effect, one of the related pictures, and the voice information"; that is, the subtitle information, the display special effect, and the voice information are added to a single related picture to generate one related short video;
the related short video may also correspond to "the subtitle information, the display special effect, a plurality of the related pictures, and the voice information"; that is, the subtitle information, the display special effect, and the voice information are added to a plurality of related pictures, so that the plurality of related pictures can be played continuously to form a dynamic related short video;
the related short video may also correspond to "the subtitle information, the display special effect, the picture, one or more of the related pictures, and the voice information"; that is, the picture and one or more of the related pictures are taken together as the pictures to be processed, and the subtitle information, the display special effect, and the voice information are added to these pictures, so that the multiple pictures to be processed can be played continuously to form a dynamic related short video, and so on.
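The three composition variants just listed can be sketched as one dispatch function. Here a "video" is reduced to a dict naming its frame sequence plus its overlays; this is an illustration of the variants, not the patent's data model.

```python
def compose_related_videos(mode, picture, related_pictures,
                           voice, subtitle, effect):
    """Build related short videos in one of the three modes above."""
    overlays = {"voice": voice, "subtitle": subtitle, "effect": effect}
    if mode == "per_picture":    # one related short video per related picture
        return [{"frames": [p], **overlays} for p in related_pictures]
    if mode == "multi_picture":  # all related pictures played in sequence
        return [{"frames": list(related_pictures), **overlays}]
    if mode == "mixed":          # original picture plus related pictures
        return [{"frames": [picture, *related_pictures], **overlays}]
    raise ValueError(f"unknown mode: {mode}")
```

In "per_picture" mode each related picture yields its own short video carrying the same subtitles, effect, and voice; the other two modes concatenate pictures into a single continuously playing video.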
Preferably, the related picture determining device 5' includes a number determining unit (not shown) and an association determining unit (not shown), wherein the number determining unit determines the number of related pictures to be associated with the picture, and the association determining unit determines one or more related pictures associated with the picture based on the voice information, the picture, and the number of related pictures.
Specifically, the number determining unit determines the number of related pictures associated with the picture in a preset manner or, more preferably, based on one or more of the following manners:
- determining, from the voice length of the voice information, the number of related pictures associated with the picture: for example, if the voice length is 5 seconds, the number of related pictures is determined to be 5; if the voice length is 10 seconds, the number of related pictures is determined to be 10;
- determining, from the voice features of the voice information, the number of related pictures associated with the picture: for example, if the voice features indicate that 2 or more voice tone transitions (e.g., high pitch to low pitch, low pitch to high pitch, etc.) have occurred, the determined number of related pictures is increased;
- determining, from the semantic features of the voice information, the number of related pictures associated with the picture: for example, if the semantic features include a plurality of semantic keywords, different related pictures can be determined for different keywords, and the number of related pictures is increased accordingly.
The association determining unit determines one or more related pictures that are associated with the voice information and the picture in terms of content or features and that satisfy the required number of related pictures, based on the voice features and/or semantic features of the voice information and the picture features of the picture.
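Combining the three signals, the number determining unit might work as follows. The per-second base and the increments are illustrative assumptions drawn from the examples above, not weights specified by the patent.

```python
def related_picture_count(voice_length_s, tone_shifts, keywords):
    """Estimate how many related pictures to fetch for one voice message."""
    count = int(voice_length_s)          # base: one picture per second of voice
    if tone_shifts >= 2:                 # repeated high/low pitch transitions
        count += tone_shifts             # more pictures for livelier speech
    count += max(0, len(keywords) - 1)   # one extra picture per extra keyword
    return count
```

The association determining unit then selects that many pictures from the candidates that match the voice and picture features.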
Specifically, in step S1, the generating device obtains one or more pictures and one or more pieces of voice information for the one or more pictures from the user; in step S2, the generating device determines subtitle information corresponding to the voice information according to the content of the voice information; in step S3, the generating device determines a display special effect corresponding to the pictures and/or the subtitle information according to the voice features and/or semantic features of the voice information; and in step S4, the generating device generates a short video from the pictures and the voice information according to the subtitle information and the display special effect.
In step S1, the generating device obtains one or more pictures and one or more pieces of voice information for the one or more pictures from the user.
Specifically, in step S1, the generating device may obtain one or more pictures by invoking built-in default pictures, obtaining pictures provided by the user through uploading or shooting, searching for pictures on the network, downloading pictures, and so on.
In step S1, the generating device obtains one or more pieces of voice information for the one or more pictures from the user by real-time recording, invoking a historical recording, or the like. Here, it should be understood by those skilled in the art that one picture may correspond to one or more pieces of voice information, and one piece of voice information may also correspond to one or more pictures.
For example, the user selects a built-in picture and then presses a recording key to record a piece of voice information, so that this voice information corresponds to the picture; the user may then record another piece of voice information, so that both pieces of voice information correspond to the picture.
Or, for example, the user downloads two pictures from the web, then selects both at the same time and associates them with a piece of voice information, so that this voice information corresponds to both pictures at the same time.
In step S2, the generating device determines subtitle information corresponding to the voice information based on the content of the voice information.
Specifically, in step S2, the generating device recognizes the voice content of the voice information through speech recognition, and then determines the text corresponding to the recognized voice content as the subtitle information corresponding to the voice information.
Preferably, in step S2, the generating device may further determine, in combination with the length of the voice information, whether to divide the subtitle information into lines, and so on; in step S2, the generating device may also determine punctuation, line breaks, etc. in the subtitle information according to the voice content of the voice information in combination with voice features such as tone and rhythm; the generating device may further interact with the user to provide a proofreading input function, so that the user can proofread the subtitle information.
Preferably, in step S2, the generating device may itself analyze the voice features, voice length, and so on; it may also iteratively adjust the line breaks, punctuation, etc. of the subtitle information according to feedback on the voice features/semantic features/display special effect obtained from the execution result of step S3.
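The line-division step can be sketched as splitting the recognized text at detected pauses, with a fallback maximum line length. Both the pause offsets and the 15-character limit are assumed inputs here; in the text, the pauses would come from rhythm analysis of the voice information.

```python
def split_subtitle(text, pause_positions, max_line_chars=15):
    """Split recognized subtitle text into lines at pause offsets,
    then enforce a maximum line length."""
    lines, start = [], 0
    for pos in sorted(pause_positions):
        lines.append(text[start:pos].strip())
        start = pos
    lines.append(text[start:].strip())
    # Enforce the length limit on any remaining long line.
    final = []
    for line in lines:
        while len(line) > max_line_chars:
            final.append(line[:max_line_chars])
            line = line[max_line_chars:]
        final.append(line)
    return [l for l in final if l]
```

The resulting lines would then be timed against the voice and, per the feedback loop above, could be re-split if the chosen display special effect demands shorter lines.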
In step S3, the generating device determines a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information.
Specifically, in step S3, the generating device analyzes the voice information to determine the voice feature and/or semantic feature of the voice information.
The voice features include, but are not limited to, tone, rhythm, timbre, etc. For example, through waveform analysis of the voice information, the pitch changes and/or rhythm of the voice information are obtained; through analysis of the frequency spectrum and/or the speech spectrum of the voice information, the timbre of the voice information is obtained, such as rough, fine, deep, milky, crisp, and the like. Since the tone, rhythm, timbre, etc. of the voice information change constantly, changes in the user's tone, such as suddenly raising or lowering the volume, can also be determined from these changes.
The semantic features are the meaning of what the user says. For example, "I am happy" expresses a positive emotion, while "this is not good" expresses a negative emotion, and so on.
Then, in step S3, the generating device selects one or more of the preset special effects as the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information, or acquires the display special effect corresponding to the picture and/or the subtitle information by interacting with a server or other third-party devices.
The display special effect comprises a special effect acting on the picture, a special effect acting on the subtitle information, or a special effect acting on both the picture and the subtitle information. The display special effect includes, but is not limited to, a static effect and/or a dynamic effect. Static effects acting on the subtitle information include font, color, and the like; static effects acting on the picture include adding a decorative picture, adding decorative text, adding a picture texture, changing the picture color, and the like. Dynamic effects include, but are not limited to, fading, floating, blinking, and the like.
For example, if the user's voice features indicate that the user's voice alternates between loud and soft, the displayed subtitles grow and shrink as the volume rises and falls; if the voice features indicate that the user speaks in a childish, milky tone, the subtitles are displayed in a cute doll style.
For another example, the user's semantic features are analyzed: if the user says "love you", flashing hearts appear on the picture or the subtitles; if the user says "good night", the picture and a gradient mask layer gradually turn black to achieve a lights-off effect.
Preferably, in step S3, the generating device determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information, in combination with the picture features of the picture.
Specifically, in step S3, the generating device may further analyze the picture to determine its picture features, where the picture features include, but are not limited to, picture name, picture description, picture color, dynamic information (such as whether it is a gif moving picture), picture content, and the like.
Then, in step S3, the generating device determines the display special effect corresponding to the picture and/or the subtitle information by comprehensively considering the voice features and/or semantic features of the voice information together with the picture features of the picture.
For example, if the picture features indicate that the picture is dark in color while the user's voice features indicate a light, cheerful tone, the determined display special effect may be: the subtitles are shown in a bouncing form and given a light-colored outline, and so on.
For another example, if the picture features indicate that the picture already contains a heart-shaped pattern, then when the user says "love you", "flashing hearts appearing on the picture or subtitles" is no longer used as the display special effect; instead, a rose is added to the picture.
Preferably, in step S3, the generating device determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information, in combination with the voice length of the voice information.
Specifically, the voice length is the duration of the voice information. In step S3, the generating device may determine a display special effect that fits the voice features and/or semantic features while taking the voice length into account.
For example, if the voice length is 3 seconds while one loop of a dynamic display special effect requires 5 seconds, that display special effect is not used; whereas if one loop of the dynamic display special effect requires 3 seconds, the voice features and/or semantic features may be considered to decide whether to select that display special effect.
In step S4, the generating device generates a short video from the picture and the voice information according to the subtitle information and the display special effect.
Specifically, in step S4, the generating device generates a short video containing voice and images from the picture and the voice information, and adds the subtitle information and the display special effect to the short video. For example, if the picture is a moving picture, a short video containing the subtitle information and the display special effect can be generated, with the playing of the picture kept in step with the progress of the voice; if the picture is a still picture, a short video containing the subtitle information and the display special effect can be generated in which the picture serves as the background, the subtitle information and the display special effect are the dynamic parts, and the dynamic parts advance along with the progress of the voice.
The short video may be saved, collected, transmitted, etc.
Preferably, the method further includes step S7 (not shown). In step S7, the generating device obtains one or more pieces of historical voice information of the user and determines a user voice feature library corresponding to the user; in step S3, the generating device determines the voice features corresponding to the voice information according to the user voice feature library, and determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information.
Specifically, in step S7, the generating device obtains one or more pieces of historical voice information of the user by interacting directly with the user, or by interacting with other devices capable of providing the user's historical voice information.
Then, in step S7, the generating device builds a user voice feature library corresponding to the user from the historical voice information; for example, by analyzing a plurality of pieces of historical voice information statistically, the user's common tones, uncommon tones, timbre, rhythm, etc. are obtained, so as to build the user voice feature library corresponding to the user.
Then, according to the user voice feature library, the generating device may determine the voice features corresponding to the user's current voice information by matching or comparing the user's current voice against the user voice feature library.
Then, in step S3, the generating device determines the display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or semantic features of the voice information.
Preferably, the method further comprises step S8 (not shown) and step S9 (not shown). In step S8, the generating device converts the short video into one or more application-usable formats according to the relevant configuration information of the application corresponding to the short video; in step S9, the generating device adds the short video to the application in an application-usable format.
Specifically, in step S8, the generating device may determine the one or more application-usable formats required by the application according to the relevant configuration information of the application corresponding to the short video, and convert the short video into those formats. For example, if the application is an input method, the short video may be converted into a dynamic picture format to serve as a dynamic picture expression; if the application is a microblog, the short video may be converted into one or more usable video formats and sent as a short video.
Then, in step S9, the generating device adds the short video in the application-usable format for subsequent invocation by the user.
Fig. 4 shows a flow chart of a method for generating short videos according to a preferred embodiment of the present invention.
Specifically, in step S1', the generating device obtains one or more pictures and one or more pieces of voice information for the one or more pictures from the user; in step S2', the generating device determines subtitle information corresponding to the voice information according to the content of the voice information; in step S3', the generating device determines a display special effect corresponding to the pictures and/or the subtitle information according to the voice features and/or semantic features of the voice information; in step S4', the generating device generates a short video from the pictures and the voice information according to the subtitle information and the display special effect; in step S5', the generating device determines one or more related pictures associated with the pictures according to the voice information and the pictures; and in step S6', the generating device generates one or more related short videos from the pictures, the related pictures, and the voice information according to the subtitle information and the display special effect.
In step S5', the generating device determines one or more related pictures associated with the picture according to the voice information and the picture.
Specifically, in step S5', the generating device determines one or more related pictures that are associated with the voice information and the picture in terms of content or features, according to the voice features and/or semantic features of the voice information, in combination with the picture features of the picture.
Wherein the voice features include, but are not limited to, tone, rhythm, timbre, etc.; the semantic features are meanings of the speech of the user. The picture features include, but are not limited to, picture name, picture description, picture color, motion information (e.g., gif motion picture), picture content, and the like.
The related picture may be associated with the voice information/the picture in content or in theme; alternatively, the related picture may be related to the picture in color tone, and so on.
For example, if the voice information is "Awesome!" and the picture is "applause", other pictures with the theme "awesome" may be recommended, such as "thumbs up", "cheering", etc.; alternatively, applause pictures featuring different characters, or otherwise similar pictures, may be selected, such as "tasky applause", "tasky rabbit", "AC funny", etc.
Preferably, in step S5', the generating device may further select preferred related pictures from among the selected related pictures. In the above example, only the related pictures featuring the same character theme, such as "rabbit-base drum palm" and "rabbit-base praise", may be selected as the preferred related pictures; alternatively, the related pictures with a similar color tone, such as those having the same background color or theme color, may be selected as the preferred related pictures.
In step S6', the generating device generates one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
Specifically, in step S6', the generating device may generate one or more related short videos directly from the picture, the related pictures, and the voice information, using the subtitle information and the display special effect already determined for the picture in step S3'. Alternatively, in step S6', the generating device may re-execute step S2' (as shown in fig. 4) on the related pictures, the voice information, and the picture, so that the display special effect is redetermined for this content; the method for determining the display special effect is the same as or similar to that of the corresponding step in fig. 3, and is therefore not repeated here.
Then, in step S6', the generating device generates one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
The related short video may correspond to "the subtitle information, the display special effect, one of the related pictures, and the voice information"; that is, the subtitle information, the display special effect, and the voice information are added to a single related picture to generate one related short video;
the related short video may also correspond to "the subtitle information, the display special effect, a plurality of the related pictures, and the voice information"; that is, the subtitle information, the display special effect, and the voice information are added to a plurality of related pictures, so that the plurality of related pictures can be played continuously to form a dynamic related short video;
the related short video may also correspond to "the subtitle information, the display special effect, the picture, one or more of the related pictures, and the voice information"; that is, the picture and one or more of the related pictures are taken together as the pictures to be processed, and the subtitle information, the display special effect, and the voice information are added to these pictures, so that the multiple pictures to be processed can be played continuously to form a dynamic related short video, and so on.
Preferably, step S5' includes step S51' (not shown) and step S52' (not shown). In step S51', the generating device determines the number of related pictures to be associated with the picture, and in step S52', the generating device determines one or more related pictures associated with the picture according to the voice information, the picture, and the number of related pictures.
Specifically, in step S51', the generating device determines the number of related pictures associated with the picture in a preset manner or, more preferably, based on one or more of the following manners:
-determining, from the speech length of the speech information, a number of relevant pictures associated with the picture: for example, if the voice length is 5 seconds, the number of the related pictures is determined to be 5; if the voice length is 10 seconds, the number of the related pictures is determined to be 10;
-determining, from the speech features of the speech information, a number of relevant pictures associated with the picture: for example, if the voice feature indicates that 2 or more voice tone conversions (e.g., high pitch to low pitch, low pitch to high pitch, etc.) have occurred, the determined number of related pictures is increased;
-determining, from semantic features of the speech information, a number of relevant pictures associated with the picture: for example, if the semantic features indicate that the voice information contains a plurality of semantic keywords, different related pictures can be determined for different keywords, and the number of related pictures is increased accordingly.
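The three manners above can be combined into a single count. The following sketch is only an illustration of how such rules might interact; `related_picture_count` and its parameters are assumptions, not terms from the patent:

```python
def related_picture_count(voice_seconds, pitch_shifts, semantic_keywords):
    """Hypothetical combination of the three manners described above:
    the voice length sets a baseline, and tone conversions and extra
    semantic keywords each enlarge the set of related pictures."""
    count = max(1, round(voice_seconds))  # e.g. 5 s of voice -> 5 pictures
    if pitch_shifts >= 2:                 # 2 or more tone conversions: increase
        count += pitch_shifts
    count += max(0, len(set(semantic_keywords)) - 1)  # one extra per extra keyword
    return count
```

A real implementation would tune these weights; the point is only that each cue contributes monotonically to the count.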
In step S52', the generating device determines, based on the voice features and/or semantic features of the voice information and the picture features of the picture, one or more related pictures that meet the required number of related pictures and that are related to the voice information and the picture in terms of content or features.
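One way to realize such a selection, sketched here purely as an assumption (the patent does not specify a matching algorithm), is to score each candidate by its feature overlap with the picture and its tag overlap with the semantic keywords of the voice information, then keep the top entries up to the required number:

```python
def select_related(candidates, query_features, keywords, n):
    """candidates: list of (picture_id, feature_set, tag_set) tuples.
    Returns the ids of the n best-matching related pictures."""
    def score(item):
        _, features, tags = item
        # relatedness to the picture (features) and the voice information (tags)
        return len(features & query_features) + len(tags & set(keywords))
    ranked = sorted(candidates, key=score, reverse=True)
    return [pic_id for pic_id, _, _ in ranked[:n]]
```

A production system would use learned image and speech embeddings rather than literal set overlap; the set version only makes the ranking idea concrete.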
It should be noted that the present invention can be implemented in software and/or a combination of software and hardware, for example, using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In some embodiments, the software program of the present invention can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including associated data structures) can be stored in a computer-readable recording medium, for example, a RAM memory, a magnetic or optical drive, a floppy disk, or a similar device. Furthermore, some steps or functions of the present invention can be implemented in hardware, for example, as a circuit that cooperates with the processor to perform the steps or functions.
Furthermore, portions of the present invention may be implemented as a computer program product, such as computer program instructions, which, when executed by a computer, may invoke or provide methods and/or aspects in accordance with the present invention through the operation of the computer. The program instructions invoking the methods of the present invention may be stored on a fixed or removable recording medium, and/or transmitted via a data stream in a signal-bearing medium, and/or stored in a working memory of a computer device operating in accordance with the program instructions. An embodiment in accordance with the present invention herein includes an apparatus comprising a memory for storing the computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform the methods and/or aspects in accordance with the various embodiments of the present invention described above.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The scope of the invention is therefore indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.

Claims (17)

1. A method for generating a short video, wherein the method comprises the steps of:
a. acquiring one or more pictures and one or more pieces of voice information of the one or more pictures from a user;
x. determining one or more related pictures associated with the picture according to the voice information and the picture;
b. determining subtitle information corresponding to the voice information according to the content of the voice information;
c. determining a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information;
d. generating a short video from the picture and the voice information according to the subtitle information and the display special effect;
and generating one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
2. The method of claim 1, wherein the step c comprises:
-determining a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information and according to the picture features of the picture.
3. The method of claim 1, wherein the step c comprises:
determining a display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information and by combining the voice length of the voice information.
4. The method of any one of claims 1-3, wherein the step x comprises:
x1. determining the number of related pictures associated with the picture;
x2. determining one or more related pictures associated with the picture according to the voice information, the picture, and the number of related pictures.
5. The method of claim 4, wherein the step x1 includes at least any one of the following:
-determining the number of related pictures associated with the picture according to the voice length of the voice information;
-determining the number of related pictures associated with the picture according to the voice features of the voice information;
-determining the number of related pictures associated with the picture according to the semantic features of the voice information.
6. The method of any of claims 1-3, wherein the method further comprises:
-obtaining one or more pieces of historical voice information of the user, and determining a user voice feature library corresponding to the user;
wherein the step c comprises:
-determining voice features corresponding to the voice information from the user voice feature library;
-determining a presentation special effect corresponding to the picture and/or the subtitle information according to the voice feature and/or the semantic feature of the voice information.
7. The method of any of claims 1-3, wherein the method further comprises:
-unloading the short video into one or more application-usable formats according to the relevant configuration information of the application corresponding to the short video;
-adding the short video in the application-usable format.
8. The method of any one of claims 1-3, wherein the display special effect includes one or more dynamic effects.
9. A generating device for generating a short video, wherein the generating device comprises:
acquiring means for acquiring one or more pictures and one or more pieces of voice information of the one or more pictures from a user;
the subtitle determining device is used for determining subtitle information corresponding to the voice information according to the content of the voice information;
the special effect determining device is used for determining a display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information;
the video generating device is used for generating the short video from the picture and the voice information according to the subtitle information and the display special effect;
related picture determining means for determining one or more related pictures associated with the picture according to the voice information and the picture;
and the related video generating device is used for generating one or more related short videos from the picture, the related pictures, and the voice information according to the subtitle information and the display special effect.
10. The generation apparatus of claim 9, wherein the special effects determination apparatus is to:
-determining a display special effect corresponding to the picture and/or the subtitle information according to the voice features and/or the semantic features of the voice information and according to the picture features of the picture.
11. The generation apparatus of claim 9, wherein the special effects determination apparatus is to:
determining a display special effect corresponding to the picture and/or the subtitle information according to the voice characteristics and/or the semantic characteristics of the voice information and by combining the voice length of the voice information.
12. The generation apparatus according to any of claims 9-11, wherein the related picture determination means includes:
a number determination unit for determining a number of related pictures associated with the picture;
and the association determining unit is used for determining one or more related pictures associated with the picture according to the voice information, the picture, and the number of related pictures.
13. The generation apparatus according to claim 12, wherein the number determination unit is configured to perform at least any one of the following:
-determining the number of related pictures associated with the picture according to the voice length of the voice information;
-determining the number of related pictures associated with the picture according to the voice features of the voice information;
-determining the number of related pictures associated with the picture according to the semantic features of the voice information.
14. The generation apparatus of any of claims 9 to 11, wherein the generation apparatus further comprises:
history obtaining means for obtaining one or more pieces of historical voice information of the user and determining a user voice feature library corresponding to the user;
wherein the special effect determining means is configured to:
-determining voice features corresponding to the voice information from the user voice feature library;
-determining a presentation special effect corresponding to the picture and/or the subtitle information according to the voice feature and/or the semantic feature of the voice information.
15. The generation apparatus of any of claims 9 to 11, wherein the generation apparatus further comprises:
the unloading device is used for unloading the short video into one or more application-usable formats according to the relevant configuration information of the application corresponding to the short video;
adding means for adding the short video in the application-usable format.
16. The generation apparatus of any one of claims 9 to 11, wherein the display special effect includes one or more dynamic effects.
17. An input device comprising the generation apparatus of any one of claims 9 to 16.
CN201710278060.3A 2017-04-25 2017-04-25 method and device for generating short video and input equipment Active CN107172485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710278060.3A CN107172485B (en) 2017-04-25 2017-04-25 method and device for generating short video and input equipment


Publications (2)

Publication Number Publication Date
CN107172485A CN107172485A (en) 2017-09-15
CN107172485B true CN107172485B (en) 2020-01-31

Family

ID=59814040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710278060.3A Active CN107172485B (en) 2017-04-25 2017-04-25 method and device for generating short video and input equipment

Country Status (1)

Country Link
CN (1) CN107172485B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107743263B (en) * 2017-09-20 2020-12-04 北京奇虎科技有限公司 Video data real-time processing method and device and computing equipment
CN107705316A (en) * 2017-09-20 2018-02-16 北京奇虎科技有限公司 Image capture device Real-time Data Processing Method and device, computing device
CN107608589B (en) * 2017-09-20 2021-01-08 维沃移动通信有限公司 Video production method and mobile terminal
CN107613360A (en) * 2017-09-20 2018-01-19 北京奇虎科技有限公司 Video data real-time processing method and device, computing device
CN107592475A (en) * 2017-09-20 2018-01-16 北京奇虎科技有限公司 Video data handling procedure and device, computing device
CN107590817A (en) * 2017-09-20 2018-01-16 北京奇虎科技有限公司 Image capture device Real-time Data Processing Method and device, computing device
CN107633228A (en) * 2017-09-20 2018-01-26 北京奇虎科技有限公司 Video data handling procedure and device, computing device
CN107566914B (en) * 2017-10-23 2020-05-15 咪咕动漫有限公司 Bullet screen display control method, electronic equipment and storage medium
CN107943839A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture and word generation video
CN109033423A (en) * 2018-08-10 2018-12-18 北京搜狗科技发展有限公司 Simultaneous interpretation caption presentation method and device, intelligent meeting method, apparatus and system
CN109120992A (en) * 2018-09-13 2019-01-01 北京金山安全软件有限公司 Video generation method and device, electronic equipment and storage medium
CN109286726B (en) * 2018-10-25 2021-05-14 维沃移动通信有限公司 Content display method and terminal equipment
CN109361858A (en) * 2018-10-29 2019-02-19 北京小米移动软件有限公司 Obtain method, apparatus, electronic equipment and the storage medium of image
CN109257659A (en) * 2018-11-16 2019-01-22 北京微播视界科技有限公司 Subtitle adding method, device, electronic equipment and computer readable storage medium
CN109615682A (en) * 2018-12-07 2019-04-12 北京微播视界科技有限公司 Animation producing method, device, electronic equipment and computer readable storage medium
CN109729288A (en) * 2018-12-17 2019-05-07 广州城市职业学院 A kind of short video-generating device and method
CN110149549B (en) * 2019-02-26 2022-09-13 腾讯科技(深圳)有限公司 Information display method and device
CN111818279A (en) * 2019-04-12 2020-10-23 阿里巴巴集团控股有限公司 Subtitle generating method, display method and interaction method
CN114120992A (en) * 2020-09-01 2022-03-01 北京字节跳动网络技术有限公司 Method and device for generating video through voice, electronic equipment and computer readable medium
CN112579826A (en) * 2020-12-07 2021-03-30 北京字节跳动网络技术有限公司 Video display and processing method, device, system, equipment and medium
CN112800263A (en) * 2021-02-03 2021-05-14 上海艾麒信息科技股份有限公司 Video synthesis system, method and medium based on artificial intelligence
CN114242070B (en) * 2021-12-20 2023-03-24 阿里巴巴(中国)有限公司 Video generation method, device, equipment and storage medium
CN118055292A (en) * 2022-05-20 2024-05-17 北京百度网讯科技有限公司 Video processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104618875A (en) * 2015-02-06 2015-05-13 掌赢信息科技(上海)有限公司 Message sending method and electronic device
CN104703043A (en) * 2015-03-26 2015-06-10 努比亚技术有限公司 Video special effect adding method and device
CN106033418A (en) * 2015-03-10 2016-10-19 阿里巴巴集团控股有限公司 A voice adding method and device, a voice play method and device, a picture classifying method and device, and a picture search method and device
EP2997535A4 (en) * 2013-05-15 2016-11-02 Cj 4Dplex Co Ltd Method and system for providing 4d content production service and content production apparatus therefor
CN106156310A (en) * 2016-06-30 2016-11-23 努比亚技术有限公司 A kind of picture processing apparatus and method



Similar Documents

Publication Publication Date Title
CN107172485B (en) method and device for generating short video and input equipment
JP6889281B2 (en) Analyzing electronic conversations for presentations in alternative interfaces
KR20160108348A (en) Digital personal assistant interaction with impersonations and rich multimedia in responses
CN107040452B (en) Information processing method and device and computer readable storage medium
KR101628050B1 (en) Animation system for reproducing text base data by animation
CN111476871A (en) Method and apparatus for generating video
EP2928572A1 (en) Visual content modification for distributed story reading
EP2929427A2 (en) Speech modification for distributed story reading
EP2928573A1 (en) Location based augmentation for story reading
WO2020150693A1 (en) Systems and methods for generating personalized videos with customized text messages
KR20180070340A (en) System and method for composing music by using artificial intelligence
CN113365134A (en) Audio sharing method, device, equipment and medium
CN113923462A (en) Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
KR102673676B1 (en) Inserting advertisements into videos within messaging systems
US20200234483A1 (en) Systems and methods for generating personalized videos with customized text messages
CN114880062B (en) Chat expression display method, device, electronic device and storage medium
CN112988100A (en) Video playing method and device
CN114567693B (en) Video generation method and device and electronic equipment
CN113822972A (en) Video-based processing method, device and readable medium
US11057332B2 (en) Augmented expression sticker control and management
CN110830845A (en) Video generation method and device and terminal equipment
CN112492400B (en) Interaction method, device, equipment, communication method and shooting method
CN110798393B (en) Voiceprint bubble display method and terminal using voiceprint bubbles
CN109408757A (en) Question and answer content share method, device, terminal device and computer storage medium
CN114422824A (en) Data processing method, video processing method, display method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant