CN114120992A - Method and device for generating video through voice, electronic equipment and computer readable medium

Info

Publication number
CN114120992A
CN114120992A (application CN202010906384.9A)
Authority
CN
China
Prior art keywords: voice, video, type, information, generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010906384.9A
Other languages
Chinese (zh)
Inventor
付平非
王宇嘉
杨乐
潘世光
杨杰豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010906384.9A
Publication of CN114120992A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1822 - Parsing for meaning understanding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a method and device for generating a video by voice, an electronic device, and a computer-readable medium, and relates to the technical field of video production. A method of generating a video by voice includes: acquiring a voice function start instruction to start the voice function corresponding to the voice function start instruction; acquiring voice information; recognizing the semantics corresponding to the voice information according to the voice information; and generating a first type video according to the semantics. According to this technical scheme, voice information is acquired only after the voice function start instruction is acquired, so the corresponding function is not started when no video needs to be generated. The video is generated automatically according to the semantics, so the generated video meets the user's requirements; the user needs neither to search for video material manually nor to edit the video manually, so video generation is convenient, fast, and efficient.

Description

Method and device for generating video through voice, electronic equipment and computer readable medium
Technical Field
The embodiment of the disclosure relates to the technical field of video production, in particular to a method and a device for generating a video by voice, an electronic device and a computer readable medium.
Background
With the development of the internet and intelligent terminals, more and more users use their terminals to make videos and share them with others to gain attention, clicks, or followers. For example, on a short-video platform, a user may publish a short video for others to view.
Before a video can be published, it must be produced. In the prior art, the user produces the video manually, for example by shooting footage by hand or by searching for video material and then editing it. Producing a video this way takes a long time, so production efficiency is low.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, a method for generating a video by speech is provided, the method comprising:
acquiring a voice function starting instruction to start a voice function corresponding to the voice function starting instruction;
acquiring voice information;
recognizing the semantics corresponding to the voice information according to the voice information;
and generating a first type video according to the semantics.
In a second aspect, there is also provided an apparatus for speech generating video, the apparatus comprising:
the instruction acquisition module is used for acquiring a voice function starting instruction so as to start a voice function corresponding to the voice function starting instruction;
the voice acquisition module is used for acquiring voice information;
the voice recognition module is used for recognizing the semantics corresponding to the voice information according to the voice information;
and the video generation module is used for generating a first type video according to the semantics.
In a third aspect, an electronic device is also provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the method of generating a video by voice shown in the first aspect of the present disclosure.
In a fourth aspect, there is also provided a computer readable medium, on which a computer program is stored, which program, when executed by a processor, implements the method of speech generating video shown in the first aspect of the present disclosure.
Compared with the prior art, in the method, device, electronic device, and computer-readable medium for generating a video by voice provided by the embodiments of the disclosure, voice information is acquired only after the voice function start instruction is acquired, so the corresponding function is not started when no video needs to be generated. The video is generated automatically according to the semantics: it relates not only to keywords in the voice information but to the overall semantics corresponding to the voice information, so the generated video matches the usage scenario expressed by the user's semantics and satisfies the several aspects of the user's request. The user needs neither to search for video material manually nor to edit the video manually, so video generation is convenient, fast, and efficient.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a method for generating a video by using speech according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a position of a display interface of a virtual control a in a terminal according to an embodiment of the present disclosure;
FIG. 3 is a detailed flowchart of step S104 in FIG. 1;
FIG. 4 is a schematic diagram of an interface for displaying voice prompt information according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart of a method for generating a video by using speech according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an apparatus for generating a video by speech according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device for generating a video by using speech according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are used only to distinguish devices, modules, or units, and are not intended to limit them to different devices, modules, or units, nor to limit the order or interdependence of the functions they perform.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure provides a method, an apparatus, an electronic device and a medium for generating a video by using speech, which aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present disclosure and how to solve the above technical problems in specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.
As will be appreciated by those skilled in the art, the "terminal" used in the present disclosure may be a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), or the like.
Referring to fig. 1, an embodiment of the present disclosure provides a method for generating a video by using voice, where the method for generating a video by using voice is applicable to a terminal, and the method includes:
step S101: and acquiring a voice function starting instruction so as to start the voice function corresponding to the voice function starting instruction.
Referring to fig. 2, the voice function start instruction may be triggered by the user operating the terminal. Specifically, the voice function start instruction may be triggered by the user pressing a preset virtual control for a preset duration. The preset duration is not limited; for example, it may be 1 s, 2 s, or 2.5 s. When the preset duration is 2 s, the terminal acquires the voice function start instruction if the user presses the virtual control for 2 s or longer. For example, in an app for shooting short videos, a virtual control A is arranged at the lower middle of the app's display interface, and the user triggers the voice function start instruction by pressing the virtual control A for longer than the preset duration. After the terminal starts the voice function, the terminal can acquire voice information.
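Purely as an illustration (the disclosure does not prescribe an implementation), such a long-press trigger reduces to comparing press and release timestamps against the preset duration. The following Python sketch uses hypothetical class and method names:

    import time

    PRESET_DURATION_S = 2.0  # the preset duration from the example above

    class VirtualControlA:
        """Minimal sketch of a long-press trigger for the virtual control A."""

        def __init__(self):
            self._pressed_at = None

        def on_press(self):
            # Record the moment the user's finger lands on the control.
            self._pressed_at = time.monotonic()

        def on_release(self):
            # Emit the voice function start instruction only if the press
            # lasted at least the preset duration (2 s in the example).
            if self._pressed_at is None:
                return
            if time.monotonic() - self._pressed_at >= PRESET_DURATION_S:
                self.emit_voice_function_start_instruction()
            self._pressed_at = None

        def emit_voice_function_start_instruction(self):
            # Hypothetical hook; a real app would start listening here.
            print("voice function started, ready to acquire voice information")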
Step S102: and acquiring voice information.
After the terminal starts the voice function, if the user inputs voice information, that is, if the user speaks, the terminal can acquire the voice information corresponding to the utterance. If the user says "I want to make a check-in video of my trip to Chongqing", the terminal can acquire the voice information "I want to make a check-in video of my trip to Chongqing".
Step S103: and identifying the corresponding semantics of the voice information according to the voice information.
If the voice information is "I want to make a check-in video of my trip to Chongqing", the terminal can parse the voice information to obtain its semantics and determine what the user wants to do and what type of video is needed. Specific schemes for obtaining the semantics corresponding to voice information are prior art, and this disclosure does not describe them further.
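For illustration only, the recognized semantics can be pictured as an intent plus slots; the structure and the field names in this Python sketch are assumptions, not something the disclosure specifies:

    from dataclasses import dataclass, field

    @dataclass
    class Semantics:
        """Hypothetical result of step S103 (semantic recognition)."""
        intent: str
        slots: dict = field(default_factory=dict)

    # A plausible parse of "I want to make a check-in video of my trip to Chongqing":
    semantics = Semantics(
        intent="generate_first_type_video",
        slots={"event": "trip", "location": "Chongqing", "style": "check-in"},
    )

Downstream steps would read the slots to decide what the user wants to do and what type of video is needed.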
Step S104: and generating the first type video according to the semantics.
Generating a first type of video from semantics may include: and acquiring a video material corresponding to the semantics, and generating a first type of video according to the video material.
If the voice information is "I want to make a check-in video of my trip to Chongqing", then once the corresponding semantics are obtained, video material matching those semantics can be obtained locally on the terminal: for this sentence, photos from the trip to Chongqing plus audio preset locally on the terminal. When the user takes a photo, the photo can carry basic information, for example the location and the time at which it was taken. The terminal reads this basic information, takes the photos whose basic information matches the semantics of "I want to make a check-in video of my trip to Chongqing" as video material, obtains a preset audio track as further video material, and generates the video from the photos and the audio. There may be multiple preset audio tracks; when one is selected as video material, it may be chosen at random, or an audio track matching the semantics may be chosen. Specifically, if the user's mood is judged from the semantics to be cheerful, a cheerful audio track is selected as the video material.
In the method for generating a video by voice provided by the embodiments of the disclosure, voice information is acquired only after the voice function start instruction is acquired, so the corresponding function is not started when no video needs to be generated. The video is generated automatically according to the semantics: it relates not only to keywords in the voice information but to the overall semantics corresponding to the voice information, so the generated video matches the usage scenario expressed by the user's semantics and satisfies the several aspects of the user's request. The user needs neither to search for video material manually nor to edit the video manually, so video generation is convenient, fast, and efficient.
Referring to fig. 3, optionally, generating a first type of video according to semantics includes:
s301: and determining the video dimensionality of the generated video according to the semantics, wherein the video dimensionality comprises at least two of time, location, people, scenic spots, emotion, voice-to-text effect, audio, scene and special effect.
In the present disclosure, video temperatures include, but are not limited to including, time, location, people, objects, sights, emotions, voice-to-text effects, audio, scenes, special effects.
The time, the place and the person can be the time, the place and the person corresponding to the video material, and the object can be an object of the video material, such as a photo. The voice-to-text effect may be a letter effect in the produced video. The scene may be a scene of a material, and if the object is a photograph and the scene is a grassland, the photograph corresponding to the grassland is obtained. The special effect may be an effect added to the video, such as using a filter. After the voice corresponding to the voice information is obtained, the dimensionality of the video can be determined. If the voice information is 'help me make a photo of a cat shot yesterday as a photo movie', after the semantics of the voice information are determined, the dimensions of the video including time 'yesterday', a person 'cat' and an object 'photo' can be determined according to the voice.
S302: and acquiring video materials according to the video dimensionality.
The video material is pre-stored in the terminal. It can be collected or generated in the user's daily life, for example photos the user has taken. The type of the video material is not limited and may include, for example, photos, audio, and text. The video material carries preset basic information, and this basic information corresponds to the video dimensions. After the video dimensions are determined, the video material can be obtained accordingly. If the video dimensions include the time "yesterday", the person "cat", and the object "photo", the photos of the cat taken yesterday are acquired as video material; that is, the video material corresponding to the video dimensions is obtained by matching the video dimensions against the basic information of the material.
Optionally, preset video material, such as preset audio, may also be obtained. There may be multiple preset audio tracks; when one is selected as video material, it may be chosen at random, or an audio track matching the semantics may be chosen.
S303: and generating a first type video according to the video material and the semantics.
After the video material is obtained, the first type video can be generated according to the video material and the semantics.
If the voice information is "help me make yesterday's photos into a photo movie and add special effect A to my image", then the video dimensions time "yesterday", person "I", and object "photo" are obtained, the corresponding material, namely the user's photos from yesterday, is acquired according to the video dimensions, and in the generated video the special effect A is also added to the image of "I".
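As a sketch of steps S301 to S303 (the disclosure leaves the matching scheme open), material lookup can be a filter over each item's preset basic information; every name in this Python example is an assumption, and the dims dict could be the slots from the earlier sketch:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Material:
        """A locally stored material item with its preset basic information."""
        kind: str                    # "photo", "audio", or "text"
        taken_on: date | None = None
        location: str | None = None
        subject: str | None = None

    def determine_dimensions(semantics: dict) -> dict:
        # Step S301: keep only the dimensions the semantics actually mention.
        wanted = ("time", "location", "people", "objects", "scene", "emotion")
        return {k: v for k, v in semantics.items() if k in wanted}

    def acquire_material(library: list[Material], dims: dict) -> list[Material]:
        # Step S302: a photo matches when every requested dimension agrees
        # with its basic information.
        def matches(m: Material) -> bool:
            if "time" in dims and m.taken_on != dims["time"]:
                return False
            if "location" in dims and m.location != dims["location"]:
                return False
            if "people" in dims and m.subject != dims["people"]:
                return False
            return True
        return [m for m in library if m.kind == "photo" and matches(m)]

Step S303 would then hand the filtered photos, together with a chosen audio track, to the video composer.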
According to the scheme of the embodiment of the disclosure, at least two video dimensions are considered when the video is generated, and the generated video meets the requirements of users.
Optionally, after the voice function start instruction is obtained, the method for generating a video by voice further includes:
and displaying voice prompt information, wherein the voice prompt information is used for prompting a user to input voice information according to the voice prompt information.
Referring to fig. 4, after viewing the voice prompt information, the user can input voice information according to it. The voice prompt information provides a reference that prompts the user to input voice information in a preset format; that is, it can prompt the user to phrase the voice information by imitating the format of the prompt. When the user inputs voice information following the prompt, the terminal can recognize the voice information more easily, distinguish the semantics corresponding to it, and quickly obtain the corresponding video material. For example, the voice prompt information may read "I want to make a check-in video of my trip to Chongqing", and the user may input the voice information "I want to make a check-in video of my trip to Beijing"; if the user does not follow the prompt but speaks at will, the corresponding first type video may not be generable from the voice information.
The specific content of the voice prompt information is not limited. Optionally, the voice prompt information may include multiple kinds of prompt information that cause the terminal to perform different functions. Optionally, the voice prompt information includes first type information and second type information, where the first type information is used for generating the first type video and the second type information is used for implementing other functions.
Optionally, the voice prompt information includes first type information, second type information and third type information, and the semantics corresponding to the voice information are identified according to the voice information; generating a first type of video according to the semantics, comprising:
when the type of the voice information is the first type voice, recognizing the semantic meaning corresponding to the voice information according to the voice information; and generating a first type video according to the semantics.
That is: a voice function start instruction is acquired to start the voice function corresponding to it, and voice information is acquired; voice prompt information is displayed to prompt the user to input voice information according to it; and when the type of the voice information is the first type voice, steps S103 and S104 of the present disclosure are performed to generate the first type video.
Referring to fig. 5, after acquiring the voice information, the method for generating a video by voice further includes:
s501: and determining the type of the voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice.
The language model corresponds to the voice prompt information and can be obtained by training on a plurality of preset voice samples. The first type voice and the second type voice each correspond to a plurality of voice samples, and the language model is trained on the voice samples corresponding to the first type voice and the second type voice. According to the preset language model corresponding to the voice prompt information, the voice information can be determined to be the first type voice or the second type voice; when the voice information belongs to neither, it is determined to be the third type voice.
Optionally, determining the type of the voice information according to the preset language model corresponding to the voice prompt information includes:
determining the voice information as a first type voice or a second type voice according to the preset language model corresponding to the voice prompt information;
and when the voice information belongs to neither the first type voice nor the second type voice, determining the voice information as a third type voice.
When the type of the voice information is a first type voice, recognizing the semantics corresponding to the voice information according to the voice information; and generating a first type video according to the semantics.
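Again for illustration only, the three-way decision and the dispatch to the branches described in this section (steps S103/S104, and steps S502 and S503 below) might look as follows in Python; the score() API and the threshold are assumptions about the preset language model:

    FIRST, SECOND, THIRD = "first", "second", "third"

    def classify(voice_info, model, threshold=0.5) -> str:
        # The preset language model scores how well the utterance matches the
        # voice samples of the first and second type voices (hypothetical API).
        s1 = model.score(voice_info, FIRST)
        s2 = model.score(voice_info, SECOND)
        if max(s1, s2) < threshold:
            return THIRD  # belongs to neither known type
        return FIRST if s1 >= s2 else SECOND

    def dispatch(voice_info, model, handlers):
        # handlers maps a voice type to the corresponding action:
        # FIRST -> steps S103/S104, SECOND -> step S502, THIRD -> step S503.
        handlers[classify(voice_info, model)](voice_info)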
S502: and when the type of the voice information is the second type voice, starting the function of the corresponding voice information according to the voice information.
The second type voice is used to open the function corresponding to the voice information. If the voice information is "I want to try the latest, hottest photo album", the terminal opens the album mode and pins the latest, hottest album to the top so the user can watch a preview of it; if the voice information is "I want to try the latest prop", the terminal opens the front camera together with the latest prop so the user can take a selfie with it. The specific voice information corresponding to the second type voice is not limited, provided the terminal can open the function corresponding to it. It can be understood that after the terminal opens the function, the user can either operate the terminal manually to use it or continue speaking so that the terminal uses the function according to the further voice information. How the user of the terminal uses the corresponding function is not limited in this disclosure.
S503: and when the type of the voice information is the third type voice, generating a second type video according to the voice information.
Optionally, generating the second type of video according to the voice information may include:
acquiring the text corresponding to the voice information;
and generating the second type video according to the text and a preset audio corresponding to the third type voice.
After the voice information is obtained, the text corresponding to it can be obtained through analysis. If the voice information corresponding to the third type voice is "I am happy today", the text "I am happy today" corresponding to the voice information can be obtained.
The third type voice may be preset to correspond to one or more audio tracks, i.e. pieces of music. When the second type video is generated, one of the audio tracks corresponding to the third type may be selected as video material. It can be understood that when the second type video is played, the user can see the text of the spoken voice information and hear the corresponding audio.
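A minimal Python sketch of this variant, assuming hypothetical transcribe() and render_video() helpers that the disclosure does not name:

    import random

    # Hypothetical preset audio tracks bound to the third type voice.
    THIRD_TYPE_AUDIO = ["preset_upbeat.mp3", "preset_calm.mp3"]

    def generate_second_type_video(voice_info, transcribe, render_video):
        text = transcribe(voice_info)            # e.g. "I am happy today"
        audio = random.choice(THIRD_TYPE_AUDIO)  # or pick one matching the mood
        # Show the spoken text on screen while the chosen audio plays.
        return render_video(captions=[text], audio_path=audio)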
Optionally, generating the second type of video according to the voice information may include:
generating a dynamic waveform corresponding to the frequency of the voice information according to the voice information;
acquiring the avatar of the user's preset account;
and generating the second type video according to the dynamic waveform, the avatar, and a preset audio corresponding to the third type voice.
The dynamic waveform is a dynamic curve that changes following the frequency of the voice information. The voice information is formed from the user's speech, which has a frequency, so a dynamic waveform can be generated from it.
The preset account is the account with which the user is logged into the application on the terminal. Optionally, the scheme of the present disclosure is executed by an application running on the terminal. The user can have an account for the application; the avatar corresponding to the account is displayed when the user logs in, and the terminal can acquire the avatar of the user's account.
When the video is generated, the second type video is generated according to the dynamic waveform, the avatar, and the preset audio corresponding to the third type voice. It can be understood that when this second type video is played, the user can see the dynamic waveform of the spoken voice information and the avatar of the user's account, and can hear the corresponding audio.
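For illustration only (the disclosure fixes no algorithm), the dynamic waveform can be driven by per-window peak amplitudes of the recorded speech. This Python sketch assumes the voice information is available as 16-bit mono PCM in a WAV file:

    import struct
    import wave

    def waveform_points(wav_path: str, windows: int = 100) -> list[float]:
        """Reduce recorded speech to about `windows` peak amplitudes in [0, 1],
        one per time slice, to drive an animated waveform curve."""
        with wave.open(wav_path, "rb") as wav:
            n = wav.getnframes()
            raw = wav.readframes(n)
        samples = struct.unpack(f"<{n}h", raw)  # 16-bit mono PCM assumed
        step = max(1, len(samples) // windows)
        peaks = [
            max(abs(s) for s in samples[i:i + step])
            for i in range(0, len(samples), step)
        ]
        return [p / 32768 for p in peaks]

An animation layer would then plot these points frame by frame next to the account avatar while the preset audio plays.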
The above are merely examples of generating the second type video according to the third type voice; how the second type video is generated is not limited in the present disclosure.
The technical scheme of the present disclosure can determine the type of the voice information from the user's voice information so as to execute different functions, which diversifies the voice function; and because the type is determined according to the preset language model corresponding to the voice prompt information, the voice type is recognized quickly.
Referring to fig. 6, an embodiment of the present disclosure provides an apparatus 60 for generating a video by voice. The apparatus 60 can implement the method for generating a video by voice of the above embodiments, and may include:
the instruction obtaining module 601 is configured to obtain a voice function starting instruction to start a voice function corresponding to the voice function starting instruction;
a voice obtaining module 602, configured to obtain voice information;
the voice recognition module 603 is configured to recognize a semantic corresponding to the voice information according to the voice information;
a video generating module 604, configured to generate a first type of video according to semantics.
In the apparatus for generating a video by voice provided by the embodiments of the disclosure, voice information is acquired only after the voice function start instruction is acquired, so the corresponding function is not started when no video needs to be generated. The video is generated automatically according to the semantics: it relates not only to keywords in the voice information but to the overall semantics corresponding to the voice information, so the generated video matches the usage scenario expressed by the user's semantics and satisfies the several aspects of the user's request. The user needs neither to search for video material manually nor to edit the video manually, so video generation is convenient, fast, and efficient.
The video generation module 604 may include:
the dimension acquiring unit is used for determining, according to the semantics, the video dimensions of the video to be generated, wherein the video dimensions include at least two of time, location, people, objects, scenic spots, emotion, voice-to-text effect, audio, scene, and special effect;
the material acquisition unit is used for acquiring video materials according to video dimensions;
and the first video generation unit is used for generating a first type video according to the video material and the semantics.
The apparatus 60 for generating video by speech may further include:
and the prompt display module is used for displaying voice prompt information, and the voice prompt information is used for prompting a user to input voice information according to the voice prompt information.
The voice recognition module 603 is specifically configured to, when the type of the voice information is the first type of voice, recognize a semantic meaning corresponding to the voice information according to the voice information.
The apparatus 60 for generating video by speech may further include:
the voice type determining module is used for determining the type of the voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice;
the function starting module is used for starting the function of the corresponding voice information according to the voice information when the type of the voice information is the second type voice;
and the video module is used for generating a second type video according to the voice information when the type of the voice information is a third type voice.
The video module may include:
the text acquiring unit is used for acquiring the text corresponding to the voice information;
and the second video generation unit is used for generating the second type video according to the text and a preset audio corresponding to the third type voice.
Alternatively, the video module may include:
the waveform acquiring unit is used for generating, according to the voice information, a dynamic waveform corresponding to the frequency of the voice information;
the avatar acquiring unit is used for acquiring the avatar of the user's preset account;
and the third video generation unit is used for generating the second type video according to the dynamic waveform, the avatar, and the preset audio corresponding to the third type voice.
The voice type determining module may include:
the first determining unit is used for determining the voice information as a first type voice or a second type voice according to a preset language model corresponding to the voice prompt information;
and a second determining unit, configured to determine the voice information as a third type voice when it belongs to neither the first type voice nor the second type voice.
Referring to fig. 7, a schematic diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in the drawings is only an example and should not bring any limitation to the functions and use range of the embodiments of the present disclosure.
The electronic device includes a memory and a processor. The processor may be referred to below as the processing device 701, and the memory may include at least one of a read-only memory (ROM) 702, a random access memory (RAM) 703, and a storage device 708, as detailed below:
As shown, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in the read-only memory (ROM) 702 or a program loaded from the storage device 708 into the random access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While the figures illustrate an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a voice function starting instruction to start a voice function corresponding to the voice function starting instruction; acquiring voice information; recognizing the corresponding semantics of the voice information according to the voice information; and generating the first type video according to the semantics.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules or units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module or a unit does not in some cases constitute a limitation of the unit itself, and for example, the instruction acquisition module may also be described as a "module that acquires external information".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, there is provided a method of speech generating a video, including:
acquiring a voice function starting instruction to start a voice function corresponding to the voice function starting instruction;
acquiring voice information;
recognizing the corresponding semantics of the voice information according to the voice information;
and generating the first type video according to the semantics.
According to one or more embodiments of the present disclosure, generating a first type of video according to semantics includes:
determining video dimensions of a generated video according to semantics, wherein the video dimensions comprise at least two of time, location, people, objects, scenic spots, emotion, voice-to-text effect, audio, scene and special effect;
acquiring a video material according to video dimensions;
and generating a first type video according to the video material and the semantics.
According to one or more embodiments of the present disclosure, after the voice function start instruction is acquired, the method further includes:
and displaying voice prompt information, wherein the voice prompt information is used for prompting a user to input voice information according to the voice prompt information.
According to one or more embodiments of the present disclosure, the voice prompt information includes first type information, second type information, and third type information; and when the type of the voice information is the first type voice, the semantics corresponding to the voice information are recognized according to the voice information.
According to one or more embodiments of the present disclosure, after acquiring the voice information, the method for generating a video by voice further includes:
determining the type of the voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice;
when the type of the voice information is a second type voice, starting the function of the corresponding voice information according to the voice information;
and when the type of the voice information is the third type voice, generating a second type video according to the voice information.
According to one or more embodiments of the present disclosure, generating a second type of video from voice information includes:
acquiring the text corresponding to the voice information;
generating a second type video according to the text and a preset audio corresponding to the third type voice; or
generating a second type video according to the voice information includes:
generating a dynamic waveform corresponding to the frequency of the voice information according to the voice information;
acquiring an avatar of a preset account of the user;
and generating a second type video according to the dynamic waveform, the avatar, and a preset audio corresponding to the third type voice.
According to one or more embodiments of the present disclosure, determining the type of the voice information according to the preset language model corresponding to the voice prompt information includes:
determining the voice information as a first type voice or a second type voice according to the preset language model corresponding to the voice prompt information;
and when the voice information belongs to neither the first type voice nor the second type voice, determining it as a third type voice.
According to one or more embodiments of the present disclosure, there is provided an apparatus for generating a video by speech, including:
the instruction acquisition module is used for acquiring a voice function starting instruction so as to start a voice function corresponding to the voice function starting instruction;
the voice acquisition module is used for acquiring voice information;
the voice recognition module is used for recognizing the corresponding semantics of the voice information according to the voice information;
and the video generation module is used for generating a first type video according to the semantics.
According to one or more embodiments of the present disclosure, a video generation module may include:
the dimension acquiring unit is used for determining, according to the semantics, the video dimensions of the video to be generated, wherein the video dimensions include at least two of time, location, people, objects, scenic spots, emotion, voice-to-text effect, audio, scene, and special effect;
the material acquisition unit is used for acquiring video materials according to video dimensions;
and the first video generation unit is used for generating a first type video according to the video material and the semantics.
According to one or more embodiments of the present disclosure, an apparatus for generating a video by speech may further include:
and the prompt display module is used for displaying voice prompt information, and the voice prompt information is used for prompting a user to input voice information according to the voice prompt information.
According to one or more embodiments of the present disclosure, the speech recognition module is specifically configured to, when the type of the speech information is the first type of speech, recognize a semantic meaning corresponding to the speech information according to the speech information.
According to one or more embodiments of the present disclosure, the apparatus for speech generating a video may further include:
the voice type determining module is used for determining the type of the voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice;
the function starting module is used for starting the function of the corresponding voice information according to the voice information when the type of the voice information is the second type voice;
and the video module is used for generating a second type video according to the voice information when the type of the voice information is a third type voice.
According to one or more embodiments of the present disclosure, a video module may include:
the text acquiring unit is used for acquiring the text corresponding to the voice information;
and the second video generation unit is used for generating the second type video according to the text and a preset audio corresponding to the third type voice.
Alternatively, the video module may include:
the waveform acquiring unit is used for generating, according to the voice information, a dynamic waveform corresponding to the frequency of the voice information;
the avatar acquiring unit is used for acquiring the avatar of the user's preset account;
and the third video generation unit is used for generating the second type video according to the dynamic waveform, the avatar, and the preset audio corresponding to the third type voice.
According to one or more embodiments of the present disclosure, the voice type determination module may include:
the first determining unit is used for determining the voice information as a first type voice or a second type voice according to a preset language model corresponding to the voice prompt information;
and a second determining unit, configured to determine the voice information as a third type voice when it belongs to neither the first type voice nor the second type voice.
According to one or more embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the method of generating a video by voice according to any of the above embodiments.
According to one or more embodiments of the present disclosure, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the method of speech generating video of any of the above-described embodiments.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example a technical solution formed by substituting the above features with (but not limited to) features with similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A method of generating video through voice, comprising:
acquiring a voice function starting instruction to start a voice function corresponding to the voice function starting instruction;
acquiring voice information;
recognizing the semantics corresponding to the voice information according to the voice information;
and generating a first type video according to the semantics.
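Read as a pipeline, claim 1 is four stages. The self-contained sketch below stubs each stage with a placeholder; every function name and body is invented, and a real system would plug in audio capture, ASR, semantic parsing and video synthesis.

```python
# Placeholder pipeline for the four steps of claim 1; all bodies are stubs.
def start_voice_function() -> None:
    print("voice function opened")             # step 1

def acquire_voice_information() -> bytes:
    return b"\x00" * 16000                     # step 2: stand-in for 1 s of PCM

def recognize_semantics(voice: bytes) -> dict:
    # step 3: a real system would run ASR and a semantic parser here
    return {"text": "a sunny day at the beach", "intent": "generate_video"}

def generate_first_type_video(semantics: dict) -> str:
    path = "first_type_video.mp4"              # step 4: video synthesis stub
    print(f"rendering video for: {semantics['text']} -> {path}")
    return path

def handle_voice_function_starting_instruction() -> str:
    start_voice_function()
    voice = acquire_voice_information()
    semantics = recognize_semantics(voice)
    return generate_first_type_video(semantics)

if __name__ == "__main__":
    handle_voice_function_starting_instruction()
```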
2. The method of generating video through voice of claim 1, wherein the generating of the first type video according to the semantics comprises:
determining video dimensions of the generated video according to the semantics, wherein the video dimensions comprise at least two of: time, location, people, objects, scenic spots, emotion, voice-to-text effect, audio, scene and special effect;
acquiring video materials according to the video dimensions;
and generating a first type video according to the video materials and the semantics.
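One way to picture claim 2: the parsed semantics are filtered down to the named video dimensions, which then serve as retrieval keys into a material library. The extraction and retrieval below are invented placeholders, not the patent's method.

```python
# The ten dimensions named in claim 2; extraction/retrieval are placeholders.
DIMENSIONS = ("time", "location", "people", "objects", "scenic_spots",
              "emotion", "voice_to_text_effect", "audio", "scene", "special_effect")

def extract_dimensions(semantics: dict) -> dict:
    dims = {k: v for k, v in semantics.items() if k in DIMENSIONS}
    if len(dims) < 2:
        raise ValueError("claim 2 expects at least two video dimensions")
    return dims

def fetch_materials(dims: dict) -> list:
    # stand-in for a material-library query keyed by dimension values
    return [f"{key}:{value}.clip" for key, value in dims.items()]

print(fetch_materials(extract_dimensions(
    {"time": "sunset", "location": "beach", "emotion": "calm"})))
```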
3. The method of generating video through voice of claim 1, wherein after the acquiring of the voice function starting instruction, the method further comprises:
and displaying voice prompt information, wherein the voice prompt information is used for prompting a user to input voice information according to the voice prompt information.
4. The method of generating video through voice of claim 3, wherein the voice prompt information comprises first type information, second type information and third type information;
the recognizing of the semantics corresponding to the voice information according to the voice information and the generating of the first type video according to the semantics comprise:
when the type of the voice information is the first type voice, recognizing the semantics corresponding to the voice information according to the voice information, and generating the first type video according to the semantics;
after the voice information is acquired, the method further comprises:
determining the type of the voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice;
when the type of the voice information is the second type voice, starting a function corresponding to the voice information according to the voice information;
and when the type of the voice information is the third type voice, generating a second type video according to the voice information.
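The branching in claim 4 amounts to a three-way dispatch on the determined type. The sketch below takes the type as an input (it could come from a classifier like the one sketched after the apparatus description above) and stubs each branch; all return values are placeholders.

```python
# Three-way dispatch over the voice types of claim 4; branch bodies are stubs.
def handle_voice(transcript: str, voice_type: str) -> str:
    if voice_type == "first":
        # recognize semantics and generate a first type video
        return f"first_type_video_for({transcript!r}).mp4"
    if voice_type == "second":
        # start the function the voice names
        return f"started function for: {transcript}"
    # third type voice: generate a second type video instead
    return "second_type_video.mp4"
```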
5. The method of generating video through voice of claim 4, wherein the generating of the second type video according to the voice information comprises:
acquiring text corresponding to the voice information;
generating a second type video according to the text and a preset audio corresponding to the third type voice; or
the generating of the second type video according to the voice information comprises:
generating a dynamic spectrum corresponding to the frequency of the voice information according to the voice information;
acquiring a head portrait of a preset account of a user;
and generating a second type video according to the dynamic spectrum, the head portrait and a preset audio corresponding to the third type voice.
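For the first branch of claim 5 (text plus preset audio), one conceivable realisation is to draw the recognised text onto a caption card and pair it with the preset track. The transcript, file names and the PIL/moviepy usage below are illustrative assumptions; the spectrum branch is sketched after the apparatus description above.

```python
# Hypothetical realisation of claim 5's text branch: caption card + preset audio.
import numpy as np
from PIL import Image, ImageDraw
from moviepy.editor import ImageClip, AudioFileClip

def text_card(text: str, size=(640, 360)) -> np.ndarray:
    img = Image.new("RGB", size, (20, 20, 30))
    ImageDraw.Draw(img).text((40, size[1] // 2), text, fill=(240, 240, 240))
    return np.array(img)

transcript = "hello world"                    # stand-in for the recognised text
bgm = AudioFileClip("preset_third_type.mp3")  # hypothetical preset audio
clip = ImageClip(text_card(transcript)).set_duration(bgm.duration).set_audio(bgm)
clip.write_videofile("second_type_video.mp4", fps=24)
```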
6. The method of generating video through voice of claim 4, wherein the determining of the type of the voice information according to the preset language model corresponding to the voice prompt information comprises:
determining the voice information to be a first type voice or a second type voice according to the preset language model corresponding to the voice prompt information;
and when the voice information belongs to neither the first type voice nor the second type voice, determining the voice information to be a third type voice.
7. The method of generating video through voice of claim 1, wherein the voice function starting instruction is triggered by a user pressing and holding a preset virtual control for a preset duration.
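Claim 7's trigger is essentially a hold-duration check on the virtual control. A minimal sketch follows; the 0.8 s threshold and the class shape are invented values for illustration.

```python
# Hold-to-trigger check for a virtual control; threshold is illustrative.
import time

PRESET_HOLD_SECONDS = 0.8

class VirtualControl:
    def __init__(self):
        self._pressed_at = None

    def on_press(self) -> None:
        self._pressed_at = time.monotonic()

    def on_release(self) -> bool:
        """True if the hold was long enough to issue the starting instruction."""
        held = time.monotonic() - (self._pressed_at or time.monotonic())
        self._pressed_at = None
        return held >= PRESET_HOLD_SECONDS
```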
8. An apparatus for generating video through voice, comprising:
the instruction acquisition module is used for acquiring a voice function starting instruction so as to start a voice function corresponding to the voice function starting instruction;
the voice acquisition module is used for acquiring voice information;
the voice recognition module is used for recognizing the semantics corresponding to the voice information according to the voice information;
and the video generation module is used for generating a first type video according to the semantics.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs configured to perform the method of generating video through voice according to any one of claims 1 to 7.
10. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the method of generating video through voice according to any one of claims 1 to 7.
CN202010906384.9A 2020-09-01 2020-09-01 Method and device for generating video through voice, electronic equipment and computer readable medium Pending CN114120992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010906384.9A CN114120992A (en) 2020-09-01 2020-09-01 Method and device for generating video through voice, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN114120992A 2022-03-01

Family

ID=80360458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010906384.9A Pending CN114120992A (en) 2020-09-01 2020-09-01 Method and device for generating video through voice, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN114120992A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101577114A (en) * 2009-06-18 2009-11-11 Beijing Vimicro Electronics Co., Ltd. Method and device for implementing audio visualization
CN103366742A (en) * 2012-03-31 2013-10-23 Shengle Information Technology (Shanghai) Co., Ltd. Voice input method and system
CN106328164A (en) * 2016-08-30 2017-01-11 Shanghai University Ring-shaped visualized system and method for music spectra
CN107172485A (en) * 2017-04-25 2017-09-15 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for generating short videos
CN109741738A (en) * 2018-12-10 2019-05-10 Ping An Technology (Shenzhen) Co., Ltd. Voice control method, device, computer equipment and storage medium
CN109889849A (en) * 2019-01-30 2019-06-14 Beijing SenseTime Technology Development Co., Ltd. Video generation method, device, medium and equipment
CN110580912A (en) * 2019-10-21 2019-12-17 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Music visualization method, device and system
CN111462736A (en) * 2019-01-17 2020-07-28 Beijing ByteDance Network Technology Co., Ltd. Image generation method and device based on voice and electronic equipment

Similar Documents

Publication Publication Date Title
CN110971969B (en) Video dubbing method and device, electronic equipment and computer readable storage medium
RU2640632C2 (en) Method and device for delivery of information
KR20220103110A (en) Video generating apparatus and method, electronic device, and computer readable medium
CN112040330B (en) Video file processing method and device, electronic equipment and computer storage medium
CN111970571B (en) Video production method, device, equipment and storage medium
EP4024880A1 (en) Video generation method and apparatus, electronic device, and computer readable medium
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
WO2021088790A1 (en) Display style adjustment method and apparatus for target device
US11818491B2 (en) Image special effect configuration method, image recognition method, apparatus and electronic device
CN111629156A (en) Image special effect triggering method and device and hardware device
CN112291614A (en) Video generation method and device
CN112069360A (en) Music poster generation method and device, electronic equipment and medium
CN113886612A (en) Multimedia browsing method, device, equipment and medium
CN114584716A (en) Picture processing method, device, equipment and storage medium
CN113923390A (en) Video recording method, device, equipment and storage medium
CN112764600B (en) Resource processing method, device, storage medium and computer equipment
CN112243157A (en) Live broadcast control method and device, electronic equipment and computer readable medium
CN111767259A (en) Content sharing method and device, readable medium and electronic equipment
CN111327960A (en) Article processing method and device, electronic equipment and computer storage medium
CN114120992A (en) Method and device for generating video through voice, electronic equipment and computer readable medium
CN115269920A (en) Interaction method, interaction device, electronic equipment and storage medium
CN114398135A (en) Interaction method, interaction device, electronic device, storage medium, and program product
CN114528433A (en) Template selection method and device, electronic equipment and storage medium
CN110366002B (en) Video file synthesis method, system, medium and electronic device
CN113360704A (en) Voice playing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: Tiktok vision (Beijing) Co.,Ltd.