CN112885318A - Multimedia data generation method and device, electronic equipment and computer storage medium - Google Patents

Multimedia data generation method and device, electronic equipment and computer storage medium

Info

Publication number
CN112885318A
Authority
CN
China
Prior art keywords
audio
voice
syllable
information
accompaniment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911199131.6A
Other languages
Chinese (zh)
Inventor
邓俊祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201911199131.6A
Publication of CN112885318A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/40 Rhythm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

Embodiments of the invention provide a multimedia data generation method and apparatus, an electronic device, and a computer storage medium. The multimedia data generation method comprises the following steps: performing energy spectrum and difference spectrum analysis on the frequency spectrum of acquired human voice audio to be processed, and determining human voice syllable information in the human voice audio according to the analysis result; processing the human voice syllables indicated by the syllable information according to the beat information of accompaniment audio to generate target human voice audio matched with the accompaniment audio; and synthesizing the target human voice audio with the accompaniment audio to generate multimedia data. The embodiments of the invention improve applicability.

Description

Multimedia data generation method and device, electronic equipment and computer storage medium
Technical Field
Embodiments of the invention relate to the field of computer technology, and in particular to a multimedia data generation method and device, an electronic device, and a computer storage medium.
Background
With the continued development and maturation of internet technology, the types and functions of application programs have become increasingly rich: for example, instant messaging applications with which users communicate daily, playback applications that provide multimedia resources such as video, and so on. Karaoke applications, which include but are not limited to desktop PC applications, WeChat platform applets, applications for smart mobile devices, and web applications built on web technology, are popular with users because they allow songs to be recorded and published and offer social functions. However, in conventional karaoke applications, when a user generates multimedia data such as audio data, the user usually records his or her voice while the accompaniment is playing, and the recorded voice and the accompaniment are then synthesized according to time stamps to generate a song.
This makes the way in which the application generates songs rather limited: the user must sing in synchronization with the accompaniment, which is unsuitable for scenes where it is inconvenient to make sound or where the environment is noisy, resulting in poor applicability.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a multimedia data generation scheme to solve some or all of the above problems.
According to a first aspect of embodiments of the present invention, there is provided a multimedia data generation method, including: carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result; processing the voice syllables indicated by the voice syllable information according to the beat information of the accompaniment audio to generate target voice audio matched with the accompaniment audio; and synthesizing the target voice audio and the accompaniment audio to generate multimedia data.
According to a second aspect of the embodiments of the present invention, there is provided a multimedia data generation method, including: carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result; acquiring node information for voice matching in accompaniment audio and/or audio-free video data, and processing voice syllables indicated by the voice syllable information according to the node information to generate target voice audio; and synthesizing at least one of the accompaniment audio and the video data with the target voice audio to generate multimedia data.
According to a third aspect of the embodiments of the present invention, there is provided a multimedia data processing method, including: acquiring audio data containing human voice audio according to triggering operation; acquiring multimedia data generated according to the audio data, wherein the multimedia data is obtained by identifying voice syllable information of the audio data and processing voice syllables indicated by the voice syllable information according to node information used for voice matching in accompaniment audio and/or audio-free video data to obtain target voice audio and synthesizing at least one of the accompaniment audio and the audio-free video data with the target voice audio; and providing the multimedia data on a display interface.
According to a fourth aspect of the embodiments of the present invention, there is provided a multimedia data generation apparatus, including: a first analysis module, configured to perform energy spectrum and difference spectrum analysis on the frequency spectrum of acquired human voice audio to be processed and determine human voice syllable information in the human voice audio according to the analysis result; a first target voice generation module, configured to process the human voice syllables indicated by the syllable information according to the beat information of accompaniment audio to generate target human voice audio matched with the accompaniment audio; and a first synthesis module, configured to synthesize the target human voice audio with the accompaniment audio to generate multimedia data.
According to a fifth aspect of the embodiments of the present invention, there is provided a multimedia data generation apparatus, including: a second analysis module, configured to perform energy spectrum and difference spectrum analysis on the frequency spectrum of acquired human voice audio to be processed and determine human voice syllable information in the human voice audio according to the analysis result; a second target voice generation module, configured to acquire node information for voice matching in accompaniment audio and/or audio-free video data, and process the human voice syllables indicated by the syllable information according to the node information to generate target human voice audio; and a second synthesis module, configured to synthesize at least one of the accompaniment audio and the video data with the target human voice audio to generate multimedia data.
According to a sixth aspect of the embodiments of the present invention, there is provided a multimedia data processing apparatus, including: an acquisition module, configured to acquire, according to a triggering operation, audio data containing human voice audio and data to be synthesized; a multimedia acquisition module, configured to acquire multimedia data generated from the audio data, the multimedia data being obtained by identifying human voice syllable information of the audio data, processing the human voice syllables indicated by the syllable information according to node information for voice matching in accompaniment audio and/or audio-free video data to obtain target human voice audio, and synthesizing at least one of the accompaniment audio and the audio-free video data with the target human voice audio; and a display module, configured to provide the multimedia data on a display interface.
According to a seventh aspect of the embodiments of the present invention, there is provided an electronic apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus; the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform an operation corresponding to the multimedia data generation method according to the first aspect to the second aspect or an operation corresponding to the multimedia data processing method according to the third aspect.
According to an eighth aspect of embodiments of the present invention, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the multimedia data generation method according to the first to second aspects or the multimedia data processing method according to the third aspect.
According to the multimedia data generation scheme provided by the embodiments of the invention, both the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used when the audio is analyzed to obtain the human voice syllable information, so the analysis is more comprehensive and the resulting syllable information is more accurate. The human voice syllables indicated by this information can then be processed according to the beat information of the accompaniment audio to generate the target human voice audio, and the target human voice audio is synthesized with the accompaniment audio to generate the multimedia data. In this way the rhythm of the syllables in the target human voice audio is matched to the rhythm of the accompaniment audio, which solves the prior-art problem that the user must match the accompaniment rhythm by himself and must record while the accompaniment is playing; the restriction imposed by the usage environment is thus avoided and applicability is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some of the embodiments of the present invention, and a person skilled in the art may obtain other drawings based on them.
Fig. 1a is a flowchart illustrating steps of a multimedia data generating method according to a first embodiment of the present invention;
fig. 1b is a usage scenario diagram of a multimedia data generating method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a multimedia data generating method according to a second embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of a method for generating multimedia data according to a third embodiment of the present invention;
fig. 4 is a flowchart illustrating steps of a multimedia data generating method according to a fourth embodiment of the present invention;
FIG. 5 is a flowchart of the steps of a usage scenario of a multimedia data generating method according to the present invention;
FIG. 6a is a flowchart illustrating steps of a method for generating multimedia data according to a fifth embodiment of the present invention;
fig. 6b is a usage scenario diagram of a multimedia data generating method according to a fifth embodiment of the present invention;
fig. 7 is a flowchart illustrating steps of a method for generating multimedia data according to a sixth embodiment of the present invention;
fig. 8 is a flowchart illustrating steps of a method for generating multimedia data according to a seventh embodiment of the present invention;
FIG. 9a is a flowchart illustrating steps of a method for processing multimedia data according to an eighth embodiment of the present invention;
FIG. 9b is a diagram illustrating an interface change in a usage scenario of a multimedia data processing method according to an eighth embodiment of the present invention;
FIG. 9c is a diagram illustrating an interface change in a usage scenario of another multimedia data processing method according to an embodiment eight of the present invention;
fig. 10 is a block diagram of a multimedia data generating apparatus according to a ninth embodiment of the present invention;
fig. 11 is a block diagram showing a configuration of a multimedia data generating apparatus according to a tenth embodiment of the present invention;
FIG. 12 is a block diagram of a multimedia data processing apparatus according to an eleventh embodiment of the present invention;
fig. 13 is a schematic structural diagram of an electronic device according to a twelfth embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the protection scope of the embodiments of the present invention.
The following further describes specific implementation of the embodiments of the present invention with reference to the drawings.
Example one
Referring to fig. 1a, a flowchart illustrating steps of a multimedia data generating method according to a first embodiment of the present invention is shown.
The multimedia data generation method of the embodiment comprises the following steps:
step S102: and carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result.
The human voice audio to be processed may be human voice audio recorded by the user in real time or pre-recorded human voice audio uploaded by the user, and the recording environment may be any environment. The human voice contained in the audio may be in any language (e.g., a dialect) or may be a sound containing only certain combinations of phonemes (e.g., a human imitation of animal sounds). A phoneme can be understood as a meaningful sound produced by the human vocal organs.
The frequency spectrum of the human voice audio may be a mel-frequency spectrum, and the mel-frequency spectrum may be a frequency spectrum transformed by inputting the human voice audio into a mel-scale filter bank (mel-scale filter banks). The energy spectrum of the human voice audio may be used to indicate the signal energy per unit frequency. The differential spectrum of the human voice audio may be a spectrum obtained by performing a differential operation on a mel spectrum.
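As a concrete illustration of these three representations, the following is a minimal sketch assuming librosa and NumPy as tooling (the patent itself names no library); the frame parameters are illustrative.

```python
import numpy as np
import librosa

def spectral_features(path, sr=16000, n_mels=128):
    # Load the human voice audio to be processed.
    y, _ = librosa.load(path, sr=sr, mono=True)
    # Mel spectrum: the audio passed through a mel-scale filter bank.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Energy per frame, summed over the mel bands.
    energy = mel.sum(axis=0)
    # Difference spectrum: column-wise first-order difference, negatives set to 0.
    diff = np.maximum(np.diff(mel, axis=1), 0.0)
    return mel, energy, diff
```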
By analyzing the energy spectrum and the difference spectrum in the Mel frequency spectrum, the human voice syllable information in the human voice frequency can be obtained. The human voice syllable information is used to indicate individual syllables in the human voice. When the human voice audio is analyzed, the energy spectrum and the difference spectrum in the Mel frequency spectrum of the human voice audio are comprehensively used, so that the accuracy of the syllables indicated in the human voice syllable information obtained by analysis is higher, and the subsequent processing of the syllables is facilitated.
The human syllable information can contain different information according to different requirements. For example, in the present embodiment, the human voice syllable information includes time information of a human voice syllable and consonant proportion information of the human voice syllable.
Wherein the time information of each human voice syllable can be expressed in the form of [ st, et ], st indicates the start time, i.e., the start time of the human voice syllable in the human voice audio, and et indicates the end time, i.e., the end time of the human voice syllable in the human voice audio.
The consonant proportion information indicates the consonant part and the vowel part in a human voice syllable, and the consonant part of a syllable generally comes before the vowel part. For example, if the time information of a certain human voice syllable is [1:00, 1:10] and its consonant proportion information is 60%, the consonant part is [1:00, 1:06] and the vowel part is [1:06, 1:10].
Of course, in other embodiments, the vocal syllable information may also include other information, which is not limited in this embodiment.
Step S104: and processing the voice syllables indicated by the voice syllable information according to the rhythm information of the accompaniment audio so as to generate a target voice audio matched with the accompaniment audio.
The accompaniment audio may be audio uploaded by a user or audio selected from a library of accompaniments. Tempo information of accompaniment audio includes, but is not limited to, beats per minute (i.e., BPM).
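Where the tempo information is not known in advance, it can be estimated from the accompaniment itself; the sketch below uses librosa's beat tracker as one possible (assumed) way to obtain the BPM and beat positions.

```python
import librosa

def accompaniment_tempo(path):
    y, sr = librosa.load(path, sr=None, mono=True)
    # Estimate tempo (BPM) and beat frames from the accompaniment audio.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return float(tempo), beat_times  # BPM and beat positions in seconds
```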
Those skilled in the art can appropriately process the vocal syllables as needed as long as the target vocal audio matching the accompaniment audio can be generated.
For example, syllable extraction processing, note quantization processing, and tempo matching processing may be performed on the human voice audio. Syllable extraction obtains the syllable segment of each human voice syllable from the human voice audio and may use any suitable method. Note quantization quantizes the human voice syllables into corresponding notes, and the quantization rule can be determined as required. Tempo matching aligns the human voice syllables with the beats of the accompaniment audio, and any suitable matching method may be used.
In the present embodiment, the target human voice audio refers to audio sounded by human in a musical piece (such as a rap piece) in addition to the accompaniment audio. It may be a pure human voice audio including only human voice, or may be an audio including human voice and other background sounds, as long as it is an audio matching the tempo of the accompaniment audio.
Step S106: and synthesizing the target voice audio and the accompaniment audio to generate multimedia data.
In this embodiment, the target human voice audio and the accompaniment audio may be synthesized in any suitable manner. For example, the target human voice audio and the accompaniment audio are directly superimposed, or the target human voice audio and/or the accompaniment audio are subjected to sound effect processing and then superimposed, and the like.
If the multimedia data to be generated is audio data, a synthesized audio obtained by synthesizing the target human voice audio and the accompaniment audio can be directly used as the generated multimedia data.
The following describes an implementation process of the multimedia data generation method with reference to a specific usage scenario.
As shown in fig. 1b, the multimedia data generating method may be executed by a synthesizing module, and the synthesizing module may be deployed on the server side (a server and/or a cloud) or locally on the terminal device.
In the use scenario, the synthesis module obtains to-be-processed voice audio (hereinafter referred to as voice audio) including voice of a user, and performs energy spectrum analysis and differential spectrum analysis on the voice audio to obtain voice syllable information, where the voice syllable information includes time information of voice syllables in the to-be-processed voice audio. Due to the adoption of the comprehensive analysis mode, the time information of the voice syllables analyzed is more accurate.
In addition, the synthesis module processes the human voice syllables in the human voice audio according to the rhythm information of the accompaniment audio and obtains the target human voice audio matched with the rhythm information. The accompaniment audio can be preset default accompaniment audio and can also be determined according to the operation of a user. The tempo information of the accompaniment audio may be acquired in advance or obtained by analyzing the accompaniment audio.
And then, synthesizing the accompaniment audio and the target human voice audio to generate required multimedia data.
In this embodiment, both the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used when the audio is analyzed to obtain the human voice syllable information, so the analysis is more comprehensive and the resulting syllable information is more accurate; the human voice syllables indicated by this information can then be processed according to the beat information of the accompaniment audio to generate the target human voice audio, which is synthesized with the accompaniment audio to generate the multimedia data. In this way the rhythm of the syllables in the target human voice audio is matched to the rhythm of the accompaniment audio, which solves the prior-art problem that the user must match the accompaniment rhythm by himself and must record while the accompaniment is playing; the restriction imposed by the usage environment is thus avoided and applicability is improved.
The multimedia data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, mobile terminals (such as tablet computers, mobile phones and the like), PCs and the like.
Example two
Referring to fig. 2, a flowchart illustrating steps of a multimedia data generating method according to a second embodiment of the present invention is shown.
The multimedia data generating method of the present embodiment includes the aforementioned steps S102 to S106.
Wherein the step S106 comprises the following substeps:
substep S1061: and respectively carrying out sound effect adding processing on the target human voice audio and the accompaniment audio.
After the target human voice audio is generated, sound effects are added to the target human voice audio and the accompaniment audio so that the audio in the generated multimedia data sounds better, the human voice is fuller, and the desired sound effects are present.
Different sound effects can be added according to different requirements. For example, if the audio in the multimedia data to be generated is rap audio, sound effects such as a disc-rubbing (scratch) effect and a tape effect may be added to the accompaniment audio, and sound effects such as a delay effect and a loop effect may be added to the target human voice audio.
For another example, if the audio in the multimedia data to be generated is of another style, a stereo sound effect may be added to the accompaniment audio and the target human voice audio; this embodiment places no limitation on this. The following takes the generation of rap multimedia data as an example to illustrate the process of adding sound effects:
in one implementation, sound effects are added to audio clips corresponding to pauses in the target human voice audio and the accompaniment audio according to the pauses in the target human voice audio. In order to obtain better hearing effect, different sound effects are added to corresponding audio clips according to different pause durations.
For example, sub-step S1061 includes sub-steps S1061a to S1061f: sub-steps S1061a to S1061c are performed for longer pauses, and sub-steps S1061d to S1061f are performed for shorter pauses.
Of course, in other embodiments, other sound effects may be added according to other criteria, and the embodiment does not limit this.
Specifically, the process of implementing the sub-step S1061 through the sub-steps S1061a through S1061f is as follows:
sub-step S1061 a: and acquiring first pause time information of which the pause time length is greater than a first preset time length in the target human voice audio.
The specific value of the first preset time period may be determined as needed, which is not limited in this embodiment.
A first pause is a pause in the target human voice audio. It may be determined from the human voice syllables indicated by the syllable information obtained in step S102: if the time interval between two adjacent human voice syllables is greater than the first preset duration, that interval is determined to be a first pause. Alternatively, it may be determined from the frequency spectrum of the target human voice audio: if the spectrum is blank for longer than the first preset duration, a first pause is determined.
The time information of the first pause may include a start time and an end time of the first pause. For example, if the time information of a certain first pause is [1:01,1:04], it indicates that the first pause is 1 min 01 sec to 1 min 04 sec of the target human voice audio.
Of course, the skilled person may use other ways to indicate the first pause as required, and the embodiment does not limit this.
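As a small illustration of this gap-based detection, the sketch below assumes the syllable time information is available as (start, end) pairs in seconds; the threshold value is a placeholder, not a value taken from the patent.

```python
def find_pauses(syllables, min_gap=1.0):
    """Return [start, end] pairs for gaps between adjacent syllables longer than min_gap seconds."""
    pauses = []
    for (_, prev_end), (next_start, _) in zip(syllables, syllables[1:]):
        if next_start - prev_end > min_gap:
            pauses.append([prev_end, next_start])
    return pauses

# Example: a syllable ending at 61.0 s followed by one starting at 64.0 s yields the pause [61.0, 64.0].
```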
Sub-step S1061 b: and determining at least one corresponding first audio clip from the target human voice audio and at least one corresponding second audio clip from the accompaniment audio according to the time information of the first pause.
One or more first pauses may exist in a single target human voice audio, and the first audio segment corresponding to each first pause is determined from the target human voice audio according to that pause's time information. For example, if the time information of first pause A is [1:01, 1:04], the first audio segment for that period is taken from the target human voice audio according to this time information; if the time information of first pause B is [1:20, 1:25], another first audio segment for that period is taken in the same way.
Since the target human voice audio is generated according to the beat information of the accompaniment audio, the second audio clip of the corresponding time period can be correspondingly acquired from the accompaniment audio according to the time information of the first pause in the target human voice audio. The obtaining method may be the same as the obtaining method of the first audio segment, and therefore, the description is omitted. Of course, in other embodiments, the second audio segment may be obtained in a different manner from the first audio segment, which is not limited in this embodiment.
Sub-step S1061 c: and respectively carrying out sound effect adding processing on the first audio clip and the second audio clip to generate target human voice audio and accompaniment audio after the sound effect is added.
In a specific implementation, the sub-step S1061c includes the following processes:
Procedure A1: According to a preset first sound effect algorithm, perform time-delay processing on each first audio segment, obtain a first effect segment corresponding to each first audio segment, and replace each first audio segment in the target human voice audio with its corresponding first effect segment.
In the present embodiment, the delay effect is added to the first audio piece as an example.
In this case, the preset first sound effect algorithm may be any suitable algorithm for realizing a delay effect (that is, an effect generated by performing delay playing on a section of audio according to a certain rule).
Each first audio clip is processed with the first sound effect algorithm to obtain a corresponding first effect clip that sounds delayed, and the first audio clip in the target human voice audio is replaced with that first effect clip, so that when the generated multimedia data is played, the first effect clip produces a delay effect.
The first sound effect algorithm may be any algorithm that is capable of achieving a time delay effect.
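A minimal sketch of one way such a delay (echo) effect could be realized is given below; the tap count, feedback, and mix parameters are illustrative assumptions, not values from the patent.

```python
import numpy as np

def delay_effect(segment, sr, delay_s=0.25, feedback=0.4, mix=0.5, taps=3):
    """Mix a few attenuated, delayed copies of the segment back onto itself."""
    dry = segment.astype(np.float32).copy()
    wet = np.zeros_like(dry)
    step = int(delay_s * sr)
    for i in range(1, taps + 1):
        offset = i * step
        if offset >= len(dry):
            break
        # Each delayed copy is attenuated by the feedback factor.
        wet[offset:] += dry[:-offset] * (feedback ** i)
    return (1.0 - mix) * dry + mix * wet
```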
Procedure B1: and carrying out mute processing on each second audio clip according to a preset second sound effect algorithm, acquiring a second effect clip corresponding to each second audio clip, and replacing each second audio clip in the accompaniment audio with the corresponding second effect clip.
In the present embodiment, the addition of the mute effect to the second audio piece is taken as an example for explanation.
In this case, the preset second sound effect algorithm may be an algorithm for realizing a mute effect. And calculating each second audio clip by using a second sound effect algorithm, obtaining second effect clips which are corresponding to the second audio clips and are mute in hearing, and replacing each second audio clip in the accompaniment audio with the corresponding second effect clip. Thus, in the process of playing the generated multimedia data, playing the second effect segment can have a mute effect.
The second sound effect algorithm may be any algorithm that can achieve a muting effect.
Sub-step S1061d: acquire time information of second pauses whose pause duration in the target human voice audio is less than or equal to the first preset duration and greater than a second preset duration.
The value of the second preset duration may be determined as needed, which is not limited in this embodiment. For example, if the second pause corresponds to a shorter pause in the target vocal audio (e.g., a pause in a quarter note), the value of the second predetermined duration may be determined accordingly.
In this embodiment, the manner of determining the second pause from the target human voice audio is the same as the manner of determining the first pause, and therefore, the description thereof is omitted. Of course, in other embodiments, the second pause may be determined in a different manner.
The time information of the second pause may include a start time and an end time of the second pause. For example, if the time information of a certain second pause is [1:01,1:02], it indicates that the second pause is 1 min 01 sec to 1 min 02 sec of the target human voice audio.
Of course, the skilled person may use other ways to indicate the second pause as required, and the embodiment does not limit this.
Sub-step S1061 e: and determining at least one corresponding third audio clip from the accompaniment audio according to the time information of the second pause.
As described above, since the target human voice audio is generated according to the tempo information of the accompaniment audio, according to the time information of the second pause in the target human voice audio, the audio clip of the corresponding time slot in the accompaniment audio, that is, the third audio clip, can be acquired.
Sub-step S1061 f: and performing sound effect adding processing on each third audio clip to obtain corresponding accompaniment audio.
Different sound effect processing can be applied to the third audio segments according to different requirements. For example, if the multimedia data to be generated is in a rap style, a disc-rubbing (scratch) sound effect (i.e., an effect in which a passage of audio is played back and forth with acceleration) may be added to some or all of the third audio segments as needed, or a tape sound effect (i.e., an effect that speeds up or slows down playback) may be added to some or all of the third audio segments. The tape sound effect is of two types: an accelerating tape effect that speeds up playback and a decelerating tape effect that slows it down.
For example, for a third audio clip A, audio back-and-forth processing is performed on the clip to obtain the sound-effect-processed accompaniment audio. Specifically, a preset disc-rubbing algorithm, which can apply back-and-forth playback processing to the third audio clip A, is used to compute a third effect clip A that gives the clip an audible disc-rubbing (scratch) character, and the third effect clip A replaces the third audio clip A in the accompaniment audio to obtain the processed accompaniment audio.
The disc-rubbing algorithm may compute an accelerated effect segment from the first half of the third audio clip A and a decelerated effect segment from the second half, and splice the two to obtain the third effect clip A.
The disc-rubbing algorithm may be any algorithm that can achieve a disc-rubbing effect.
And/or, for a third audio clip B, playback acceleration processing is performed on the clip to obtain the sound-effect-processed accompaniment audio. Specifically, a preset accelerating tape algorithm, which can apply playback acceleration to the third audio clip B, is used to compute a third effect clip B that gives the clip an audible accelerating-tape character, and the third effect clip B replaces the third audio clip B in the accompaniment audio to obtain the sound-effect-processed accompaniment audio.
And/or, for a third audio clip C, playback deceleration processing is performed on the clip to obtain the sound-effect-processed accompaniment audio. Specifically, a preset decelerating tape algorithm, which can apply playback deceleration to the third audio clip C, is used to compute a third effect clip C that gives the clip an audible decelerating-tape character, and the third effect clip C replaces the third audio clip C in the accompaniment audio to obtain the sound-effect-processed accompaniment audio.
The speed-up and speed-down tape algorithms may be any suitable algorithm that achieves the corresponding effect.
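As a rough illustration of the disc-rubbing and tape effects described above, the sketch below changes playback speed by simple resampling (so pitch shifts with speed, as on a turntable or tape); the rate values are illustrative assumptions.

```python
import numpy as np

def change_speed(clip, rate):
    """Resample-style speed change: rate > 1 speeds playback up, rate < 1 slows it down."""
    idx = np.arange(0.0, len(clip), rate)
    return np.interp(idx, np.arange(len(clip)), clip)

def disc_rub(clip, rate_fast=1.5, rate_slow=0.75):
    """Accelerate the first half, decelerate the second half, and splice the two."""
    half = len(clip) // 2
    return np.concatenate([change_speed(clip[:half], rate_fast),
                           change_speed(clip[half:], rate_slow)])
```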
Of course, in other embodiments, other sound effects may be added to the first, second, and third audio clips, for example a loop sound effect, in which a section of audio is repeated and the repetition covers the audio that follows it.
Substep S1062: and synthesizing the target human voice audio and the accompaniment audio after the sound effect is added to generate the multimedia data.
After the sound effect is added to the target human voice audio and the accompaniment audio, other processing can be carried out on the target human voice audio and/or the accompaniment audio according to the quality (such as loudness, definition and the like) of the target human voice audio and/or the quality of the accompaniment audio, and the processed target human voice audio and the accompaniment audio are synthesized to generate multimedia data.
For example, in the present embodiment, in order to make the sound of the audio clearer and have sufficient loudness, the sub-step S1062 includes sub-steps S1062a to S1062 c.
Specifically, sub-step S1062 is realized through sub-steps S1062a to S1062c as follows:
Sub-step S1062a: calculate the loudness of the sound-effect-processed target human voice audio and the loudness of the sound-effect-processed accompaniment audio, and perform loudness gain processing on the sound-effect-processed target human voice audio and on the sound-effect-processed accompaniment audio respectively according to the calculated loudness.
Loudness is the intensity of sound perceived by the human ear. Those skilled in the art can calculate the loudness of the target human voice audio after the sound effect processing and the loudness of the accompaniment audio after the sound effect processing in any suitable manner as required, which is not limited in the embodiment.
The loudness gain processing for the post-effection target human voice audio and the post-effection accompaniment audio may be performed by one skilled in the art in any suitable manner. The loudness gain may be a positive gain (i.e., increasing loudness) or a negative gain (i.e., decreasing loudness), and this embodiment is not limited thereto.
Sub-step S1062 b: and superposing the target human voice audio subjected to gain processing and the accompaniment audio to obtain a synthetic audio.
After gain processing has been applied to the target human voice audio and the accompaniment audio respectively, the two are superimposed (i.e., mixed) to obtain a synthesized audio, so that the target human voice audio and the accompaniment audio are combined into a single audio track.
Sub-step S1062 c: and generating the multimedia data according to the synthesized audio.
In a particular implementation, the sub-step S1062c is implemented as: and carrying out total gain processing on the synthesized audio, and generating the multimedia data according to the synthesized audio after the total gain.
For example, the total gain processing is performed on the synthesized audio with a limiter, which can also be understood as applying a limiting effect to the synthesized audio, that is, processing its dynamic range so that the processed audio sounds fuller and better. The limiting effect is an extreme case of the compression effect, and compression is a common dynamic-range processing method for audio that reduces the dynamic range.
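A simplified sketch of this loudness-balance-then-limit stage is shown below; RMS is used as a stand-in for whatever perceptual loudness measure is chosen, and tanh soft clipping stands in for the limiter, both being illustrative assumptions.

```python
import numpy as np

def rms_gain(x, target_rms=0.1):
    """Scale a signal toward a common RMS level (loudness gain processing)."""
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2)) + 1e-12
    return x * (target_rms / rms)

def mix_and_limit(vocal, accomp, ceiling=0.95):
    """Balance the two tracks, superimpose them, then apply a soft limit as the total gain stage."""
    n = min(len(vocal), len(accomp))
    mix = rms_gain(vocal[:n]) + rms_gain(accomp[:n])
    # Soft limiting keeps peaks below the ceiling while reducing the dynamic range.
    return ceiling * np.tanh(mix / ceiling)
```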
If the multimedia data is audio data, the synthesized audio with the total gain can be output as the audio with the accompaniment. If the multimedia data is video data, the synthesized audio with the total gain can be superimposed with the image data to obtain the multimedia data.
Through this embodiment, the first audio clips in the target human voice audio and the second and/or third audio clips in the accompaniment audio are obtained according to the time information of the first pauses and/or the second pauses, and sound effects are added to these clips, so that the sound effects of the target human voice audio and the accompaniment audio are richer, audio of different styles can be synthesized, and the resulting multimedia data sounds fuller and more interesting.
Through the loudness gain processing of the target human voice audio and the accompaniment audio, the loudness of the two can be better balanced, giving the user a better listening experience without harming the user's hearing through excessive loudness.
In addition, both the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used when the audio is analyzed to obtain the human voice syllable information, so the analysis is more comprehensive and the resulting syllable information is more accurate; the human voice syllables indicated by this information can then be processed according to the beat information of the accompaniment audio to generate the target human voice audio, which is synthesized with the accompaniment audio to generate the multimedia data. In this way the rhythm of the syllables in the target human voice audio is matched to the rhythm of the accompaniment audio, which solves the prior-art problem that the user must match the accompaniment rhythm by himself and must record while the accompaniment is playing; the restriction imposed by the usage environment is thus avoided and applicability is improved.
The multimedia data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, mobile terminals (such as tablet computers, mobile phones and the like), PCs and the like.
EXAMPLE III
Referring to fig. 3, a flowchart illustrating steps of a multimedia data generating method according to a third embodiment of the present invention is shown.
The multimedia data generating method of the present embodiment includes the aforementioned steps S102 to S106. The substep S106 may be implemented in the manner described in the second embodiment, or implemented in other manners.
Wherein, step S102 includes the following substeps:
substep S1021: and carrying out noise reduction processing on the acquired original audio.
This step is an optional step, and one skilled in the art can determine whether to perform sub-step S1021 as needed.
The original audio may be any audio including human voice, which may be recorded by the user in real time, or may be pre-recorded audio loaded from a storage space, which is not limited in this embodiment.
After the user inputs the original audio, the original audio is subjected to noise reduction processing to remove the environmental noise in the original audio. The noise reduction can be performed by any suitable means as desired by those skilled in the art, for example, using the webrtc noise reduction method, or a modification thereof.
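The patent cites the WebRTC noise suppressor as one option; purely for illustration, the sketch below uses spectral gating from the noisereduce package, which is an assumed substitute rather than the patent's method, and the file name is hypothetical.

```python
import librosa
import noisereduce as nr

# Load the original audio (hypothetical file name) and suppress stationary background noise.
y, sr = librosa.load("original_audio.wav", sr=16000, mono=True)
y_denoised = nr.reduce_noise(y=y, sr=sr)
```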
Substep S1022: and carrying out human voice activity detection on the audio subjected to noise reduction.
This step is an optional step, and one skilled in the art can determine whether to perform sub-step S1022 as needed.
The voiced portion of the noise-reduced audio can be obtained through voice activity detection (VAD), so that subsequent processing can be carried out on the voiced portion. A person skilled in the art may perform voice activity detection in any suitable way.
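One possible detector (an assumed choice, since the patent does not prescribe one) is the WebRTC VAD via the py-webrtcvad binding; it expects 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames.

```python
import webrtcvad

def voiced_flags(pcm16_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Return one True/False flag per frame indicating detected voice activity."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    flags = []
    for i in range(0, len(pcm16_bytes) - frame_bytes + 1, frame_bytes):
        flags.append(vad.is_speech(pcm16_bytes[i:i + frame_bytes], sample_rate))
    return flags
```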
Substep S1023: analyzing an energy spectrum of a frequency spectrum of the human voice audio to be processed to obtain a first sequence for indicating each human voice section in the human voice audio to be processed; and analyzing a difference spectrum of the frequency spectrum of the human voice audio to be processed to obtain a second sequence used for indicating each human voice syllable in the human voice audio to be processed.
In one specific implementation, the following processes are included:
Procedure A2: Compute the mel spectrum of the human voice audio, then compute the corresponding energy spectrum and difference spectrum from the mel spectrum, where each column of the difference spectrum is differenced and any negative difference value is set directly to 0.
Procedure B2: the difference spectra are summed column by column to obtain a one-dimensional vector.
Procedure C2: the one-dimensional vector obtained in process B2 is subjected to a gaussian smoothing.
Procedure D2: the results obtained in process C2 were normalized once.
Procedure E2: Perform peak picking (findpeaks) on the result of procedure D2, i.e., find reasonable local maxima (peaks) in the one-dimensional vector, and output the result as the second sequence.
Procedure F2: A slicing process is performed: the (one-dimensional) energy spectrum obtained in procedure A2 undergoes a similar peak-picking process, but the output is not a set of local maxima; it is a one-dimensional vector isVowels of the same size as the energy spectrum containing only 0s and 1s, where 0 indicates "no vowel" and 1 indicates "vowel". The vector isVowels is then corrected using the local maxima computed from the difference spectrum. The correction rules may be determined as needed; for example, if a local maximum falls in a portion where isVowels is 0, isVowels is set to 1 from up to 8 frames before the local maximum until the first frame after it where isVowels is 1.
Process G2: According to the output of the voice activity detection process, i.e., the detected vocal portions (called regions), the regions and isVowels are corrected against each other: their start and end times are compared and the details are adjusted, and the second sequence is generated from the adjusted isVowels.
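A compact sketch of procedures A2 through E2 (difference spectrum, column sums, Gaussian smoothing, normalization, peak picking) is given below; the smoothing width and peak-picking constraints are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def syllable_onset_peaks(y, sr, n_mels=128):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # procedure A2
    diff = np.maximum(np.diff(mel, axis=1), 0.0)                     # negative differences set to 0
    novelty = diff.sum(axis=0)                                       # procedure B2: column-wise sum
    novelty = gaussian_filter1d(novelty, sigma=2)                    # procedure C2: Gaussian smoothing
    novelty = novelty / (novelty.max() + 1e-12)                      # procedure D2: normalization
    peaks, _ = find_peaks(novelty, height=0.1, distance=4)           # procedure E2: peak picking
    return peaks  # frame indices forming the second sequence
```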
Substep S1024: and determining the voice syllable information in the voice audio according to the first sequence and the second sequence.
In a specific implementation, the sub-step S1024 includes the following processes:
process H2: and according to the adjusted isVowels, segmenting the target human voice audio to obtain a note meeting the requirement, calculating a new audio segmentation according to the speech recognition engine after obtaining the note, and within the range of each note, if the number of new audio segmentation points is more than two, segmenting the note again according to the new audio segmentation, otherwise, carrying out no processing.
Procedure I2: Output the information of each human voice syllable according to the obtained note information and the adjusted isVowels information.
Wherein, the voice syllable information may include time information and consonant proportion information of the voice syllable. The time information of the vocal syllables may be a sequence of [ st, et ], st indicating the start time and et indicating the end time.
The proportion of the consonant part in each human voice syllable can be determined by analyzing the proportion of frames within each note for which isVowels is 0.
Substep S1025: the human voice is preprocessed.
This step is an optional step, and one skilled in the art can determine whether sub-step S1025 is performed as needed.
The pretreatment process may be different according to the need. For example, in the present embodiment, the preprocessing includes compression processing, equalization processing, and dynamic equalization processing. Of course, in other embodiments, the pre-processing may include only the aforementioned partial processing, or may include processing other than the aforementioned processing.
Compression processes the dynamic range of the audio so that the dynamic range is reduced, the human voice sounds fuller, and the loudness is relatively higher.
The equalization processing is an operation of performing different gains or attenuations on components of different frequency bands in the audio, and the gain amount and the attenuation amount can be determined according to needs.
The dynamic equalization processing is an operation of performing different gains or attenuations on components of different frequency bands in the audio, and the amount of the gain and the amount of the attenuation can be dynamically changed along with the condition of the audio. The human voice can be more natural through the equalization processing and the dynamic equalization processing.
Through this embodiment, voice activity detection is performed on the noise-reduced audio, and the human voice syllable information of the human voice audio is determined from the voice activity detection result together with the energy spectrum analysis result and the difference spectrum analysis result, so that the obtained syllable information is more accurate.
In addition, both the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used when the audio is analyzed to obtain the human voice syllable information, so the analysis is more comprehensive and the resulting syllable information is more accurate; the human voice syllables indicated by this information can then be processed according to the beat information of the accompaniment audio to generate the target human voice audio, which is synthesized with the accompaniment audio to generate the multimedia data. In this way the rhythm of the syllables in the target human voice audio is matched to the rhythm of the accompaniment audio, which solves the prior-art problem that the user must match the accompaniment rhythm by himself and must record while the accompaniment is playing; the restriction imposed by the usage environment is thus avoided and applicability is improved.
The multimedia data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, mobile terminals (such as tablet computers, mobile phones and the like), PCs and the like.
Example four
Referring to fig. 4, a flowchart illustrating steps of a multimedia data generating method according to a fourth embodiment of the present invention is shown.
The multimedia data generating method of the present embodiment includes the aforementioned steps S102 to S106. Step S106 may be implemented in the manner described in embodiment two, or implemented in other manners. Step S102 may be implemented in the manner described in the third embodiment, or implemented in other manners.
Wherein the step S104 comprises the following substeps:
substep S1041: and according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable.
For example, the human voice audio is segmented according to the time information, and each syllable is extracted, that is, by cutting the human voice audio, a syllable segment corresponding to the time information of the human voice syllable is obtained.
Substep S1042: and quantizing the syllable segments of the human voice syllables into corresponding musical notes.
In one embodiment, the extracted syllable segments are quantized to notes according to a predetermined quantization rule, e.g., quarter notes, eighth notes, sixteenth notes, etc.
Since each syllable segment has a corresponding length, the median of the lengths of all syllable segments can be computed and taken as the unit length (unitdur); the length of each syllable segment is then divided by the unit length and the base-2 logarithm of the ratio is taken, that is, log(x/unitdur)/log(2), as the quantization statistic, where x is the length of the given syllable segment.
The quantization statistic is compared with the four values {-1, 0, 1, 2} to find the closest one, and the quantized note is determined from that closest value.
For example, if the closest value is -1, the quantized note is a sixteenth note; if it is 0, a sixteenth note; if it is 1, an eighth note; and if it is 2, a quarter note.
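A small sketch of this quantization step is shown below; the note mapping simply mirrors the values stated above.

```python
import numpy as np

NOTE_BY_VALUE = {-1: "sixteenth", 0: "sixteenth", 1: "eighth", 2: "quarter"}  # mapping as stated above

def quantize_syllables(lengths_s):
    """Map each syllable length to a note via log2(length / median length) snapped to {-1, 0, 1, 2}."""
    unitdur = np.median(lengths_s)
    notes = []
    for x in lengths_s:
        stat = np.log2(x / unitdur)
        closest = min((-1, 0, 1, 2), key=lambda v: abs(stat - v))
        notes.append(NOTE_BY_VALUE[closest])
    return notes
```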
Of course, in other embodiments, the quantization may be performed in other manners, and the present embodiment is not limited thereto.
Substep S1043: and generating target human voice audio matched with the accompaniment audio according to the musical notes corresponding to the syllable segments, the consonant proportion information corresponding to the syllable segments and the rhythm information of the accompaniment audio.
In a specific implementation, the sub-step S1043 may be implemented as: determining the consonant part and the vowel part of each syllable segment according to the consonant proportion information; stretching or compressing the vowel part of the syllable segment according to the tempo information of the accompaniment audio so that the duration of the vowel part matches the duration indicated by the note corresponding to that segment; and generating a cappella audio matched with the accompaniment audio from the processed syllable segments.
After the syllable segments are quantized to notes, the quantized notes are aligned with the beats of the accompaniment audio according to its beat information (which includes the BPM, i.e., beats per minute), and the corresponding syllable segments are then compressed or stretched according to the duration indicated by each note.
During compression or stretching, only the vowel part of the syllable segment is processed to match the time length of the whole syllable segment with the time length indicated by the note, so that the time length of the syllable segment can be changed without changing the tone of the syllable segment, and distortion is avoided.
After each syllable segment has been processed, the processed syllable segments are linked together in time order to form the a cappella audio.
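The following Python sketch illustrates one possible way to carry out this processing; the BEATS_PER_NOTE mapping, the sampling rate, and all function names are assumptions, and np.interp is only a stand-in for a pitch-preserving time-stretch (e.g. WSOLA or a phase vocoder), since plain resampling would change the pitch.

```python
import numpy as np

# Assumed duration of each note type in beats, relative to a quarter note.
BEATS_PER_NOTE = {"quarter": 1.0, "eighth": 0.5, "sixteenth": 0.25}

def note_duration_seconds(note, bpm):
    """Target duration of a quantized note given the accompaniment BPM."""
    return BEATS_PER_NOTE[note] * 60.0 / bpm

def stretch_vowel(segment, consonant_ratio, target_len):
    """Keep the consonant part and stretch/compress only the vowel part.

    segment is a mono waveform (np.ndarray); consonant_ratio is the fraction
    of the segment occupied by the consonant.
    """
    split = int(len(segment) * consonant_ratio)
    consonant, vowel = segment[:split], segment[split:]
    if len(vowel) == 0:                    # degenerate segment: nothing to stretch
        return segment
    vowel_target = max(target_len - len(consonant), 1)
    idx = np.linspace(0, len(vowel) - 1, vowel_target)
    vowel_stretched = np.interp(idx, np.arange(len(vowel)), vowel)
    return np.concatenate([consonant, vowel_stretched])

def build_a_cappella(segments, notes, consonant_ratios, bpm, sr=44100):
    """Link the processed syllable segments together in time order."""
    processed = []
    for seg, note, ratio in zip(segments, notes, consonant_ratios):
        target = int(note_duration_seconds(note, bpm) * sr)
        processed.append(stretch_vowel(seg, ratio, target))
    return np.concatenate(processed)
```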
With this embodiment, the syllable segments are quantized into corresponding musical notes, and the time length of each syllable segment is adjusted according to the quantized notes and the rhythm information of the accompaniment audio, so that the generated a cappella audio corresponds well to the rhythm of the accompaniment audio; this ensures good correspondence in the subsequently synthesized audio and improves the synthesis effect.
In addition, when the human voice audio is analyzed to obtain the human voice syllable information, the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used together, so that the analysis is more comprehensive and the obtained human voice syllable information is more accurate. The human voice syllables indicated by the human voice syllable information can then be processed according to the beat information of the accompaniment audio to generate the target human voice audio, and the target human voice audio and the accompaniment audio are synthesized to generate the multimedia data. In this way, the rhythm of the human voice syllables in the target human voice audio is matched with the rhythm of the accompaniment audio, which solves the problem in the prior art that the user has to match the accompaniment rhythm by himself and must record the voice while the accompaniment is playing, thereby avoiding the restriction of the usage environment and improving the applicability.
The multimedia data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, mobile terminals (such as tablet computers, mobile phones and the like), PCs and the like.
Usage scenarios
As shown in fig. 5, a schematic flow chart of generating rap audio according to the multimedia data generation method in this usage scenario is shown. In this usage scenario, the process of generating rap audio is described by taking as an example the multimedia data generation method of the first to fourth embodiments, executed by a smart terminal.
Step A: the input initial audio is obtained.
For example, a user may input initial audio by operating a smart terminal. The initial audio may be pre-recorded audio or real-time recorded audio.
Step B: general noise reduction processing is performed on the input initial audio to eliminate environmental noise as much as possible.
Step C: human voice activity detection is performed on the noise-reduced audio.
The noise-reduced audio is the human voice audio. Through human voice activity detection, it can be determined which time periods in the noise-reduced audio contain human voice and which do not: the time periods containing human voice are voice-active regions, and the time periods without human voice are voice-inactive regions.
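The embodiments do not prescribe a particular voice activity detection algorithm; the following is only a minimal energy-threshold sketch, assuming a mono waveform and an illustrative threshold, to show how frame-level activity flags could be produced.

```python
import numpy as np

def voice_activity(audio, sr, frame_ms=25, threshold_db=-35.0):
    """Mark each frame of a mono waveform as voiced (True) or silent (False)."""
    frame_len = int(sr * frame_ms / 1000)
    flags = []
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(np.square(frame)) + 1e-12)   # avoid log of zero
        flags.append(20 * np.log10(rms) > threshold_db)    # frame energy in dB
    return flags
```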
Step D: head analysis is performed on human voice audio.
The sound head duration may refer to the duration of the beginning portion of a note in audio, or to the duration of the beginning portion of a phoneme in speech. In this usage scenario, the sound head analysis is used to determine the start time and end time of each human voice syllable in the human voice audio, as well as the consonant part and vowel part of each syllable (this information can be characterized by the consonant proportion).
The results of the sound head analysis may be cached in a cache library for subsequent use. Some accompaniment audio can also be cached in the cache library, so that the accompaniment audio loads faster.
Step E: the human voice is preprocessed.
For example, compression processing, equalization processing and dynamic equalization processing are performed on the noise-reduced human voice audio, so that the human voice in the audio is louder and clearer. The preprocessed human voice audio can be buffered in the cache library for later use.
Step F: performing syllable extraction
According to the sound head analysis result, audio segmentation processing is performed on the human voice audio, and the syllable segments corresponding to the human voice syllables in the human voice audio are extracted. For Chinese, one syllable corresponds to one character.
Step G: note quantization processing is performed.
All extracted syllable segments are quantized into corresponding notes according to a certain quantization rule, for example quarter notes, eighth notes, sixteenth notes, and so on.
Step H: acabela audio (i.e., target human voice audio) is generated.
Based on the quantized notes and the tempo information (i.e., BPM) of the selected accompaniment audio, the vowel part of each syllable segment is compressed or stretched to the time length indicated by the corresponding note, and the processed syllable segments are linked together to form the a cappella audio.
The accompaniment audio may be accompaniment audio selected by the user from an accompaniment library, in which all of the rap accompaniment audio is stored. The accompaniment audio buffered in the cache library may be determined according to its usage popularity among all users, and may be a part of the accompaniment audio stored in the accompaniment library.
Step I: post-processing is performed on the a cappella audio and the accompaniment audio.
Various audio effect processing is performed on the a cappella audio and the accompaniment audio respectively, so as to increase the fullness of the audio. These audio effects include, but are not limited to, a loop effect, a delay effect, a scratch (disc-rub) effect, a tape effect, and the like.
Step J: and carrying out automatic sound mixing processing.
The loudness of the a cappella audio and the loudness of the accompaniment audio are calculated respectively, corresponding loudness gain processing is performed on each according to its loudness, and the a cappella audio and the accompaniment audio are then superposed to generate the synthesized audio. Finally, total gain processing is performed on the synthesized audio through a limiter, producing the rap audio with accompaniment.
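A simplified sketch of this automatic mixing step is given below; the RMS measure stands in for a full loudness model, the final peak scaling is only a rough substitute for a real limiter, and the target loudness and limit values are assumptions.

```python
import numpy as np

def rms_loudness(x):
    """RMS estimate used here as a stand-in for a full loudness measure."""
    return np.sqrt(np.mean(np.square(x)))

def auto_mix(a_cappella, accompaniment, target_rms=0.1, limit=0.99):
    """Apply loudness gains, superpose the two tracks, then apply a total gain."""
    n = min(len(a_cappella), len(accompaniment))
    vocal = a_cappella[:n] * (target_rms / max(rms_loudness(a_cappella), 1e-9))
    inst = accompaniment[:n] * (target_rms / max(rms_loudness(accompaniment), 1e-9))
    mix = vocal + inst
    peak = np.max(np.abs(mix))
    if peak > limit:                  # crude substitute for the limiter stage
        mix *= limit / peak
    return mix
```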
Through the process of this usage scenario, the user does not need to read text along with the accompaniment in order to generate the human voice audio. The energy spectrum and the difference spectrum are used to perform the sound head analysis on the human voice audio, so that the human voice syllable information obtained by the analysis is more accurate, the subsequently generated target human voice audio is of better quality, the generated multimedia data meets the requirements, and the applicability is improved.
In the process of generating the a cappella audio, the human voice syllables are quantized into musical notes, and the a cappella audio is generated according to the tempo information of the accompaniment audio, so that the generated a cappella audio matches the rhythm of the accompaniment audio, which further ensures the effect of the generated multimedia data.
EXAMPLE five
Referring to fig. 6a, a flowchart illustrating steps of a multimedia data generating method according to a fifth embodiment of the present invention is shown.
In this embodiment, the multimedia data generating method includes:
step S602: and carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result.
The voice audio to be processed can be voice-containing audio recorded in real time by the user, or pre-recorded voice-containing audio, or voice-simulating audio converted according to characters input by the user, and the like.
The implementation of the energy spectrum and difference spectrum analysis of the human voice audio to be processed may be the same as that in the first embodiment, and is therefore not described again here. When the human voice audio is analyzed, the energy spectrum and the difference spectrum of the Mel spectrum of the human voice audio are used together, so that the syllable segments obtained by the analysis are more accurate, and the multimedia data synthesized after the syllable segments are subsequently processed is more authentic and natural.
In this embodiment, the obtained vocal syllable information includes time information of the vocal syllables, and the vocal syllable information may further include consonant proportion information of the vocal syllables as needed.
The time information of the human voice syllables can indicate the time of one human voice syllable in the form of a start time and an end time.
The consonant proportion information may indicate the consonant part and the vowel part in a human voice syllable by a consonant proportion value (or a vowel proportion value). Since in a human voice syllable the consonant generally comes first and the vowel follows, the consonant time span and the vowel time span can be determined from the time information of the syllable and the consonant proportion value (or vowel proportion value).
Of course, the vowel part and the consonant part in a human voice syllable may also be indicated directly by the consonant end time or the vowel start time, which is not limited in this embodiment.
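For illustration, assuming the consonant part precedes the vowel part as described above, the two time spans can be derived from the syllable time information and the consonant proportion value as sketched below; the function name and the figures are hypothetical.

```python
def consonant_vowel_times(start, end, consonant_ratio):
    """Split a syllable's [start, end] span (seconds) into consonant and vowel spans."""
    split = start + (end - start) * consonant_ratio
    return (start, split), (split, end)

# e.g. a syllable spanning 1.20 s to 1.50 s with a 30 % consonant proportion:
# consonant span is roughly (1.20, 1.29), vowel span is roughly (1.29, 1.50)
print(consonant_vowel_times(1.20, 1.50, 0.3))
```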
Step S604: acquiring node information used for voice matching in accompaniment audio and/or audio-free video data, and processing voice syllables indicated by the voice syllable information according to the node information to generate target voice audio.
The accompanying audio and the video data without audio (hereinafter, simply referred to as video data) may be collectively referred to as data to be synthesized.
If the multimedia data to be generated is audio/video data, the data to be synthesized at least comprises video data. When the multimedia data contains the accompaniment, the data to be synthesized may further include accompaniment audio.
If the multimedia data to be generated is audio data, the data to be synthesized only includes accompaniment audio.
The node information for performing voice matching in the video data may be understood as time information of a video segment that needs dubbing. For example, a video segment requiring dubbing includes, but is not limited to, a video segment of a person speaking.
The node information of the accompaniment audio may be understood as the beat information of the accompaniment audio, which has been described in detail in the foregoing embodiments and will not be described herein again.
Note that, the node information may be included in the accompanying audio and/or video data in advance, for example, time information of a video clip of each person speaking in the video data is determined in advance and marked in the video data, or a beat of the accompanying audio is marked in the accompanying audio, and the like. Of course, the node information may also be obtained by analyzing and processing the accompanying audio and/or video data in real time.
The person skilled in the art can process the human voice audio according to the node information in any suitable manner, for example, perform a syllable segment extraction process, a syllable segment stretching process or a compressing process, a sound effect adding process, and the like, as long as the target human voice audio matching the node information can be obtained.
Step S606: and synthesizing at least one of the accompaniment audio and the video data with the target voice audio to generate multimedia data.
In a scene of dubbing video data, the video data and the target human voice audio are synthesized, or the video data, the accompaniment audio and the target human voice audio are synthesized.
In the audio synthesis scene, the accompaniment audio and the target human voice audio are synthesized.
According to this embodiment, when the human voice audio is analyzed to obtain the human voice syllable information, the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used together, so that the analysis is more comprehensive and the obtained human voice syllable information is more accurate. The human voice syllables indicated by the human voice syllable information can then be processed according to the node information to generate the target human voice audio, and the target human voice audio is synthesized with at least one of the accompaniment audio and the video data to generate the multimedia data. In this way, the human voice syllables of the target human voice audio can be matched with the node information (for example, the beats of the accompaniment audio), which solves the problem in the prior art that the user has to match the node information by himself while recording the human voice audio, thereby avoiding the restriction of the usage environment and improving the applicability.
The following describes a process of generating multimedia data with reference to a specific usage scenario:
as shown in fig. 6b, in this usage scenario, the multimedia data generation method is performed by a synthesis module. The synthesis module is configured at the server side (which includes a server and/or a cloud), and the server side generates the multimedia data and returns it to the terminal device. Configuring the synthesis module at the server side makes full use of the strong data processing capability of the server side and increases the processing speed while ensuring the processing effect on the human voice audio and the video data. Of course, in other usage scenarios, the synthesis module may also be configured locally on the user's terminal device, so that the multimedia data is generated locally on the terminal device.
In the use scene, a user selects human voice audio and data to be synthesized through terminal equipment, and the multimedia data generation process is as follows:
as shown in fig. 6b, the user inputs the human voice audio to be processed (hereinafter referred to as human voice audio) and the data to be synthesized (in this usage scenario, video data) through the terminal device, and sends them to the synthesis module.
Specifically, in the synthesis module, energy spectrum and difference spectrum analysis is performed on the frequency spectrum of the human voice audio to obtain the human voice syllable information, which indicates the time information of each human voice syllable in the human voice audio. In this way, the energy spectrum and the difference spectrum are used together to accurately determine each human voice syllable in the human voice audio; compared with prior-art syllable extraction methods, this approach has higher accuracy, which ensures the authenticity and naturalness of the subsequently synthesized multimedia data.
In addition, in the synthesis module, node information is also acquired from the video data and/or the accompaniment audio, that is, time information of a video clip required to be matched with the voice is acquired from the video data, and/or beat information is acquired from the accompaniment audio.
And carrying out proper processing on the voice syllables indicated by the voice syllable information according to the node information to obtain the target voice audio matched with the node information.
The target human voice audio and the video data are synthesized to generate the multimedia data in which the video data is dubbed, and the synthesized multimedia data is output to the terminal device. After receiving the multimedia data, the terminal device can display it to the user.
EXAMPLE six
Referring to fig. 7, a flowchart illustrating steps of a multimedia data generating method according to a sixth embodiment of the present invention is shown.
In this embodiment, the implementation process of the multimedia data generation method is specifically described for the case where the data to be synthesized includes only video data. The multimedia data generation method includes steps S602 to S606 described in the fifth embodiment.
The implementation process of step S602 may adopt the implementation process of step S602 in the fifth embodiment, and is therefore not described again here.
In the present embodiment, for the case where only video data is included in the data to be synthesized, step S604 includes the following sub-steps:
sub-step S6041 a: and acquiring time information of a video clip to be subjected to voice matching in the video data.
The video segment to be subjected to voice matching can be understood as a video segment to be dubbed. For example, a video clip of a character speaking a speech in the video data.
In a specific implementation, the sub-step S6041a may be implemented as: performing image recognition processing on at least part of image frames in the video data, and determining a video clip containing continuous mouth shape opening and closing as a video clip to be subjected to voice matching according to an image recognition result; and acquiring time information of a video clip to be subjected to voice matching.
For example, image frames in the video data are input into a trained neural network model capable of mouth shape recognition, the model is used to recognize whether the mouth of a person in each image frame is open or closed, and a video clip consisting of a plurality of image frames in which the mouth shape changes continuously is then determined. The neural network model may be any model capable of image recognition, such as a convolutional neural network model.
The start time and the end time of the consecutive image frames can be used as the time information of the corresponding video segment.
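A minimal sketch of this post-processing is shown below; it assumes per-frame mouth-movement flags are already available from the recognition model, and the frame rate and minimum segment length are assumptions.

```python
def mouth_segments(frame_flags, fps, min_frames=5):
    """Group consecutive frames with mouth movement into (start_s, end_s) clips.

    frame_flags holds one boolean per frame, True when the recognition model
    judges the mouth shape to be opening or closing.
    """
    segments, start = [], None
    for i, moving in enumerate(frame_flags):
        if moving and start is None:
            start = i                                    # a segment begins
        elif not moving and start is not None:
            if i - start >= min_frames:
                segments.append((start / fps, i / fps))  # a segment ends
            start = None
    if start is not None and len(frame_flags) - start >= min_frames:
        segments.append((start / fps, len(frame_flags) / fps))
    return segments
```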
Substep S6042 a: and according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable.
For example, the human voice audio is segmented according to the time information, and each syllable is extracted, that is, by cutting the human voice audio, a syllable segment corresponding to the time information of the human voice syllable is obtained.
It should be noted that there is no strict timing relationship between step S6041a and step S6042a, and the two steps may be executed in any order, sequentially or in parallel.
Substep S6043 a: and according to the time information of the video segments and the time information of the syllables of the human voice, stretching or compressing the syllable segments corresponding to the video segments to generate the target human voice audio.
Taking a video clip in which a character speaks a line in the video data as an example, the time information of the video clip is [1:00, 1:03]. If it is determined that the video clip corresponds to 9 syllable segments of human voice syllables with a total duration of 2 seconds, then, in order to avoid a mismatch between sound and picture in the dubbed video clip, at least some of the syllable segments are stretched to increase the duration of their vowel parts, so that the target human voice audio matched with the time information of the video clip is generated; that is, the total duration of the syllable segments corresponding to the video clip in the target human voice audio matches the duration of the video clip.
Similarly, if the total duration of the syllable segments corresponding to a certain video segment is greater than the duration of the video segment, the vowel part of at least some of the syllable segments can be compressed.
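One possible way to compute the adjusted syllable durations is sketched below; it assumes a single stretch factor applied to all vowel parts and a clip longer than the total consonant duration, and the durations and proportions are illustrative only.

```python
def fit_syllables_to_clip(syllable_durs, consonant_ratios, clip_dur):
    """Return new syllable durations whose vowel parts exactly fill the video clip."""
    consonants = [d * r for d, r in zip(syllable_durs, consonant_ratios)]
    vowels = [d - c for d, c in zip(syllable_durs, consonants)]
    scale = (clip_dur - sum(consonants)) / sum(vowels)   # common vowel stretch factor
    return [c + v * scale for c, v in zip(consonants, vowels)]

# 9 syllables totalling 2 s must fill a 3-second clip such as [1:00, 1:03]
durations = [0.25, 0.2, 0.25, 0.2, 0.25, 0.2, 0.25, 0.2, 0.2]
print(sum(fit_syllables_to_clip(durations, [0.3] * 9, 3.0)))  # -> approximately 3.0
```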
In step S606, the target human voice audio and the video data may be synthesized to generate the multimedia data, i.e., the dubbed video data.
Through the embodiment, the syllable segments of the voice syllables are processed according to the time information of the video segments in the video data, so that the target voice audio matched with the time information of the video segments is obtained, the target voice audio and the video data are synthesized, the multimedia data are generated, the video data are automatically dubbed, and the reality and the naturalness of the dubbed multimedia data are ensured.
EXAMPLE seven
Referring to fig. 8, a flowchart illustrating steps of a multimedia data generating method according to a seventh embodiment of the present invention is shown.
In this embodiment, the implementation process of the multimedia data generation method is specifically described for the case where the data to be synthesized includes video data and accompaniment audio. The multimedia data generation method includes steps S602 to S606 described in the fifth embodiment.
The implementation process of step S602 may adopt the implementation process of step S602 in the fifth embodiment, and is therefore not described again here.
In this embodiment, for the case where the data to be synthesized includes video data and accompaniment audio, step S604 includes the following sub-steps:
sub-step S6041 b: obtaining beat information in the accompaniment audio, and obtaining syllable segments corresponding to the voice syllables according to the voice syllable information.
Tempo information in the accompaniment audio includes, but is not limited to, beats per minute.
In this embodiment, since the voice syllable information includes time information of voice syllables, obtaining syllable segments corresponding to the voice syllables according to the voice syllable information may be implemented as: and according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable.
The human voice audio can be segmented according to the time information of the human voice syllables in the human voice syllable information, and syllable segments of each human voice syllable are obtained.
Substep S6042 b: and performing stretching processing or compressing processing on the syllable segments corresponding to the beats according to the beat information to generate the target human voice audio.
In one implementation, the syllable segments are quantized into corresponding notes, for example in the manner of sub-step S1042 in the foregoing embodiment.
The quantized notes are aligned with the beats in the accompaniment audio, and the corresponding syllable segments are then compressed or stretched according to the time length indicated by each note.
During compression or stretching, only the vowel part of the syllable segment is processed so that the time length of the whole syllable segment matches the time length indicated by the note; in this way the duration of the syllable segment can be changed without changing its pitch, and distortion is avoided.
The processed syllable segments are then spliced together to form the target human voice audio.
In this embodiment, step S606 includes the following sub-steps:
substep S6061: and acquiring time information of a video clip to be subjected to voice matching in the video data.
In a specific implementation, each image frame in the video data may be subjected to image recognition processing through a trained neural network model, the video data is divided into a plurality of video segments according to an image recognition result, and time information of each video segment is acquired.
Or, the video data may be manually segmented into a plurality of video segments to be subjected to voice matching in advance, and the time information of each video segment may be acquired.
Substep S6062: and according to the beat information in the accompaniment audio and the time information of the video clips, stretching or compressing the video clips corresponding to the beats in the accompaniment audio to generate video data containing the processed video clips.
Taking the generation of an MV for a rap song as an example, the video clips corresponding to the beats in the accompaniment audio can be determined according to the beat information in the accompaniment audio and the time information of the video clips. If the duration of a video clip is greater than the duration of the corresponding beat, the video clip is compressed, that is, some image frames in the video clip are deleted or the display time of at least some image frames is reduced; conversely, if the duration of the video clip is less than the duration of the corresponding beat, the video clip is stretched, that is, image frames are added to the video clip or the display time of at least some image frames is increased.
And then, splicing the processed video clips according to a time sequence to generate video data containing the processed video clips.
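As an illustrative sketch only, stretching or compressing a video clip by duplicating or dropping frames so that it spans a target duration could be done as follows; the frame rate and durations are assumptions.

```python
def resample_clip(frames, fps, target_dur):
    """Stretch or compress a clip by duplicating or dropping frames."""
    target_count = max(int(round(target_dur * fps)), 1)
    step = len(frames) / target_count                    # > 1 drops, < 1 duplicates
    return [frames[min(int(i * step), len(frames) - 1)] for i in range(target_count)]

# e.g. squeeze a 90-frame clip (3 s at 30 fps) into a 2-second beat span
clip = list(range(90))
print(len(resample_clip(clip, 30, 2.0)))  # -> 60
```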
Substep S6063: the target human voice audio, the accompaniment audio and the video data containing the processed video clips are synthesized to generate the multimedia data.

With this embodiment, the human voice audio to be processed and the video data are processed with the beat information of the accompaniment audio as the reference, so as to obtain the target human voice audio matched with the beats of the accompaniment audio and the video data containing the processed video clips; the target human voice audio, the accompaniment audio and the video data containing the processed video clips are then synthesized to generate the multimedia data, so that the synthesized multimedia data is more natural and authentic.
Example eight
Referring to fig. 9a, a flowchart illustrating steps of a multimedia data processing method according to an eighth embodiment of the present invention is shown.
In this embodiment, the multimedia data processing method includes the steps of:
step S902: audio data containing human voice audio and the data to be synthesized are acquired according to a trigger operation.
In this embodiment, the multimedia data processing method is executed by the terminal device, and the multimedia data is generated by the terminal device based on the audio data. Of course, in other embodiments, the multimedia data may also be generated by the server (the server includes a server and/or a cloud) according to the audio data.
As shown in fig. 9b, the multimedia data processing method is described by taking as an example the case where it is executed by an embedded module of a multimedia playing application on the terminal device. In this case, the terminal device may display an interface as shown in interface 1. When a trigger operation of the user on the option for adding human voice audio in interface 1 is received, a recording device on the terminal device is invoked to collect the user's voice and generate the audio data containing the human voice audio. Alternatively, when the trigger operation on the option for adding human voice audio in interface 1 is received, the storage space of the terminal device is read, and the audio data containing human voice audio selected by the user is acquired.
The data to be synthesized includes accompanying audio and/or video data without audio.
When a trigger operation of the user on the option for adding data to be synthesized in interface 2 of fig. 9b is received, the candidate accompaniment audio and/or candidate video data are displayed, and the accompaniment audio and/or video data selected by the user from the candidates are determined as the data to be synthesized.
After the audio data and the data to be synthesized are acquired, an interface shown as an interface 3 in fig. 9b is displayed.
Step S904: and acquiring multimedia data generated according to the audio data.
The multimedia data may be generated by a composition module. The synthesis module may be configured locally at the terminal device, and the audio data may be processed locally at the terminal device to generate the multimedia data. Or the synthesis module is configured at the server, and the terminal device sends the audio data to the synthesis module at the server, and the audio data is processed by the synthesis module to generate the multimedia data.
Wherein the multimedia data is generated by:
and identifying the human voice syllable information of the audio data, processing the human voice syllables indicated by the human voice syllable information according to the node information used for voice matching in the accompaniment audio and/or the audio-free video data to obtain the target human voice audio, and synthesizing at least one of the accompaniment audio and the audio-free video data with the target human voice audio to obtain the multimedia data.
Identifying the human voice syllable information of the audio data may be implemented in the manner of step S102 or step S602 in the foregoing embodiments, and is therefore not described again. The human voice syllable information may include the time information of the human voice syllables and, as needed, the consonant proportion information in the human voice syllables.
The accompaniment audio and/or the video data without audio (hereinafter referred to as video data) may be determined in advance, or may be determined by the terminal device according to the user's selection operation.
The node information for vocal matching in the accompaniment audio may be beat information, but is not limited thereto.
The node information for performing voice matching in the video data may be time information of a video clip to be subjected to voice matching, but is not limited thereto.
One skilled in the art can process the syllable segment corresponding to the human voice syllable indicated by the human voice syllable information according to the node information in any suitable manner (for example, quantization processing, stretching processing, compression processing, etc.) as long as the target human voice audio matching the node information can be obtained.
After the target vocal audio is obtained, at least one of the accompaniment audio and the video data is synthesized with the target vocal audio to generate the multimedia data.
Of course, in other embodiments, the multimedia data may be generated by using the multimedia data generation method described in any one of the foregoing embodiments one to seven.
Step S906: and providing the multimedia data on a display interface.
After the multimedia data is acquired, the terminal device may display an interface shown as interface 4 in fig. 9b to provide the multimedia data to the user, so that the user can view it or perform other operations. Optionally, in a specific usage scenario, the multimedia data processing method is executed by an embedded module in a social application on the terminal device. Of course, in other usage scenarios, the application may be any other suitable application.
Taking the social application as an example, the multimedia data received from other users can be displayed in the chat window of the social application. The multimedia data may include at least one of audio data, video data, and image data. The video data may be with or without audio video data. A schematic diagram of an interface for displaying received multimedia data in a chat window is shown as interface 1 in fig. 9 c.
The user may trigger an editing interface (as shown in interface 2 in fig. 9 c) for processing the received multimedia data by, for example, pressing the received multimedia data for a long time. Through the editing interface, a user can utilize the embedded module to edit the received multimedia data, so that edited multimedia data are generated, and the edited multimedia data are displayed.
For example, if the received multimedia data is a motion picture without audio, the user may execute the aforementioned multimedia data processing method through the embedded module to generate a dubbed motion picture. If the received multimedia data is video data of someone singing, the user may execute the multimedia data processing method through the embedded module to generate an antiphonal-singing (duet) video, and so on.
In the present usage scenario, as shown in interface 2 in fig. 9c, the editing interface includes a first area for presenting editing function options, a second area for presenting the received multimedia data, and a third area for previewing the multimedia data edited by the user.
One or more edit function options are configured in the first region, the edit function options including, but not limited to, a first option for audio and a second option for image frames in the video.
Wherein, the first options for audio include, but are not limited to, "dubbing options", "sound effects options", and "style options", etc.
The dubbing option is used to replace the audio in the audio-video data with new audio or add new audio to the non-audio-video data, audio data, and image data.
The sound effect option is used to add sound effects to the audio, for example electronic-music, echo, and voice-changing effects.
The style options are used to process the audio into a corresponding style, such as a heavy metal style, a rock style, a national style, and so on.
The second options for image frames in the video include, but are not limited to, "video option," "filter option," and "map option," among others.
The video option is used to add new video or image frames.
The filter option is for adding filters in one or more image frames in the video, where the filters include blur filters, color filters, and the like.
The map option is used to add a map in one or more image frames in the video.
In this way, the user can select a desired style as required when editing the received multimedia data, which on the one hand improves interactivity and on the other hand makes it easier for the user to edit the multimedia data into what is desired.
It should be noted that the edit function options are not limited to the exemplified options, and may include other options.
The following describes the process in which the embedded module executes the multimedia data processing method, taking the generation of an antiphonal-singing (duet) video as an example.
When the user long-presses the received multimedia data shown in interface 1 of fig. 9c, candidate options such as a "duet video option" and a "dubbed motion picture option" may be displayed in the interface for the user to select. The corresponding editing interface (as shown in interface 2 in fig. 9c) is then displayed according to the user's selection.
Taking the generation of the duet video as an example, the received multimedia data (in this usage scenario, a video of someone singing a song) is displayed in the second area of the editing interface. If the user triggers the dubbing option, the user may add audio by recording sound in real time, uploading recorded audio, or inputting text to be converted into audio in real time. Of course, the user may also add a new video by triggering the video option; if the user does not add a new video, the video in the received multimedia data is the one edited.
The embedded module executes a multimedia data processing method according to the acquired audio and/or video added by the user, acquires the synthesized multimedia data (i.e. edited multimedia data), and displays the synthesized multimedia data in a third area in real time.
In the process, the user can also trigger sound effect options and/or style options, and the embedded module adds required sound effects and/or adds required styles in the audio according to the triggering operation of the user. And/or the user can also trigger a filter option and/or a mapping option, and the embedded module adds a filter effect and/or maps in the video according to the triggering operation of the user, and the like.
After the edited multimedia data required by the user is generated, the received multimedia data and the edited multimedia data can be combined to generate the duet video. The generated duet video can be displayed in the interface of the chat window, as shown in interface 3 in fig. 9c. It should be noted that the interface for displaying the duet video is not limited to the chat window; the video may also be displayed in any other suitable manner, which is not limited in this embodiment.
The process of generating a dubbed motion picture is similar to the process of generating the duet video, except that the step of combining the received multimedia data with the edited multimedia data may be omitted, and the details are therefore not repeated here.
By the embodiment, the audio data containing the human voice audio can be collected, the real and natural multimedia data meeting the requirements can be obtained by processing the audio data, and the multimedia data can be provided for the user. Therefore, the automatic generation of the multimedia data is realized, and the authenticity and naturalness of the generated multimedia data can be ensured.
Example nine
Referring to fig. 10, a block diagram of a multimedia data generating apparatus according to a ninth embodiment of the present invention is shown.
The multimedia data generation apparatus of the present embodiment includes: a first analysis module 1002, configured to perform energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determine the human voice syllable information in the human voice audio to be processed according to the analysis result; a first target vocal generation module 1004, configured to process the vocal syllables indicated by the vocal syllable information according to the tempo information of the accompaniment audio to generate a target vocal audio matched with the accompaniment audio; and a first synthesizing module 1006, configured to synthesize the target vocal audio and the accompaniment audio to generate multimedia data.
Optionally, the first synthesis module 1006 comprises: a sound effect processing module 10061, configured to perform sound effect adding processing on the target human voice audio and the accompaniment audio respectively; the multimedia data generating module 10062 is configured to synthesize the target human voice audio and the accompaniment audio after the sound effect is added, so as to generate the multimedia data.
Optionally, the sound effect processing module 10061 is configured to obtain first pause time information, where a pause duration in the target human voice audio is greater than a first preset duration; determining at least one corresponding first audio clip from the target human voice audio and at least one corresponding second audio clip from the accompaniment audio according to the time information of the first pause; and respectively carrying out sound effect adding processing on the first audio clip and the second audio clip to generate target human voice audio and accompaniment audio after the corresponding sound effect is added.
Optionally, when the sound effect processing module 10061 performs sound effect adding processing on the first audio clip and the second audio clip respectively to generate a target vocal audio and an accompaniment audio with corresponding added sound effects, according to a preset first sound effect algorithm, performing delay processing on each first audio clip to obtain a first effect clip corresponding to each first audio clip, and replacing each first audio clip in the target vocal audio with the corresponding first effect clip; and carrying out mute processing on each second audio clip according to a preset second sound effect algorithm, acquiring a second effect clip corresponding to each second audio clip, and replacing each second audio clip in the accompaniment audio with the corresponding second effect clip.
Optionally, the sound effect processing module 10061 is further configured to obtain second pause time information, where a pause time duration in the target human voice audio is less than or equal to a first preset time duration and greater than a second preset time duration; determining at least one corresponding third audio clip from the accompaniment audio according to the time information of the second pause; and performing sound effect adding processing on each third audio clip to obtain corresponding accompaniment audio.
Optionally, when the sound effect processing module 10061 performs sound effect adding processing on each third audio clip to obtain corresponding accompaniment audio, perform audio reciprocating processing on each third audio clip to obtain corresponding accompaniment audio; and/or, carrying out playing speed increasing processing on each third audio clip to obtain corresponding accompaniment audio; and/or carrying out play deceleration processing on each third audio clip to obtain corresponding accompaniment audio.
Optionally, the multimedia data generating module 10062 is configured to calculate a loudness of the target human voice audio and a loudness of the accompaniment audio after the sound effect is added, and perform loudness gain processing on the target human voice audio and the accompaniment audio respectively according to the loudness of the target human voice audio and the loudness of the accompaniment audio; superposing the target human voice audio subjected to gain processing with the accompaniment audio to obtain a synthetic audio; and generating the multimedia data according to the synthesized audio.
Optionally, the multimedia data generating module 10062 is configured to, when the multimedia data is generated according to the synthesized audio, perform total gain processing on the synthesized audio, and generate the multimedia data according to the synthesized audio after the total gain.
Optionally, the first analysis module 1002 comprises: a sequence generating module 10021, configured to analyze the energy spectrum of the frequency spectrum of the human voice audio to be processed to obtain a first sequence indicating each human voice syllable in the human voice audio to be processed, and to analyze the difference spectrum of the frequency spectrum of the human voice audio to obtain a second sequence indicating each human voice syllable in the human voice audio to be processed; and a determining module 10022, configured to determine the human voice syllable information in the human voice audio to be processed according to the first sequence and the second sequence.
Optionally, the frequency spectrum of the human voice audio to be processed is a mel frequency spectrum.
Optionally, the human voice syllable information includes time information of human voice syllables and consonant proportion information of the human voice syllables.
Optionally, the first target human voice generating module 1004 includes: a segmenting module 10041, configured to perform audio segmentation processing on the human voice audio according to the time information of each human voice syllable in the human voice syllable information, and obtain a syllable segment corresponding to each human voice syllable; a quantization module 10042, configured to perform quantization processing on the syllable segments of the human voice syllables and quantize them into corresponding musical notes; and an a cappella generating module 10043, configured to generate a target vocal audio matched with the accompaniment audio according to the musical notes corresponding to the syllable segments, the consonant proportion information corresponding to the syllable segments, and the tempo information of the accompaniment audio.
Optionally, the a cappella generating module 10043 is configured to determine a consonant part and a vowel part in the syllable segment according to the consonant proportion information; stretch or compress the vowel part in the syllable segment according to the rhythm information of the accompaniment audio, so that the time length of the vowel part matches the time length indicated by the musical note corresponding to the syllable segment; and generate a cappella audio matched with the accompaniment audio according to the processed syllable segments.
The multimedia data generating device of this embodiment is used to implement the corresponding multimedia data generating method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the multimedia data generating device of this embodiment can refer to the description of the corresponding part in the foregoing method embodiments, and is not repeated herein.
Example ten
Referring to fig. 11, a block diagram of a multimedia data generating apparatus according to a tenth embodiment of the present invention is shown.
In this embodiment, the multimedia data generating apparatus includes: a second analysis module 1102, configured to perform energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determine the human voice syllable information in the human voice audio to be processed according to the analysis result; a second target voice generation module 1104, configured to acquire node information used for voice matching in the accompaniment audio and/or the audio-free video data, and process the voice syllables indicated by the voice syllable information according to the node information to generate a target voice audio; and a second synthesizing module 1106, configured to synthesize at least one of the accompaniment audio and the video data with the target voice audio to generate multimedia data.
Optionally, the vocal syllable information includes time information of vocal syllables; the second target human sound generation module 1104 includes: the video processing module 11041 is configured to obtain time information of a video segment to be subjected to voice matching in the video data; a syllable segment extraction module 11042, configured to perform audio segmentation processing on the voice audio according to time information of each voice syllable in the voice syllable information, to obtain a syllable segment corresponding to each voice syllable; and a voice syllable processing module 11043, configured to perform stretching processing or compressing processing on the syllable segments corresponding to the video segments according to the time information of the video segments and the time information of the voice syllables, so as to generate the target voice audio.
Optionally, the video processing module 11041 is configured to perform image recognition processing on at least a part of image frames in the video data, and determine, according to an image recognition result, a video segment containing continuous opening and closing of a mouth as a video segment to be subjected to voice matching; and acquiring time information of a video clip to be subjected to voice matching.
Optionally, the second target human voice generating module 1104 is configured to obtain tempo information in the accompaniment audio, and obtain syllable segments corresponding to the human voice syllables according to the human voice syllable information; and performing stretching processing or compressing processing on the syllable segments corresponding to the beats according to the beat information to generate the target human voice audio.
Optionally, the voice syllable information includes time information of voice syllables; the second target human voice generating module 1104 is configured to, when obtaining the syllable segments corresponding to the human voice syllables according to the human voice syllable information, perform audio segmentation processing on the human voice audio according to the time information of the human voice syllables in the human voice syllable information to obtain the syllable segments corresponding to the human voice syllables.
Optionally, the second synthesizing module 1106 is configured to obtain time information of a video segment to be subjected to voice matching in the video data; according to the beat information in the accompaniment audio and the time information of the video clips, stretching or compressing the video clips corresponding to the beats in the accompaniment audio to generate video data containing processed video clips; and synthesizing the target voice audio, the accompaniment audio and the video data containing the processed video clip to generate the multimedia data.
The multimedia data generating device of this embodiment is used to implement the corresponding multimedia data generating method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the multimedia data generating device of this embodiment can refer to the description of the corresponding part in the foregoing method embodiments, and is not repeated herein.
EXAMPLE eleven
Referring to fig. 12, a block diagram of a multimedia data processing apparatus according to an eleventh embodiment of the present invention is shown.
In this embodiment, the multimedia data processing apparatus includes: the acquisition module 1202 is used for acquiring audio data containing human voice audio according to triggering operation; a multimedia obtaining module 1204, configured to obtain multimedia data generated according to the audio data, where the multimedia data is obtained by identifying vocal syllable information of the audio data and processing a vocal syllable indicated by the vocal syllable information according to node information for performing vocal matching in accompaniment audio and/or audio-free video data to obtain target vocal audio, and synthesizing at least one of the accompaniment audio and the audio-free video data with the target vocal audio; a display module 1206 for providing the multimedia data at a display interface.
The multimedia data processing apparatus of this embodiment is configured to implement the corresponding multimedia data processing method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the multimedia data processing apparatus of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not repeated herein.
Example twelve
Referring to fig. 13, a schematic structural diagram of an electronic device according to a twelfth embodiment of the present invention is shown, and the specific embodiment of the present invention does not limit the specific implementation of the electronic device.
As shown in fig. 13, the electronic device may include: a processor (processor)1302, a communication Interface (Communications Interface)1304, a memory (memory)1306, and a communication bus 1308.
Wherein:
the processor 1302, communication interface 1304, and memory 1306 communicate with each other via a communication bus 1308.
A communication interface 1304 for communicating with other electronic devices such as a terminal device or a server.
The processor 1302 is configured to execute the program 1310, and may specifically execute the relevant steps in the above-described multimedia data generation method embodiment.
In particular, the program 1310 may include program code that includes computer operating instructions.
The processor 1302 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement an embodiment of the invention. The electronic device comprises one or more processors, which can be the same type of processor, such as one or more CPUs; or may be different types of processors such as one or more CPUs and one or more ASICs.
A memory 1306 for storing a program 1310. Memory 1306 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 1310 may specifically be configured to cause the processor 1302 to perform the following operations: carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result; processing the voice syllables indicated by the voice syllable information according to the beat information of the accompaniment audio to generate target voice audio matched with the accompaniment audio; and synthesizing the target voice audio and the accompaniment audio to generate multimedia data.
In an alternative embodiment, the program 1310 is further configured to enable the processor 1302 to perform an additional sound effect process on the target vocal audio and the accompaniment audio respectively when the target vocal audio and the accompaniment audio are synthesized to generate multimedia data; and synthesizing the target human voice audio and the accompaniment audio after the sound effect is added to generate the multimedia data.
In an optional embodiment, the program 1310 is further configured to enable the processor 1302 to obtain time information of a first pause in the target human voice audio, where a pause duration of the time information is greater than a first preset duration, when the sound effect adding processing is performed on the target human voice audio and the accompaniment audio respectively; determining at least one corresponding first audio clip from the target human voice audio and at least one corresponding second audio clip from the accompaniment audio according to the time information of the first pause; and respectively carrying out sound effect adding processing on the first audio clip and the second audio clip to generate target human voice audio and accompaniment audio after the corresponding sound effect is added.
In an optional implementation manner, the program 1310 is further configured to enable the processor 1302, when performing sound effect adding processing on the first audio clip and the second audio clip respectively to generate corresponding target human voice audio and accompaniment audio after the sound effect processing, perform delay processing on each first audio clip according to a preset first sound effect algorithm to obtain a first effect clip corresponding to each first audio clip, and replace each first audio clip in the target human voice audio with the corresponding first effect clip; and carrying out mute processing on each second audio clip according to a preset second sound effect algorithm, acquiring a second effect clip corresponding to each second audio clip, and replacing each second audio clip in the accompaniment audio with the corresponding second effect clip.
In an optional implementation manner, the program 1310 is further configured to enable the processor 1302 to obtain second pause time information, where a pause duration of the target human voice audio is less than or equal to a first preset duration and greater than a second preset duration, when the target human voice audio and the accompaniment audio are subjected to sound effect adding processing respectively; determining at least one corresponding third audio clip from the accompaniment audio according to the time information of the second pause; and performing sound effect adding processing on each third audio clip to obtain corresponding accompaniment audio.
In an alternative embodiment, the program 1310 is further configured to enable the processor 1302 to perform audio reciprocating processing on each third audio clip to obtain corresponding accompaniment audio when performing sound effect adding processing on each third audio clip to obtain corresponding accompaniment audio; and/or, carrying out playing speed increasing processing on each third audio clip to obtain corresponding accompaniment audio; and/or carrying out play deceleration processing on each third audio clip to obtain corresponding accompaniment audio.
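A sketch of the three optional treatments for those accompaniment clips, under the assumption that "audio reciprocating" means reverse playback and that a phase-vocoder time stretch is one acceptable way to change playback speed; the stretch factors are example values.

```python
# Illustrative reverse / speed-up / slow-down treatments for a third audio clip.
import numpy as np
import librosa

def reverse_clip(clip):
    return clip[::-1]                                          # play the clip backwards

def speed_up(clip, factor=1.5):
    return librosa.effects.time_stretch(y=clip, rate=factor)   # shorter, faster playback

def slow_down(clip, factor=0.75):
    return librosa.effects.time_stretch(y=clip, rate=factor)   # longer, slower playback

clip = np.sin(2 * np.pi * 220 * np.arange(22050) / 22050).astype(np.float32)
print(len(reverse_clip(clip)), len(speed_up(clip)), len(slow_down(clip)))
```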
In an alternative embodiment, the program 1310 is further configured to enable the processor 1302 to calculate the loudness of the target human voice audio and the loudness of the accompaniment audio after the sound effect addition when the processed target human voice audio and the accompaniment audio are synthesized to generate the multimedia data, and perform loudness gain processing on the target human voice audio and the accompaniment audio according to the loudness of the target human voice audio and the loudness of the accompaniment audio respectively; superposing the target human voice audio subjected to gain processing with the accompaniment audio to obtain a synthetic audio; and generating the multimedia data according to the synthesized audio.
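The loudness measure and gain rule are not pinned down here, so the sketch below uses plain RMS and an arbitrary target offset between vocal and accompaniment as stand-ins; mix_with_gain and its parameters are hypothetical names for illustration only. It also folds in the total-gain step of the next paragraph as a simple peak normalisation.

```python
# Illustrative loudness-matched superposition of target vocal and accompaniment.
import numpy as np

def rms(y):
    return float(np.sqrt(np.mean(np.square(y)))) + 1e-12

def mix_with_gain(vocal, accomp, vocal_to_accomp_db=3.0):
    n = min(len(vocal), len(accomp))
    vocal, accomp = vocal[:n], accomp[:n]
    # Gain the vocal so it sits the requested number of dB above the accompaniment.
    gain = rms(accomp) / rms(vocal) * (10 ** (vocal_to_accomp_db / 20))
    mix = gain * vocal + accomp
    # Total gain: normalise the superposed audio so it does not clip.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

v = np.random.randn(22050).astype(np.float32) * 0.1
a = np.random.randn(22050).astype(np.float32) * 0.3
print(rms(mix_with_gain(v, a)))
```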
In an alternative embodiment, program 1310 is further configured to cause processor 1302 to perform total gain processing on the synthesized audio when generating the multimedia data according to the synthesized audio, and generate the multimedia data according to the synthesized audio after total gain.
In an alternative embodiment, the program 1310 is further configured to cause the processor 1302 to analyze the energy spectrum of the frequency spectrum of the human voice audio to be processed to obtain a first sequence indicating each human voice syllable in the human voice audio to be processed when performing energy spectrum and differential spectrum analysis on the obtained human voice audio to be processed and determining human voice syllable information in the human voice audio to be processed according to the analysis result; analyzing a difference spectrum of the frequency spectrum of the human voice audio to be processed to obtain a second sequence used for indicating each human voice syllable in the human voice audio to be processed; and determining the voice syllable information in the voice audio according to the first sequence and the second sequence.
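A sketch of the two analyses under stated assumptions: the first sequence is taken as a per-frame energy curve of the Mel spectrum, the second as a positive spectral difference (flux) curve, and the fusion rule shown is a naive placeholder, since the document does not spell out how the two sequences are combined. The input file name and Mel parameters are example values.

```python
# Illustrative energy-sequence / difference-sequence analysis on a Mel spectrum.
import numpy as np
import librosa

y, sr = librosa.load("vocal_take.wav", sr=22050, mono=True)     # hypothetical input file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=512)
log_S = librosa.power_to_db(S)

energy_seq = log_S.mean(axis=0)                                  # first sequence: per-frame energy
diff = np.diff(log_S, axis=1)
flux_seq = np.concatenate([[0.0], np.maximum(diff, 0.0).sum(axis=0)])  # second sequence: positive difference

# Naive fusion: a frame is a syllable-onset candidate if the flux is high while energy rises.
candidates = np.where((flux_seq > flux_seq.mean() + flux_seq.std()) &
                      (np.gradient(energy_seq) > 0))[0]
print(librosa.frames_to_time(candidates, sr=sr, hop_length=512)[:10])
```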
In an alternative embodiment, the frequency spectrum of the human voice audio to be processed is a mel frequency spectrum.
In an alternative embodiment, the human syllable information includes time information of a human syllable and consonant proportion information of the human syllable.
In an alternative embodiment, the program 1310 is further configured to, when the processor 1302 processes the human voice syllables indicated by the human voice syllable information according to the tempo information of the accompaniment audio to generate the target human voice audio matching the accompaniment audio, perform audio segmentation on the human voice audio according to the time information of each human voice syllable in the human voice syllable information to obtain a syllable segment corresponding to each human voice syllable; quantize the syllable segment of each human voice syllable so as to quantize the human voice syllable into a corresponding musical note; and generate the target human voice audio matched with the accompaniment audio according to the musical notes corresponding to the syllable segments, the consonant proportion information corresponding to the syllable segments and the rhythm information of the accompaniment audio.
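A minimal sketch of quantising syllable segments onto the accompaniment's beat grid: each syllable start is snapped to the nearest grid subdivision and receives the grid step as its note duration. The sixteenth-note resolution and the snapping rule are example assumptions; quantize_syllables is a hypothetical helper name.

```python
# Illustrative quantisation of syllable onsets to a beat grid.
def quantize_syllables(syllable_starts_s, tempo_bpm, subdivisions=4):
    step = 60.0 / tempo_bpm / subdivisions          # one grid cell (e.g. a 16th note) in seconds
    notes = []
    for t in syllable_starts_s:
        snapped = round(t / step) * step            # snap the syllable start to the grid
        notes.append({"start": snapped, "duration": step})
    return notes

print(quantize_syllables([0.07, 0.58, 1.13], tempo_bpm=120))
```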
In an alternative embodiment, the program 1310 is further configured to cause the processor 1302 to determine a consonant part and a vowel part in each syllable segment according to the consonant proportion information when generating the target vocal audio matching the accompaniment audio according to the musical notes corresponding to the syllable segments, the consonant proportion information corresponding to the syllable segments, and the tempo information of the accompaniment audio; stretch or compress the vowel part in the syllable segment according to the rhythm information of the accompaniment audio so that the time length of the vowel part matches the time length indicated by the musical note corresponding to the audio segment; and generate an a cappella audio matched with the accompaniment audio according to the processed syllable segments.
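A sketch of the consonant/vowel treatment under stated assumptions: the consonant part is kept as-is and only the vowel part is time-stretched (here with librosa's phase vocoder, one possible implementation) so the whole segment fills the quantised note duration. fit_syllable_to_note and its parameters are hypothetical names for illustration.

```python
# Illustrative vowel-only time stretch so a syllable segment fills its note duration.
import numpy as np
import librosa

def fit_syllable_to_note(segment, sr, consonant_ratio, note_duration_s):
    split = int(len(segment) * consonant_ratio)
    consonant, vowel = segment[:split], segment[split:]
    target_vowel_len = max(int(note_duration_s * sr) - len(consonant), 1)
    rate = len(vowel) / target_vowel_len            # >1 compresses the vowel, <1 stretches it
    vowel_fitted = librosa.effects.time_stretch(y=vowel.astype(np.float32), rate=rate)
    return np.concatenate([consonant, vowel_fitted])

seg = np.sin(2 * np.pi * 200 * np.arange(6000) / 22050).astype(np.float32)
out = fit_syllable_to_note(seg, sr=22050, consonant_ratio=0.2, note_duration_s=0.5)
print(len(out) / 22050)   # roughly the note duration in seconds
```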
Alternatively, the program 1310 may specifically be configured to cause the processor 1302 to perform the following operations: carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result; acquiring node information for voice matching in accompaniment audio and/or audio-free video data, and processing voice syllables indicated by the voice syllable information according to the node information to generate target voice audio; and synthesizing at least one of the accompaniment audio and the video data with the target voice audio to generate multimedia data.
In an alternative embodiment, the human syllable information includes time information of human syllables; the program 1310 is further configured to enable the processor 1302 to obtain node information for performing voice matching in the video data with accompaniment audio and/or no audio, and to obtain time information of a video segment to be subjected to voice matching in the video data when a voice syllable indicated by the voice syllable information is processed according to the node information to generate a target voice audio; according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable; and according to the time information of the video segments and the time information of the syllables of the human voice, stretching or compressing the syllable segments corresponding to the video segments to generate the target human voice audio.
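A sketch of fitting syllable segments to video segments under simple assumptions: each segment is time-stretched or compressed to the duration of its video span and placed at that span's start time in an otherwise silent target vocal track. The one-to-one pairing of segments and spans, and the helper name build_target_vocal, are illustrative assumptions.

```python
# Illustrative construction of the target vocal from syllable segments and video spans.
import numpy as np
import librosa

def build_target_vocal(segments, sr, video_spans):
    """segments: list of 1-D arrays; video_spans: list of (start_s, end_s) to be voiced."""
    total = int(max(end for _, end in video_spans) * sr)
    out = np.zeros(total, dtype=np.float32)
    for seg, (t0, t1) in zip(segments, video_spans):
        rate = len(seg) / ((t1 - t0) * sr)                       # stretch (<1) or compress (>1)
        fitted = librosa.effects.time_stretch(y=seg, rate=rate)
        i0 = int(t0 * sr)
        out[i0:i0 + len(fitted)] += fitted[: total - i0]
    return out

segs = [np.sin(2 * np.pi * f * np.arange(8000) / 22050).astype(np.float32) for f in (200, 250)]
target = build_target_vocal(segs, sr=22050, video_spans=[(0.0, 0.6), (1.0, 1.4)])
print(len(target) / 22050)   # 1.4 s of target vocal audio
```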
In an optional implementation manner, the program 1310 is further configured to enable the processor 1302 to perform image recognition processing on at least a part of image frames in the video data when the time information of a video segment to be subjected to voice matching in the video data is obtained, and determine, according to an image recognition result, a video segment containing continuous opening and closing of a mouth as the video segment to be subjected to voice matching; and acquiring time information of a video clip to be subjected to voice matching.
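A sketch of one way to pick out video segments with continuous mouth opening and closing. It assumes some face-landmark detector has already produced a per-frame mouth-openness value (for example, the distance between inner-lip landmarks); that detector, the variance threshold, and the minimum segment length are all assumptions, not the patent's image recognition method.

```python
# Illustrative detection of continuous mouth-movement segments from per-frame openness values.
import numpy as np

def mouth_movement_segments(openness_per_frame, fps, min_len_s=0.5, var_threshold=1e-3):
    """Return (start_s, end_s) spans where mouth openness keeps changing."""
    o = np.asarray(openness_per_frame, dtype=np.float32)
    win = max(int(0.25 * fps), 1)
    moving = np.array([o[max(0, i - win):i + win].var() > var_threshold for i in range(len(o))])
    spans, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i
        elif not m and start is not None:
            if (i - start) / fps >= min_len_s:
                spans.append((start / fps, i / fps))
            start = None
    if start is not None and (len(o) - start) / fps >= min_len_s:
        spans.append((start / fps, len(o) / fps))
    return spans

fake_openness = np.abs(np.sin(np.linspace(0, 20, 300)))   # pretend 300 frames at 25 fps
print(mouth_movement_segments(fake_openness, fps=25))
```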
In an alternative embodiment, the program 1310 is further configured to enable the processor 1302 to obtain node information for performing voice matching in the obtained accompaniment audio and/or the audio-free video data, obtain tempo information in the accompaniment audio when the voice syllables indicated by the voice syllable information are processed according to the node information to generate the target voice audio, and obtain syllable segments corresponding to the voice syllables according to the voice syllable information; and performing stretching processing or compressing processing on the syllable segments corresponding to the beats according to the beat information to generate the target human voice audio.
In an alternative embodiment, the voice syllable information comprises time information of voice syllables; the program 1310 is further configured to enable the processor 1302 to perform audio segmentation processing on the human voice audio according to the time information of each human voice syllable in the human voice syllable information when obtaining the syllable segment corresponding to each human voice syllable according to the human voice syllable information, so as to obtain the syllable segment corresponding to each human voice syllable.
In an alternative embodiment, the program 1310 is further configured to enable the processor 1302 to obtain time information of a video segment to be subjected to voice matching in the video data when at least one of the accompaniment audio and the video data is synthesized with the target voice audio to generate multimedia data; according to the beat information in the accompaniment audio and the time information of the video clips, stretching or compressing the video clips corresponding to the beats in the accompaniment audio to generate video data containing processed video clips; and synthesizing the target voice audio, the accompaniment audio and the video data containing the processed video clip to generate the multimedia data.
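A sketch of stretching or compressing a video clip to a beat interval by resampling its frame indices (dropping or duplicating frames). Real pipelines would decode and re-encode frames with a video library; here frames are represented only as indices, and fit_clip_to_beat is a hypothetical helper.

```python
# Illustrative fitting of a video clip to a beat interval by frame-index resampling.
import numpy as np

def fit_clip_to_beat(num_frames, fps, beat_interval_s):
    """Return the original frame indices to keep/repeat so the clip lasts
    beat_interval_s at the same fps."""
    target_frames = max(int(round(beat_interval_s * fps)), 1)
    idx = np.round(np.linspace(0, num_frames - 1, target_frames)).astype(int)
    return np.clip(idx, 0, num_frames - 1)

# A 30-frame clip (1.25 s at 24 fps) squeezed into a 1.0 s beat interval:
print(fit_clip_to_beat(num_frames=30, fps=24, beat_interval_s=1.0))
```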
Alternatively, the program 1310 may specifically be configured to cause the processor 1302 to perform the following operations: acquiring audio data containing human voice audio according to triggering operation; acquiring multimedia data generated according to the audio data, wherein the multimedia data is obtained by identifying voice syllable information of the audio data and processing voice syllables indicated by the voice syllable information according to node information used for voice matching in accompaniment audio and/or audio-free video data to obtain target voice audio and synthesizing at least one of the accompaniment audio and the audio-free video data with the target voice audio; and providing the multimedia data on a display interface.
For specific implementation of each step in the program 1310, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing multimedia data generation method embodiments, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
With the electronic device of this embodiment, when the human voice audio is analyzed to obtain the human voice syllable information, the energy spectrum and the difference spectrum of the frequency spectrum of the human voice audio are used together, so that the analysis is more comprehensive and the obtained human voice syllable information is more accurate. The human voice syllables indicated by the human voice syllable information can then be processed according to the beat information of the accompaniment audio to generate the target human voice audio, and the target human voice audio is synthesized with the accompaniment audio to generate the multimedia data. In this way, the rhythm of the human voice syllables in the target human voice audio is matched with the rhythm of the accompaniment audio, which solves the problem in the prior art that the user has to match the accompaniment rhythm by himself and must record the voice while the accompaniment is playing, thereby avoiding the restriction of the use environment and improving the applicability.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.
The above-described method according to an embodiment of the present invention may be implemented in hardware or firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the method described herein may be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA. It is understood that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the multimedia data generation methods described herein. Further, when a general-purpose computer accesses code for implementing the multimedia data generation method shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the multimedia data generation method shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The above embodiments are only for illustrating the embodiments of the present invention and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

Claims (25)

1. A method for generating multimedia data, comprising:
carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result;
processing the voice syllables indicated by the voice syllable information according to the beat information of the accompaniment audio to generate target voice audio matched with the accompaniment audio;
and synthesizing the target voice audio and the accompaniment audio to generate multimedia data.
2. The method of claim 1, wherein said synthesizing the target vocal audio and the accompaniment audio to generate multimedia data comprises:
respectively carrying out sound effect adding processing on the target human voice audio and the accompaniment audio;
and synthesizing the target human voice audio and the accompaniment audio after the sound effect is added to generate the multimedia data.
3. The method according to claim 2, wherein the adding sound effect processing for the target human voice audio and the accompaniment audio respectively comprises:
acquiring first pause time information with pause duration longer than a first preset duration in the target human voice audio;
determining at least one corresponding first audio clip from the target human voice audio and at least one corresponding second audio clip from the accompaniment audio according to the time information of the first pause;
and respectively carrying out sound effect adding processing on the first audio clip and the second audio clip to generate target human voice audio and accompaniment audio after the corresponding sound effect is added.
4. The method according to claim 3, wherein the adding sound effect processing to the first audio clip and the second audio clip respectively to generate corresponding sound effect added target human voice audio and accompaniment audio comprises:
according to a preset first sound effect algorithm, carrying out time delay processing on each first audio frequency segment to obtain a first effect segment corresponding to each first audio frequency segment, and replacing each first audio frequency segment in the target human voice audio frequency with the corresponding first effect segment;
and carrying out mute processing on each second audio clip according to a preset second sound effect algorithm, acquiring a second effect clip corresponding to each second audio clip, and replacing each second audio clip in the accompaniment audio with the corresponding second effect clip.
5. The method according to claim 3, wherein the adding sound effect processing is performed on the target human voice audio and the accompaniment audio respectively, and further comprising:
acquiring second pause time information of which the pause time length in the target human voice audio is less than or equal to the first preset time length and is greater than a second preset time length;
determining at least one corresponding third audio clip from the accompaniment audio according to the time information of the second pause;
and performing sound effect adding processing on each third audio clip to obtain corresponding accompaniment audio.
6. The method of claim 5, wherein said adding sound effect processing to each of said third audio clips to obtain corresponding accompaniment audio comprises:
performing audio reciprocating processing on each third audio clip to obtain corresponding accompaniment audio;
and/or,
carrying out playing speed increasing processing on each third audio clip to obtain corresponding accompaniment audio;
and/or,
and carrying out play deceleration processing on each third audio clip to obtain corresponding accompaniment audio.
7. The method of claim 2, wherein synthesizing the target human voice audio and the accompaniment audio with the added sound effect to generate the multimedia data comprises:
calculating the loudness of the target human voice audio and the loudness of the accompaniment audio after the sound effect is added, and performing loudness gain processing on the target human voice audio and the accompaniment audio respectively according to the loudness of the target human voice audio and the loudness of the accompaniment audio;
superposing the target human voice audio subjected to gain processing with the accompaniment audio to obtain a synthetic audio;
and generating the multimedia data according to the synthesized audio.
8. The method of claim 7, wherein generating the multimedia data from the synthesized audio comprises:
and carrying out total gain processing on the synthesized audio, and generating the multimedia data according to the synthesized audio after the total gain.
9. The method according to claim 1, wherein the performing energy spectrum and difference spectrum analysis on the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to the analysis result comprises:
analyzing an energy spectrum of a frequency spectrum of the human voice audio to be processed to obtain a first sequence for indicating each human voice syllable in the human voice audio to be processed; analyzing a difference spectrum of the frequency spectrum of the human voice audio to be processed to obtain a second sequence used for indicating each human voice syllable in the human voice audio to be processed;
and determining the voice syllable information in the voice audio to be processed according to the first sequence and the second sequence.
10. The method of claim 9, wherein the frequency spectrum of the human voice audio to be processed is a mel frequency spectrum.
11. The method of claim 1, wherein the vocal syllable information includes time information of a vocal syllable and consonant ratio information of the vocal syllable.
12. The method according to claim 11, wherein the processing the syllables indicated by the vocal syllable information according to the tempo information of the accompaniment audio to generate the target vocal audio matching the accompaniment audio comprises:
according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable;
quantizing the syllable segments of the voice syllables to obtain corresponding musical notes;
and generating target human voice audio matched with the accompaniment audio according to the musical notes corresponding to the syllable segments, the consonant proportion information corresponding to the syllable segments and the rhythm information of the accompaniment audio.
13. The method of claim 12, wherein generating the target vocal audio matching the accompaniment audio according to the musical note corresponding to each syllable segment, the consonant proportion information corresponding to each syllable segment, and the tempo information of the accompaniment audio comprises:
determining a consonant part and a vowel part in the syllable segment according to the consonant proportion information;
according to the rhythm information of the accompaniment audio, stretching or compressing the vowel part in the syllable segment to enable the time length of the vowel part to be matched with the time length indicated by the musical notes corresponding to the audio segment;
and generating an a cappella audio matched with the accompaniment audio according to the processed syllable segments.
14. A method for generating multimedia data, comprising:
carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the obtained human voice audio to be processed, and determining human voice syllable information in the human voice audio to be processed according to an analysis result;
acquiring node information for voice matching in accompaniment audio and/or audio-free video data, and processing voice syllables indicated by the voice syllable information according to the node information to generate target voice audio;
and synthesizing at least one of the accompaniment audio and the video data with the target voice audio to generate multimedia data.
15. The method of claim 14, wherein the vocal syllable information includes time information of vocal syllables;
the acquiring node information used for voice matching in the video data of the accompaniment audio and/or the non-audio, and processing the voice syllables indicated by the voice syllable information according to the node information to generate the target voice audio, includes:
acquiring time information of a video clip to be subjected to voice matching in the video data;
according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable;
and according to the time information of the video segments and the time information of the syllables of the human voice, stretching or compressing the syllable segments corresponding to the video segments to generate the target human voice audio.
16. The method according to claim 15, wherein the obtaining time information of a video segment to be voice-matched in the video data comprises:
performing image recognition processing on at least part of image frames in the video data, and determining a video clip containing continuous mouth shape opening and closing as a video clip to be subjected to voice matching according to an image recognition result;
and acquiring time information of a video clip to be subjected to voice matching.
17. The method according to claim 14, wherein the obtaining node information for voice matching in the video data with accompanying audio and/or without audio, and processing the voice syllables indicated by the voice syllable information according to the node information to generate the target voice audio comprises:
acquiring beat information in the accompaniment audio, and acquiring syllable segments corresponding to the voice syllables according to the voice syllable information;
and performing stretching processing or compressing processing on the syllable segments corresponding to the beats according to the beat information to generate the target human voice audio.
18. The method of claim 17, wherein the vocal syllable information includes time information of vocal syllables;
the obtaining of the syllable segments corresponding to the voice syllables according to the voice syllable information includes:
and according to the time information of each voice syllable in the voice syllable information, carrying out audio segmentation processing on the voice audio to obtain syllable segments corresponding to each voice syllable.
19. The method of claim 17, wherein synthesizing at least one of the accompanying audio and the video data with the target vocal audio to generate multimedia data comprises:
acquiring time information of a video clip to be subjected to voice matching in the video data;
according to the beat information in the accompaniment audio and the time information of the video clips, stretching or compressing the video clips corresponding to the beats in the accompaniment audio to generate video data containing processed video clips;
and synthesizing the target voice audio, the accompaniment audio and the video data containing the processed video clip to generate the multimedia data.
20. A method for processing multimedia data, comprising:
acquiring audio data containing human voice audio according to triggering operation;
acquiring multimedia data generated according to the audio data, wherein the multimedia data is obtained by identifying voice syllable information of the audio data and processing voice syllables indicated by the voice syllable information according to node information used for voice matching in accompaniment audio and/or audio-free video data to obtain target voice audio and synthesizing at least one of the accompaniment audio and the audio-free video data with the target voice audio;
and providing the multimedia data on a display interface.
21. A multimedia data generating apparatus, characterized by comprising:
the first analysis module is used for carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the acquired human voice audio to be processed and determining human voice syllable information in the human voice audio to be processed according to an analysis result;
the first target voice generation module is used for processing voice syllables indicated by the voice syllable information according to the beat information of the accompaniment audio so as to generate target voice audio matched with the accompaniment audio;
and the first synthesis module is used for synthesizing the target voice audio and the accompaniment audio to generate multimedia data.
22. A multimedia data generating apparatus, characterized by comprising:
the second analysis module is used for carrying out energy spectrum and difference spectrum analysis on the frequency spectrum of the acquired human voice audio to be processed and determining human voice syllable information in the human voice audio to be processed according to an analysis result;
the second target voice generation module is used for acquiring node information used for voice matching in accompaniment audio and/or audio-free video data, and processing voice syllables indicated by the voice syllable information according to the node information to generate target voice audio;
and the second synthesis module is used for synthesizing at least one of the accompaniment audio and the video data with the target voice audio to generate multimedia data.
23. A multimedia data processing apparatus, comprising:
the acquisition module is used for acquiring audio data containing human voice audio and data to be synthesized according to the triggering operation;
the multimedia acquisition module is used for acquiring multimedia data generated according to the audio data, wherein the multimedia data is obtained by identifying the voice syllable information of the audio data, processing the voice syllables indicated by the voice syllable information according to node information used for voice matching in the accompaniment audio and/or the audio-free video data to obtain target voice audio, and synthesizing at least one of the accompaniment audio and the audio-free video data with the target voice audio;
and the display module is used for providing the multimedia data on a display interface.
24. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the multimedia data generation method according to any one of claims 1 to 13, or execute the operation corresponding to the multimedia data generation method according to any one of claims 14 to 19, or execute the operation corresponding to the multimedia data processing method according to claim 20.
25. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements a multimedia data generation method as claimed in any one of claims 1 to 13, or implements a multimedia data generation method as claimed in any one of claims 14 to 19, or implements a multimedia data processing method as claimed in claim 20.
CN201911199131.6A 2019-11-29 2019-11-29 Multimedia data generation method and device, electronic equipment and computer storage medium Pending CN112885318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911199131.6A CN112885318A (en) 2019-11-29 2019-11-29 Multimedia data generation method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911199131.6A CN112885318A (en) 2019-11-29 2019-11-29 Multimedia data generation method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN112885318A true CN112885318A (en) 2021-06-01

Family

ID=76038518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911199131.6A Pending CN112885318A (en) 2019-11-29 2019-11-29 Multimedia data generation method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112885318A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590076A (en) * 2021-07-12 2021-11-02 杭州网易云音乐科技有限公司 Audio processing method and device
WO2023216999A1 (en) * 2022-05-07 2023-11-16 北京字跳网络技术有限公司 Audio processing method and apparatus, device and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11184489A (en) * 1997-03-10 1999-07-09 Shinji Murase Karaoke device and timing adjusting method
CN101399036A (en) * 2007-09-30 2009-04-01 三星电子株式会社 Device and method for conversing voice to be rap music
CN103578514A (en) * 2012-08-01 2014-02-12 北大方正集团有限公司 Song playing method and device
CN103915086A (en) * 2013-01-07 2014-07-09 华为技术有限公司 Information processing method, device and system
CN106328176A (en) * 2016-08-15 2017-01-11 广州酷狗计算机科技有限公司 Method and device for generating song audio
CN106782500A (en) * 2016-12-23 2017-05-31 电子科技大学 A kind of fusion feature parameter extracting method based on pitch period and MFCC
CN106782499A (en) * 2016-12-19 2017-05-31 苏州金峰物流设备有限公司 The feature extracting method of place name voice signal
CN108922506A (en) * 2018-06-29 2018-11-30 广州酷狗计算机科技有限公司 Song audio generation method, device and computer readable storage medium
CN109166566A (en) * 2018-08-27 2019-01-08 北京奥曼特奇科技有限公司 A kind of method and system for music intelligent accompaniment
CN109841201A (en) * 2017-11-28 2019-06-04 刘铸 System is sung based on the vocal music religion that real-time audio compares
CN110111813A (en) * 2019-04-29 2019-08-09 北京小唱科技有限公司 The method and device of rhythm detection
CN110491358A (en) * 2019-08-15 2019-11-22 广州酷狗计算机科技有限公司 Carry out method, apparatus, equipment, system and the storage medium of audio recording

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11184489A (en) * 1997-03-10 1999-07-09 Shinji Murase Karaoke device and timing adjusting method
CN101399036A (en) * 2007-09-30 2009-04-01 三星电子株式会社 Device and method for conversing voice to be rap music
CN103578514A (en) * 2012-08-01 2014-02-12 北大方正集团有限公司 Song playing method and device
CN103915086A (en) * 2013-01-07 2014-07-09 华为技术有限公司 Information processing method, device and system
CN106328176A (en) * 2016-08-15 2017-01-11 广州酷狗计算机科技有限公司 Method and device for generating song audio
CN106782499A (en) * 2016-12-19 2017-05-31 苏州金峰物流设备有限公司 The feature extracting method of place name voice signal
CN106782500A (en) * 2016-12-23 2017-05-31 电子科技大学 A kind of fusion feature parameter extracting method based on pitch period and MFCC
CN109841201A (en) * 2017-11-28 2019-06-04 刘铸 System is sung based on the vocal music religion that real-time audio compares
CN108922506A (en) * 2018-06-29 2018-11-30 广州酷狗计算机科技有限公司 Song audio generation method, device and computer readable storage medium
CN109166566A (en) * 2018-08-27 2019-01-08 北京奥曼特奇科技有限公司 A kind of method and system for music intelligent accompaniment
CN110111813A (en) * 2019-04-29 2019-08-09 北京小唱科技有限公司 The method and device of rhythm detection
CN110491358A (en) * 2019-08-15 2019-11-22 广州酷狗计算机科技有限公司 Carry out method, apparatus, equipment, system and the storage medium of audio recording

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Qixue; QIAN Shengyou; ZHAO Xinmin: "Hunan dialect recognition based on differential features and Gaussian mixture models", Computer Engineering and Applications, no. 35, 11 December 2009 (2009-12-11), pages 129 - 131 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590076A (en) * 2021-07-12 2021-11-02 杭州网易云音乐科技有限公司 Audio processing method and device
CN113590076B (en) * 2021-07-12 2024-03-29 杭州网易云音乐科技有限公司 Audio processing method and device
WO2023216999A1 (en) * 2022-05-07 2023-11-16 北京字跳网络技术有限公司 Audio processing method and apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN110503976B (en) Audio separation method and device, electronic equipment and storage medium
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
US5884267A (en) Automated speech alignment for image synthesis
JP6290858B2 (en) Computer processing method, apparatus, and computer program product for automatically converting input audio encoding of speech into output rhythmically harmonizing with target song
CN110675886B (en) Audio signal processing method, device, electronic equipment and storage medium
CN108010512B (en) Sound effect acquisition method and recording terminal
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN112289333A (en) Training method and device of voice enhancement model and voice enhancement method and device
KR20070020252A (en) Method of and system for modifying messages
CN108242238B (en) Audio file generation method and device and terminal equipment
JP2000511651A (en) Non-uniform time scaling of recorded audio signals
US20190378532A1 (en) Method and apparatus for dynamic modifying of the timbre of the voice by frequency shift of the formants of a spectral envelope
CN113542626B (en) Video dubbing method and device, computer equipment and storage medium
CN112216294A (en) Audio processing method and device, electronic equipment and storage medium
JPH04158397A (en) Voice quality converting system
CN112885318A (en) Multimedia data generation method and device, electronic equipment and computer storage medium
CN116189034A (en) Head posture driving method and device, equipment, medium and product thereof
Sodoyer et al. A study of lip movements during spontaneous dialog and its application to voice activity detection
CN113674723B (en) Audio processing method, computer equipment and readable storage medium
CN113948062B (en) Data conversion method and computer storage medium
CN107025902B (en) Data processing method and device
CN112235183B (en) Communication message processing method and device and instant communication client
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
JP3803302B2 (en) Video summarization device
Aso et al. Speakbysinging: Converting singing voices to speaking voices while retaining voice timbre

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination