CN116030785A - Method, system, device and readable storage medium for synthesizing rap audio and video - Google Patents

Method, system, device and readable storage medium for synthesizing rap audio and video

Info

Publication number
CN116030785A
CN116030785A (application CN202211727092.4A)
Authority
CN
China
Prior art keywords
target
rap
user
audio
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211727092.4A
Other languages
Chinese (zh)
Inventor
李倍源
李文生
蒋海波
于洋
黄玮文
简康达
卢安
张龄宇
王恒岩
马金龙
盘子圣
黎智鑫
黄祥康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Quwan Network Technology Co Ltd
Original Assignee
Guangzhou Quwan Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Quwan Network Technology Co Ltd filed Critical Guangzhou Quwan Network Technology Co Ltd
Priority to CN202211727092.4A priority Critical patent/CN116030785A/en
Publication of CN116030785A publication Critical patent/CN116030785A/en
Pending legal-status Critical Current

Landscapes

  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The method provided by the embodiment of the application imposes no requirement on the duration of the speech recorded by the user, can generate a rap single-song video in the user's timbre without a long wait after the user finishes recording, and can also perform quality inspection on the audio data recorded by the user, so that the user can be reminded to re-record when the recording is unqualified. This improves the efficiency of synthesizing rap songs matching the user's timbre, and the finally synthesized rap songs matching the user's timbre retain high musicality and listenability, including details such as rhythm and musical feel.

Description

Method, system, device and readable storage medium for synthesizing rap audio and video
Technical Field
The present disclosure relates to the field of audio and video data processing technologies, and in particular, to a method, a system, a device, and a readable storage medium for synthesizing a rap audio and video.
Background
In recent years, driven by rap variety shows, more and more young people have been attracted to rap; many are enthusiastic about rap-related entertainment and consumption, and interest in learning rap has grown. However, the technical difficulty of performing rap differs from that of ordinary singing. In practice, it has been found that even users who are familiar with the lyrics of a rap song find it difficult to perform it, yet many strongly wish to sing their favorite rap songs themselves.
Existing artificial intelligence song-generation software that imitates the user's timbre cannot meet users' demand for generating rap-style songs: the generated songs lack rap musicality such as rap rhythm and flow, and are unpleasant to listen to. In addition, the prior art takes a long time to synthesize a song; the user must first upload at least several complete songs as learning material and generally wait several hours for training to complete before a new song can be synthesized.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks. Accordingly, the present application provides a method, a system, a device and a readable storage medium for synthesizing rap audio and video, which are used for solving the technical drawback that it is difficult to synthesize rap songs in the prior art.
A method for synthesizing rap audio and video includes:
responding to the operation of selecting a target synthetic text by a user, and determining the target synthetic text selected by the user;
according to the target synthetic text selected by the user, responding to the operation of the user clicking a recording button, starting to record the audio data created by the user, and after the user finishes recording, acquiring the original recording data of the user as target recorded audio;
quality inspection is carried out on the target recorded audio to obtain quality inspection scoring results of the target recorded audio;
judging whether the quality inspection scoring result of the target recorded audio reaches the preset standard of synthesizing the rap audio and video;
if the quality inspection scoring result of the target recorded audio reaches the preset standard of synthesizing the rap audio and video, extracting the user voiceprint features corresponding to the target recorded audio;
converting the user's voiceprint features with a preset rap song template to obtain a target conversion result;
and mixing the target conversion result with preset accompaniment to obtain a target rap song matched with the tone of the user.
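For illustration only, the claimed steps could be sketched end to end as below. Every function name, threshold and data structure here is a hypothetical assumption for readability, not the patent's disclosed implementation.

```python
# Illustrative sketch of the claimed pipeline: quality inspection ->
# voiceprint extraction -> conversion with a template -> mixing.
# All names, thresholds and formulas are hypothetical assumptions.

QUALITY_THRESHOLD = 0.6  # assumed stand-in for the "preset standard"

def quality_score(samples):
    """Toy quality check: reward a healthy peak level, penalize clipping."""
    if not samples:
        return 0.0
    peak = max(abs(s) for s in samples)
    clipped = sum(1 for s in samples if abs(s) >= 0.99) / len(samples)
    return max(0.0, min(1.0, peak)) * (1.0 - clipped)

def extract_voiceprint(samples):
    """Stand-in for voiceprint (timbre) feature extraction."""
    n = len(samples)
    return [sum(samples) / n, sum(s * s for s in samples) / n]

def synthesize_rap(samples, template, accompaniment):
    """Run the claimed flow; return None when the user must re-record."""
    if quality_score(samples) < QUALITY_THRESHOLD:
        return None  # caller reminds the user to re-record
    voiceprint = extract_voiceprint(samples)
    converted = {"template": template, "voiceprint": voiceprint}
    return {"vocals": converted, "accompaniment": accompaniment}

song = synthesize_rap([0.8, -0.6, 0.5], "template_A", "beat_A")
```

A recording of pure silence would score 0.0 and be rejected, matching the claim's re-record branch.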
Preferably, the method further comprises:
if the quality inspection scoring result of the target recorded audio does not meet the preset standard of synthesizing the rap audio and video, reminding the user of re-recording the audio data, and acquiring the re-recorded audio data of the user as the target recorded audio after the user re-records the audio data;
and returning to the operation of performing quality inspection on the target recorded audio.
Preferably, after obtaining the target rap song matching the timbre of the user, the method further comprises:
randomly determining a background video sample from a preset background video library as a target background video, or, according to the user's selection, taking a background video sample determined from the preset background video library as the target background video;
and merging the target rap song with the target background video to obtain a target rap music video clip.
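Merging a song with a background video is commonly done with a muxing tool such as ffmpeg. The sketch below only builds the command line; the file paths are hypothetical and the flag choice (replace the video's audio track, keep its video stream untouched) is an assumption about one reasonable way to implement the claim, not the patent's method.

```python
def build_mux_command(song_path, video_path, out_path):
    """Build an ffmpeg command that replaces the background video's
    audio track with the synthesized rap song (a common mux pattern)."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # input 0: target background video
        "-i", song_path,    # input 1: target rap song
        "-map", "0:v:0",    # take the video stream from input 0
        "-map", "1:a:0",    # take the audio stream from input 1
        "-c:v", "copy",     # keep the video stream as-is (no re-encode)
        "-shortest",        # stop at the shorter of the two inputs
        out_path,
    ]

cmd = build_mux_command("rap_song.wav", "background.mp4", "rap_clip.mp4")
```

Running `subprocess.run(cmd, check=True)` would then produce the merged clip, assuming ffmpeg is installed.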
Preferably, the creating process of the preset rap song template includes:
collecting a rap song segment of a target rap singer;
separating the rap vocals and the accompaniment from the rap song segment of the target rap singer;
labeling the rap lyrics to obtain a lyric labeling result for the rap song segment of the target rap singer;
processing the lyric labeling result according to a preset format to obtain a lyric file corresponding to the lyric labeling result;
extracting semantic features of the vocal audio in the rap song segment of the target rap singer;
and combining the lyric file with the semantic features of the vocal audio in the rap song segment of the target rap singer to synthesize a rap song template.
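The final combination step above can be pictured as pairing time-aligned lyric labels with per-frame semantic features. The record layout below is a guess at what such a template might minimally contain; the field names are invented for illustration.

```python
def build_rap_template(lyric_labels, semantic_features):
    """Combine a lyric file (time-aligned labels) with semantic features
    extracted from the singer's vocals into one template record.
    The structure is a hypothetical sketch, not the patent's format."""
    assert len(lyric_labels) == len(semantic_features), \
        "each lyric label must align with one feature frame"
    return {
        "lyrics": lyric_labels,          # e.g. [(start_s, end_s, word), ...]
        "semantics": semantic_features,  # per-frame feature vectors
        "n_frames": len(semantic_features),
    }

template = build_rap_template(
    [(0.0, 0.4, "yo"), (0.4, 0.9, "check")],
    [[0.1, 0.9], [0.7, 0.3]],
)
```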
A system for synthesizing rap audio and video, applying the method for synthesizing rap audio and video described in any one of the above, the system comprising: a client and a server;
wherein,
the client responds to the user's operations of entering the artificial intelligence rap interface and clicking the recording button, records the user's target recorded audio as the user reads the rap lyrics, and uploads the target recorded audio to the server after the user clicks to end the recording;
the server detects the quality of the user's target recorded audio with a sound quality detection module to obtain a quality inspection scoring result for the target recorded audio, returns the quality inspection scoring result to the client, judges, according to the preset standard for synthesizing rap audio and video and the quality inspection scoring result, whether the target recorded audio reaches the preset standard for synthesizing rap audio and video, and, if it does, sends the target recorded audio to a rap synthesis service module of the server;
after receiving the target recorded audio, the rap synthesis service module performs noise reduction and voiceprint extraction on the target recorded audio; after obtaining the voiceprint feature vector corresponding to the user, it reads the semantic posterior probability features and fundamental frequency features in a preset rap song template, inputs these features together with the user's voiceprint feature vector into a preset conversion model and a vocoder for synthesis, obtains target rap audio corresponding to the user's timbre, and returns the target rap audio and related video materials to the client;
and the client renders the rap audio sung in the user's timbre according to the target rap audio and the related video materials returned by the rap synthesis service module of the server, and generates a target rap music video clip with the video materials returned by the server as the background.
Preferably, the system further comprises:
if the server determines that the quality inspection scoring result of the target recorded audio does not reach the preset standard for synthesizing rap audio and video, it sends feedback information to the client reminding the user to re-record the audio data;
and the client receives the feedback information sent by the server reminding the user to re-record the audio data, reminds the user to re-record, and, after the user finishes re-recording, acquires the user's re-recorded audio data as the target recorded audio and sends it to the server.
Preferably, the system further comprises an operational background,
the operation background is used for collecting target rap song segments of target rap singers, separating the rap vocals and the accompaniment in the target rap song segments with an accompaniment-vocal separation tool, labeling the rap lyrics in the target rap song segments, uniformly processing the rap lyrics into lyric files in a preset format as required, and sending the rap vocals and accompaniment in the target rap song segments together with the lyric files in the preset format to the server;
and the server receives the rap vocals and accompaniment in the target rap song segments and the lyric files in the preset format, and extracts semantic features from the lyric files in the preset format with a preset speech posterior probability model, so as to combine them with the rap vocals and accompaniment in the target rap song segments to synthesize a rap song template; the preset speech posterior probability model is obtained by training with the lyric files of training rap songs as training samples and the speaker-semantic features corresponding to the posterior probability of each speech category in each specific time frame of those lyric files as sample labels.
Preferably, the preset formats are a sound file format and a text document file format.
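The "sound file plus text document" pair of preset formats could be materialized as below. The choice of `.wav`/`.txt` extensions and the audio parameters (mono, 16-bit, 16 kHz) are illustrative assumptions; the patent does not name concrete formats.

```python
import os
import struct
import tempfile
import wave

def save_template_pair(name, samples, lyrics, directory):
    """Persist a rap song segment as the two preset formats the claim
    mentions: a sound file (.wav) and a text document (.txt).
    All parameters here are illustrative, not the patent's choices."""
    wav_path = os.path.join(directory, name + ".wav")
    txt_path = os.path.join(directory, name + ".txt")
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit PCM
        w.setframerate(16000)   # 16 kHz
        w.writeframes(b"".join(
            struct.pack("<h", int(s * 32767)) for s in samples))
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(lyrics)
    return wav_path, txt_path

out_dir = tempfile.mkdtemp()
wav_path, txt_path = save_template_pair(
    "demo", [0.0, 0.5, -0.5], "yo check", out_dir)
```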
A rap audio-video synthesizing apparatus comprising: one or more processors, and memory;
the memory has stored therein computer readable instructions which, when executed by the one or more processors, implement the steps of the method for synthesizing rap audio and video described in any of the preceding introduction.
A readable storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for synthesizing rap audio and video described in any of the preceding introduction.
According to the above technical solution, when a user needs to synthesize rap music with his or her own timbre, the method provided by the embodiment of the application can respond to the user's operation of selecting a target synthetic text and, after determining the selected target synthetic text, respond to the user clicking the recording button to start recording the user's audio data, acquiring the user's original recording data as the target recorded audio once recording is finished. Quality inspection is then performed on the target recorded audio to obtain a quality inspection scoring result, which is used to judge whether the target recorded audio reaches the preset standard for synthesizing rap audio and video. If it does, the target recorded audio can be used to synthesize the rap music the user wants: the user voiceprint features corresponding to the target recorded audio are extracted, the voiceprint features are converted with a preset rap song template to obtain a target conversion result, and finally the target conversion result is mixed with a preset accompaniment to obtain the target rap song matching the user's timbre.
According to the method provided by the embodiment of the application, a rap song matching the user's timbre can be synthesized from the audio data recorded by the user without requiring the user to rap: the user only needs to speak normally to complete the recording. The method imposes no requirement on the duration of the recorded speech, and the rap single-song video in the user's timbre can be generated without a long wait after recording finishes. Meanwhile, the quality of the recorded audio data is inspected so that the user can be reminded to re-record when it is unqualified, which improves the efficiency of synthesizing rap songs matching the user's timbre. The finally synthesized rap song retains high musicality and listenability, including details such as rhythm and musical feel.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments provided by the present application, and that other drawings may be obtained according to these drawings without any inventive effort for a person skilled in the art.
Fig. 1 is a system architecture diagram for implementing a rap audio/video synthesis according to an embodiment of the present application;
fig. 2 is a signaling flow chart for implementing a method for synthesizing a rap audio and video according to an embodiment of the present application;
fig. 3 is a flowchart of a method for implementing a rap audio/video synthesis according to an embodiment of the present application;
fig. 4 is a block diagram of a hardware structure of a rap audio/video synthesizer according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the prior art, generating an artificial intelligence song consumes a lot of time and cannot give the user immediate feedback. At present, only popular songs can be generated; the rhythm, flow and musical feel of rap-style songs cannot be reproduced.
In addition, in the prior art, when collecting the user's acoustic material, the user is often required to submit at least several completely sung songs for learning, training and analysis. Moreover, after songs are generated in the prior art, the matched background videos are mostly simple fixed animations or static pictures, and cannot be matched with highlight videos rich in detail.
In view of the fact that most current rap audio and video synthesis schemes are difficult to adapt to complex and changeable business requirements, the applicant has developed a rap audio and video synthesis scheme. The method can synthesize a rap song matching the user's timbre from the audio data recorded by the user without requiring the user to rap: the user only needs to speak normally to complete the recording. The method provided by the embodiment of the application imposes no requirement on the duration of the recorded speech, and the rap single-song video in the user's timbre can be generated without a long wait after the recording finishes. Meanwhile, the quality of the recorded audio data is inspected so that the user can be reminded to re-record when it is unqualified, which improves the efficiency of synthesizing rap songs matching the user's timbre; the finally synthesized rap song retains high musicality and listenability, including details such as rhythm and musical feel.
The methods provided by the embodiments of the present application may be used in a wide variety of general purpose or special purpose computing device environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments including any of the above devices, and the like.
The embodiment of the application provides a method for synthesizing a rap audio and video, which can be applied to various music editing systems or music management systems, and can also be applied to various computer terminals or intelligent terminals, wherein an execution subject can be a processor or a server of the computer terminal or the intelligent terminal.
A system architecture that may implement the synthesis of rap audio and video according to an embodiment of the present application is described below with reference to fig. 1. As shown in fig. 1, the system architecture may include a client, a server and an operation background. Based on the system architecture shown in fig. 1, fig. 2 shows an optional signaling flow of the method for synthesizing rap audio and video according to the embodiment of the present application. Referring to fig. 2, the interaction process among the client, the server and the operation background may be as follows:
the client can respond to the user's operations of entering the artificial intelligence rap interface and clicking the recording button, record the user's target recorded audio as the user reads the rap lyrics, and upload the target recorded audio to the server after the user clicks to end the recording;
The server may include a sound quality detection module and a rap synthesis service module.
In the practical application process, after receiving the target recorded audio, the server can use the sound quality detection module to detect the quality of the user's target recorded audio and obtain a quality inspection scoring result, which reflects whether the quality of the target recorded audio reaches the preset standard for synthesizing rap audio and video; the server can therefore return this result to the client once it is obtained. Meanwhile, the server can also judge, according to the preset standard for synthesizing rap audio and video combined with the quality inspection scoring result, whether the target recorded audio reaches that standard. If it does, the target recorded audio can be used to synthesize the rap audio and video, so after this determination the target recorded audio can be sent to the rap synthesis service module of the server;
after receiving the target recorded audio, the rap synthesis service module can perform noise reduction and voiceprint extraction on it; after obtaining the voiceprint feature vector corresponding to the user, it can read the semantic posterior probability features and fundamental frequency features in the preset rap song template, input these features together with the user's voiceprint feature vector into the preset conversion model and vocoder for synthesis, obtain target rap audio corresponding to the user's timbre, and return the target rap audio and related video materials to the client so that the client can synthesize the rap song matching the user's timbre.
After receiving the target rap audio and the related video materials returned by the rap synthesis service module of the server, the client can render the rap audio sung in the user's timbre and generate a target rap music video clip with the video materials returned by the server as the background.
If the server determines that the quality inspection scoring result of the target recorded audio does not reach the preset standard for synthesizing rap audio and video, it can send feedback information to the client reminding the user to re-record the audio data;
after receiving the feedback information sent by the server reminding the user to re-record the audio data, the client can parse the feedback information and remind the user to re-record; after the user finishes re-recording, the client can acquire the re-recorded audio data as the target recorded audio and send it to the server.
The operation background can collect target rap song segments of target rap singers, separate the rap vocals and accompaniment in the target rap song segments with an accompaniment-vocal separation tool, label the rap lyrics in the target rap song segments, uniformly process them into lyric files in the preset formats as required, and send the rap vocals and accompaniment in the target rap song segments together with the lyric files to the server;
Wherein, the preset formats are a sound file format and a text document file format.
After receiving the rap vocals and accompaniment in the target rap song segment and the lyric file in the preset format, the server can extract semantic features from the lyric file with a preset speech posterior probability model, so as to combine them with the rap vocals and accompaniment to synthesize a rap song template; the preset speech posterior probability model can be obtained by training with the lyric files of training rap songs as training samples and the speaker-semantic features corresponding to the posterior probability of each speech category in each specific time frame of those lyric files as sample labels.
The speech posterior probability is a matrix of time versus categories that may represent the posterior probability of each speech category for each particular time frame of a sentence. The posterior probability of an individual phoneme as a function of time is called the posterior trajectory.
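The time-versus-category matrix described above (often called a phonetic posteriorgram, PPG) can be illustrated with toy numbers. In a real system each row comes from an ASR acoustic model; the values below are made up purely to show the shape.

```python
def normalize_rows(matrix):
    """Make each time frame a probability distribution over speech classes."""
    return [[v / sum(row) for v in row] for row in matrix]

# Toy posteriorgram: 3 time frames x 2 phoneme classes.
# Real PPGs come from an acoustic model; these counts are illustrative.
ppg = normalize_rows([
    [8.0, 2.0],   # frame 0: mostly class 0
    [5.0, 5.0],   # frame 1: ambiguous
    [1.0, 9.0],   # frame 2: mostly class 1
])

# The posterior trajectory of phoneme class 1 is its column over time:
trajectory = [frame[1] for frame in ppg]
```

Each row sums to 1 (a distribution over classes at that instant), and the trajectory `[0.2, 0.5, 0.9]` shows class 1 becoming dominant over time, which is exactly the "posterior trajectory" notion in the text.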
In the practical application process, the acoustic features can be converted into speaker-independent features through an automatic speech recognition system.
According to the above technical solution, the rap audio and video synthesis system provided by the embodiment of the application can synthesize a rap song matching the user's timbre from the audio data recorded by the user without requiring the user to rap: the user only needs to speak normally to complete the recording. The system imposes no requirement on the duration of the recorded speech, and the rap single-song video in the user's timbre can be generated without a long wait after recording finishes. Meanwhile, the quality of the recorded audio data is inspected so that the user can be reminded to re-record when it is unqualified, which improves the efficiency of synthesizing rap songs matching the user's timbre. The finally synthesized rap song retains high musicality and listenability, including details such as rhythm and musical feel.
The rap audio synthesized by the rap audio and video synthesis system provided by the embodiment of the application can be matched with different eye-catching video backgrounds to synthesize rap music video clips with rich backgrounds. The background of the synthesized rap music video is not a simple looping animation or static picture, but a real highlight video with content, and the synthesized rap music clip is saved as a music video drawn frame by frame, so it is more personalized and more enjoyable to watch.
The following describes, in conjunction with fig. 3 and the rap audio and video synthesis system described above, the flow of the rap audio and video synthesis method provided in an embodiment of the present application. As shown in fig. 3, the flow may include the following steps:
step S101, responding to the operation of selecting the target synthesized text by the user, and determining the target synthesized text selected by the user.
Specifically, in the practical application process, when a user wants to record a rap song, the lyrics corresponding to the rap song to be recorded are generally determined first.
In order to meet different user requirements, synthetic texts formed from the lyrics of rap songs by different singers are generally stored in the server in advance.
The user can select different synthetic texts according to his or her own preference to record his or her own audio data.
Thus, the synthetic texts may include lyric texts corresponding to different types of rap songs.
When a user wants to record a rap song of his or her own, the method provided by the embodiment of the application can respond to the user's operation of selecting the target synthetic text and determine the selected text, so that the user's read-aloud audio can be recorded according to the target synthetic text.
Step S102, according to the target synthetic text selected by the user, responding to the user's operation of clicking the recording button, starting to record the audio data created by the user, and after the user finishes recording, acquiring the user's original recording data as the target recorded audio.
Specifically, as can be seen from the above description, the method provided by the embodiment of the present application may determine the target synthetic text, where the target synthetic text is a lyric text corresponding to a rap song that the user wants to record.
Therefore, after the target synthetic text is determined, recording of the audio data created by the user can be started in response to the user clicking the recording button, and after the user finishes recording, the user's original recording data can be acquired as the target recorded audio, so that the rap song the user wants can be synthesized from the target recorded audio.
Wherein,
the target recorded audio may carry the user's timbre characteristics.
In the practical application process, when recording the audio data, the user does not need to rap; the user only needs to speak normally and complete the recording according to the target synthetic text.
For example, after the user selects the target synthetic text, the user can click the recording button on the interface to start recording. After clicking the recording button, the user reads the selected target synthetic text aloud, and after finishing, clicks the end-recording button to end the recording task. The application can then acquire and save the user's original recording data as the target recorded audio.
Step S103, quality inspection is carried out on the target recorded audio to obtain quality inspection scoring results of the target recorded audio.
Specifically, after the target recorded audio is obtained: if noise was recorded, or the user did not read according to the target synthetic text while recording, the resulting target recorded audio can hardly be used to synthesize the rap song the user wants.
Therefore, after the target recorded audio is determined, the target recorded audio can be subjected to quality inspection to obtain quality inspection scoring results of the target recorded audio, so that whether the target recorded audio meets the preset standard of synthesizing the rap song can be determined through the quality inspection scoring results corresponding to the target recorded audio.
The quality inspection scoring result can reflect the quality of the target recorded audio data: the higher the quality inspection score, the higher the quality of the target recorded audio.
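The patent does not specify how the score is computed; one plausible toy scheme is an SNR-like ratio between the loudest and quietest frames of the recording, mapped onto a 0–100 scale. Everything below (frame size, formula, scale) is an illustrative assumption.

```python
import math

def quality_check_score(samples, frame=4):
    """Toy quality score in [0, 100]: estimate a noise floor from the
    quietest frame and compare it with the strongest frame's energy
    (an SNR-like ratio). Formula and scale are assumptions only."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    noise_floor = min(energies) + 1e-12   # avoid division by zero
    signal = max(energies)
    snr_db = 10.0 * math.log10(signal / noise_floor)
    return max(0.0, min(100.0, snr_db * 2.0))  # map roughly onto 0-100

clean = [0.0] * 4 + [0.8, -0.8, 0.8, -0.8]  # quiet floor + strong speech frame
noisy = [0.5, -0.5] * 4                      # uniformly loud: no dynamic range
```

A recording with a clear quiet floor and distinct speech scores high; a uniformly loud (noise-dominated) one scores near zero, which would trip the re-record branch of step S104.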
Step S104, judging whether the quality inspection scoring result of the target recorded audio reaches the preset standard of synthesizing the rap audio and video.
Specifically, since the quality inspection scoring result may reflect the quality of the target recorded audio, after determining the quality inspection scoring result, it may be further determined whether the quality inspection scoring result of the target recorded audio meets a preset criterion for synthesizing a rap audio and video.
In the actual application process, if the quality inspection scoring result of the target recorded audio reaches the preset standard of synthesizing the rap audio and video, it is indicated that the target recorded audio can be used for synthesizing the rap song that the user wants to synthesize, and step S105 can be executed;
if the quality inspection scoring result of the target recorded audio does not reach the preset criteria for synthesizing the rap audio and video, it is indicated that the target recorded audio cannot be used for synthesizing the rap song that the user wants to synthesize, and step S106 may be executed.
Step S105, extracting the user voiceprint features corresponding to the target recorded audio.
Specifically, as described above, when the quality inspection scoring result reaches the preset standard for synthesizing the rap audio and video, the target recorded audio can be used for synthesis, so the user voiceprint feature corresponding to it is extracted; a rap song matching the user's timbre can then be synthesized from that voiceprint feature.
The user voiceprint feature embodies the user's timbre, so determining it makes it possible to synthesize a rap song carrying that timbre.
Step S106, reminding the user to re-record the audio data, and after the user finishes re-recording, acquiring the re-recorded audio data as the target recorded audio.
Specifically, if the quality inspection scoring result does not reach the preset standard for synthesizing the rap audio and video, the target recorded audio cannot be used for synthesis. The user is therefore reminded to re-record; once re-recording is complete, the new audio data is taken as the target recorded audio and the flow returns to the quality inspection step, so that recorded audio meeting the preset standard is eventually obtained.
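The control flow of steps S103–S106 can be sketched as a simple retry loop (the threshold value and retry cap below are illustrative, not taken from the patent):

```python
def record_until_passing(record, score, threshold=60.0, max_tries=3):
    """Keep prompting the user to re-record until the quality score clears
    the threshold; give up after max_tries attempts. `record` returns the
    next take, `score` maps a take to its quality-inspection score."""
    for _ in range(max_tries):
        audio = record()
        if score(audio) >= threshold:
            return audio
    raise RuntimeError("audio quality still below threshold; aborting")

# toy stand-ins: the first take "fails" inspection, the second passes
takes = iter([30.0, 75.0])
passing = record_until_passing(lambda: next(takes), lambda a: a,
                               threshold=60.0)
```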
Step S107, converting the user voiceprint features with a preset rap song template to obtain a target conversion result.
Specifically, as noted above, the user voiceprint feature represents the user's timbre. After it is determined, it can be converted together with a preset rap song template to obtain a target conversion result.
The preset rap song templates can be built with reference to rap songs by rap singers of different styles.
Step S108, mixing the target conversion result with a preset accompaniment to obtain a target rap song matching the user's timbre.
Specifically, the target conversion result is only the rap song template re-voiced with the user's timbre; on its own it does not yet sound like a finished rap song. After the target conversion result is determined, it is therefore mixed with a preset accompaniment, whereby the target rap song matching the user's timbre is obtained.
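A minimal sketch of this mixing step, assuming the converted vocal and the accompaniment are already aligned mono waveforms at the same sample rate (the gain values are illustrative):

```python
import numpy as np

def mix(vocal: np.ndarray, accompaniment: np.ndarray,
        vocal_gain: float = 1.0, accomp_gain: float = 0.6) -> np.ndarray:
    """Overlay the converted vocal onto the accompaniment: pad the shorter
    track with silence, apply per-track gains, and peak-normalize so the
    sum never clips."""
    n = max(len(vocal), len(accompaniment))
    out = np.zeros(n)
    out[:len(vocal)] += vocal_gain * vocal
    out[:len(accompaniment)] += accomp_gain * accompaniment
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out

# toy demo with two sine "tracks" of different lengths
vocal = np.sin(np.linspace(0.0, 20.0, 8000))
accomp = 0.8 * np.sin(np.linspace(0.0, 55.0, 12000))
mixed = mix(vocal, accomp)
```

Beat-grid alignment between the converted vocal and the accompaniment is assumed to have been handled by the template itself.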
In practice, once recording is finished the user's audio can be converted into the desired rap song without a long wait, which effectively improves the efficiency of synthesizing rap songs.
According to the technical scheme above, the rap audio and video synthesizing method can synthesize a rap song matching the user's timbre from ordinary recorded speech: the user does not need to rap, only to speak normally while recording. The method imposes no requirement on recording duration, can generate the rap single video in the user's timbre without a long wait after recording, and quality-inspects the recorded audio so the user can be reminded to re-record when it is unqualified. This improves the efficiency of synthesizing rap songs in the user's timbre, and the finally synthesized rap song is more musical and pleasant to listen to, with richer detail in rhythm and feel.
The rap audio synthesized by the method provided by the embodiments of the application can further be matched with different eye-catching video backgrounds to produce rap music clips. The background of a synthesized clip is not a simple looping animation or a static picture but a genuine highlight video with content, and the synthesized clip is stored as a music video drawn frame by frame, giving it better individuality and viewing appeal.
In practice, after the target rap song matching the user's timbre is obtained, the method provided by the embodiments of the application can also configure different background videos for the synthesized rap song to improve the viewing effect. That process may include the following steps:
step S201, randomly determining a background video sample from a preset background video library as a target background video, or taking the background video sample determined from the preset background video library by a user as the target background video.
Specifically, different users have different preferences. After the target rap song matching the user's timbre is obtained, a background video sample can be determined at random from a preset background video library as the target background video, or the sample the user selects from that library can be taken as the target background video; the target rap song is then combined with the target background video to build a rap music clip.
For example, after the target rap song is determined, the method can offer to pick a background video sample at random from the preset background video library: if the user agrees, a sample is chosen at random as the target background video; if the user does not agree, the sample the user selects from the library is used as the target background video instead.
Step S202, merging the target rap song with the target background video to obtain a target rap music clip.
Specifically, once the target background video is determined, it can be used to make the rap music clip: the target rap song is combined with the target background video, whereby the target rap music clip is obtained.
According to the scheme above, after the target rap song matching the user's timbre is obtained, a background video sample is determined as the target background video, either at random from the preset background video library or by the user's choice, and the target rap song is combined with it to obtain the target rap music clip, so that the finished clip has better individuality and viewing appeal.
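One way to realize this merging step is to hand both files to ffmpeg, keeping the background's picture and substituting the rap song as the audio track. The sketch below only assembles the command line; ffmpeg itself being available, and the file names, are assumptions:

```python
def mux_command(song: str, video: str, out: str) -> list:
    """Build an ffmpeg invocation that keeps the background video's
    picture, swaps in the target rap song as the audio track, and trims
    the output to the shorter of the two streams."""
    return ["ffmpeg", "-y",
            "-i", video, "-i", song,
            "-map", "0:v:0",        # picture from the background clip
            "-map", "1:a:0",        # audio from the rap song
            "-c:v", "copy",         # no video re-encode
            "-c:a", "aac",
            "-shortest", out]

# hypothetical file names, for illustration only
cmd = mux_command("target_rap_song.m4a", "background.mp4", "rap_clip.mp4")
```

The list can be passed to `subprocess.run(cmd)` on a machine where ffmpeg is installed.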
As described above, the method provided by the embodiments of the application can prepare different rap song templates for the user to select. The creation of a preset rap song template is described next and may include the following steps:
Step S301, collecting rap song segments of a target rap singer.
Specifically, when rap song templates of different styles are to be synthesized, rap song segments of a target rap singer are collected, so that a rap song template similar in style to that singer's rap songs can be synthesized from the segments.
The method does not require collecting a large number of complete rap songs by the target rap singer; partial segments suffice to build a template similar to the singer's style, which effectively saves time and improves the efficiency of synthesizing rap song templates.
Step S302, separating the rap vocals, the accompaniment and the rap lyrics in the rap song segments of the target rap singer.
Specifically, after the rap song segments of the target rap singer are obtained, they are processed further so that a rap song template can be formed.
For example, the rap vocals, the accompaniment and the rap lyrics can be separated from the rap song segments of the target rap singer.
Step S303, labeling the rap lyrics to obtain a lyric labeling result for the rap song segments of the target rap singer.
Specifically, after the lyrics of the rap song segments are obtained, they are labeled so that the vocal audio of the target rap singer can be aligned with the lyrics, yielding a lyric labeling result from which the rap song template can later be synthesized.
Step S304, processing the lyric labeling result according to a preset format to obtain a lyric file corresponding to the labeling result.
Specifically, to better extract the target singer's sound characteristics from the lyrics, the lyric labeling result is processed into a preset format, producing a lyric file corresponding to the labeling result.
Processing the labeling result into a preset format allows a deep learning model to produce rap song templates in batches, improving both the efficiency and the number of templates. The preset format may be a sound file format together with a text document file format.
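The patent leaves the concrete layout of the text document open; the tab-separated `start end word` layout below is one plausible choice for serializing the labeling result, not the format actually used:

```python
def lyric_file_lines(labels):
    """Serialize (word, start_sec, end_sec) labels into one plausible
    text-document layout: tab-separated start/end timestamps plus the
    word, one entry per line. Illustrative only."""
    return ["%.3f\t%.3f\t%s" % (start, end, word)
            for word, start, end in labels]

# hypothetical labeling result for a two-word fragment
lines = lyric_file_lines([("yo", 0.0, 0.42), ("check", 0.42, 0.9)])
```

A fixed, machine-readable layout like this is what lets downstream models consume lyric files in batches.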
Step S305, extracting semantic features of the vocal audio in the rap song segments of the target rap singer.
Specifically, after the rap vocals, the accompaniment and the rap lyrics are separated, semantic features of the vocal audio in the rap song segments of the target rap singer can also be extracted.
These semantic features capture the target singer's rap style well; determining them makes it possible to generate a rap song template that shares the semantic characteristics of the singer's vocal audio.
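Claim 7 later characterizes such semantic features as per-frame posterior probabilities over voice categories. A minimal posteriorgram sketch, with a random linear layer standing in for the trained acoustic model (an assumption), is:

```python
import numpy as np

def phonetic_posteriorgram(frames: np.ndarray,
                           weights: np.ndarray) -> np.ndarray:
    """Map each acoustic frame to a probability distribution over phone
    classes via a linear layer plus a numerically stable softmax.

    frames:  (n_frames, feat_dim) acoustic features
    weights: (feat_dim, n_phones) stand-in model parameters
    """
    logits = frames @ weights
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# toy demo: 5 frames of 13-dim features, 40 phone classes
rng = np.random.default_rng(0)
demo_ppg = phonetic_posteriorgram(rng.standard_normal((5, 13)),
                                  rng.standard_normal((13, 40)))
```

Because each row is a probability distribution, the representation is largely speaker-independent, which is what makes it useful for timbre conversion.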
Step S306, combining the lyric file with the semantic features of the vocal audio in the rap song segments of the target rap singer to synthesize a rap song template.
Specifically, the semantic features of the vocal audio reflect the target rap singer's style and characteristics, so a rap song template similar to the singer's rap songs can be generated from them. Once the lyric file and the semantic features of the vocal audio are determined, they are combined to synthesize the rap song template.
According to the scheme above, the method provided by the embodiments of the application collects rap song segments of a target rap singer; separates the rap vocals, accompaniment and lyrics; labels the lyrics to obtain a lyric labeling result for the segments; processes that result into a preset format to obtain the corresponding lyric file; extracts semantic features of the vocal audio in the segments; and finally combines the lyric file with those semantic features to synthesize a rap song template.
The specific processing flow of the rap audio and video synthesizing method above can be understood with reference to the rap audio and video synthesizing system described below and is not repeated here.
Further, a corresponding rap audio and video synthesizing apparatus implementing the processing flow of the method provided by the embodiments of the application can be built on top of that method.
The rap audio and video synthesizing apparatus generated by the method provided by the embodiments of the application can be applied to rap audio and video synthesizing equipment such as a terminal: a mobile phone, a computer, and so on. Optionally, fig. 3 shows a block diagram of the hardware structure of the rap audio and video synthesizing apparatus; referring to fig. 3, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4.
In the embodiment of the present application, the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 complete communication with each other through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;
The memory 3 may comprise high-speed RAM and may further comprise non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to: and realizing each processing flow in the terminal rap audio and video synthesis scheme.
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to: and realizing each processing flow of the terminal in the rap audio and video synthesis scheme.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, each embodiment focusing mainly on its differences from the others; for identical or similar parts, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. The various embodiments may be combined with one another. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for synthesizing a rap audio and video is characterized by comprising the following steps:
responding to the operation of selecting a target synthetic text by a user, and determining the target synthetic text selected by the user;
according to the target synthetic text selected by the user, responding to the operation of clicking a recording button by the user, starting recording the audio data authored by the user, and after the user finishes recording, acquiring the original recording data of the user as target recording audio;
quality inspection is carried out on the target recorded audio to obtain a quality inspection scoring result of the target recorded audio;
judging whether the quality inspection scoring result of the target recorded audio reaches the preset standard of synthesizing the rap audio and video;
if the quality inspection scoring result of the target recorded audio reaches the preset standard of synthesizing the rap audio and video, extracting the user voiceprint features corresponding to the target recorded audio;
converting the voice print characteristics of the user with a preset rap song template to obtain a target conversion result;
and mixing the target conversion result with preset accompaniment to obtain a target rap song matched with the tone of the user.
2. The method according to claim 1, characterized in that the method further comprises:
if the quality inspection scoring result of the target recorded audio does not meet the preset standard of synthesizing the rap audio and video, reminding the user of re-recording the audio data, and acquiring the re-recorded audio data of the user as the target recorded audio after the user re-records the audio data;
and returning to perform the operation of quality inspection on the target recorded audio.
3. The method of claim 1, wherein after obtaining the target rap song matching the timbre of the user, the method further comprises:
randomly determining a background video sample from a preset background video library as a target background video, or taking the background video sample determined by the user from the preset background video library as the target background video;
and merging the target rap song with the target background video to obtain a target rap music clip.
4. The method of claim 1, wherein the creation of the preset rap song template includes:
collecting rap song segments of a target rap singer;
separating the rap vocals, the accompaniment and the rap lyrics in the rap song segments of the target rap singer;
labeling the rap lyrics to obtain a lyric labeling result for the rap song segments of the target rap singer;
processing the lyric labeling result according to a preset format to obtain a lyric file corresponding to the labeling result;
extracting semantic features of the vocal audio in the rap song segments of the target rap singer;
and combining the lyric file with the semantic features of the vocal audio in the rap song segments of the target rap singer to synthesize a rap song template.
5. A system for synthesizing rap audio and video, applying the method for synthesizing rap audio and video according to any one of claims 1 to 4, characterized by comprising: a client and a server;
wherein:
the client responds to the user's operations of entering the artificial intelligence rap interface and clicking a recording button, records the user's target recorded audio as the user reads the rap lyrics, and uploads the target recorded audio to the server after the user clicks to finish the recording operation;
the server inspects the quality of the user's target recorded audio by using a sound quality detection module, obtains a quality inspection scoring result of the target recorded audio and returns it to the client, judges whether the target recorded audio meets the preset standard for synthesizing the rap audio and video according to that standard and the scoring result, and, if it does, sends the target recorded audio to a rap synthesis service module of the server;
after receiving the target recorded audio, the rap synthesis service module performs noise reduction and voiceprint extraction on it; after obtaining the voiceprint feature vector corresponding to the user, it reads the semantic posterior probability features and fundamental frequency features in a preset rap song template, inputs them together with the user's voiceprint feature vector into a preset conversion model and a vocoder for synthesis, obtains the target rap audio corresponding to the user's timbre, and returns the target rap audio and related video materials to the client;
and the client renders the rap audio sung in the user's timbre from the target rap audio and the related video materials returned by the rap synthesis service module of the server, and generates a target rap music clip with the video materials returned by the server as the background.
6. The system of claim 5, further comprising:
if the server determines that the quality inspection scoring result of the target recorded audio does not reach the preset standard for synthesizing the rap audio and video, it sends feedback information reminding the user to re-record audio data to the client;
and the client receives feedback information sent by the server and used for reminding the user of re-recording the audio data, reminds the user of re-recording the audio data, and acquires the re-recorded audio data of the user as target recording audio and sends the target recording audio to the server after the user finishes re-recording the audio data.
7. The system of claim 5, further comprising an operational background,
the operation background is used for collecting target rap song segments of target rap singers, separating the rap vocals and accompaniment in the target rap song segments by using an accompaniment-vocal separation tool, labeling the rap lyrics in the target rap song segments, uniformly processing the rap lyrics into lyric files in a preset format as required, and sending the rap vocals and accompaniment in the target rap song segments together with the lyric files in the preset format to the server;
the server receives the rap vocals and accompaniment in the target rap song segments and the lyric files in the preset format, and extracts semantic features from the lyric files using a preset voice posterior probability model, for combination with the rap vocals and accompaniment to synthesize a rap song template; the preset voice posterior probability model is obtained through training, with lyric files of rap songs as training samples and the speaker-semantic features corresponding to the posterior probability of each voice category in each specific time frame of those lyric files as sample labels.
8. The system of claim 7, wherein the predetermined format is a sound file format and a text document file format.
9. A rap audio-video synthesizing apparatus, comprising: one or more processors, and memory;
stored in the memory are computer readable instructions which, when executed by the one or more processors, implement the steps of the method for synthesizing rap audio and video according to any one of claims 1 to 4.
10. A readable storage medium, characterized by: the readable storage medium has stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method of synthesizing audio and video as described in any one of claims 1 to 4.
CN202211727092.4A 2022-12-30 2022-12-30 Method, system, equipment and readable storage medium for synthesizing talking voice and video Pending CN116030785A (en)

Priority application: CN202211727092.4A, filed 2022-12-30 (priority date 2022-12-30), published as CN116030785A on 2023-04-28; family ID 86079449; country: CN.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination