CN112382274A - Audio synthesis method, device, equipment and storage medium

Audio synthesis method, device, equipment and storage medium

Info

Publication number
CN112382274A
Authority
CN
China
Prior art keywords
audio
target
synthesized
features
linguistic
Prior art date
Legal status
Pending
Application number
CN202011270710.8A
Other languages
Chinese (zh)
Inventor
汤本来
顾宇
殷翔
李忠豪
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202011270710.8A
Publication of CN112382274A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 - Detection of language
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application discloses an audio synthesis method, apparatus, device, and storage medium, relating to the field of speech synthesis. The specific implementation scheme is as follows: acquiring audio to be synthesized and target audio; determining corresponding linguistic features based on the audio to be synthesized; determining a target timbre based on the target audio; determining acoustic features based on the audio to be synthesized, the linguistic features, and the target timbre; and synthesizing and outputting the target timbre audio based on the acoustic features. With the linguistic features determined from the audio to be synthesized and the target timbre determined from the target audio, this implementation can accurately and quickly synthesize audio that corresponds to the audio to be synthesized and carries the target timbre, simplifying the audio synthesis process and improving the accuracy of synthesizing audio with a specific timbre.

Description

Audio synthesis method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech synthesis, specifically to the field of natural language processing, computer technology, artificial intelligence, and deep learning technology, and more particularly to an audio synthesis method, apparatus, device, and storage medium.
Background
In recent years, with the rapid development of online education and online learning, audio synthesis technology has been widely studied and has attracted considerable attention. Audio synthesis aims to convert the audio of a given user into audio with a different accent, a different timbre, or both. The technology also has broad application prospects in entertainment. However, audio synthesis with existing technology is slow, and the synthesized result is often inaccurate.
Disclosure of Invention
The present disclosure provides an audio synthesis method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided an audio synthesizing method including: acquiring audio to be synthesized and target audio; determining corresponding linguistic features based on the audio to be synthesized; determining a target tone color based on the target audio; determining acoustic features based on the audio to be synthesized, the linguistic features and the target timbre; and synthesizing the target tone audio based on the acoustic features and outputting the target tone audio.
According to another aspect of the present disclosure, there is provided an audio synthesizing apparatus including: an acquisition unit configured to acquire audio to be synthesized and target audio; a linguistic feature determination unit configured to determine a corresponding linguistic feature based on the audio to be synthesized; a target tone determination unit configured to determine a target tone based on the target audio; an acoustic feature determination unit configured to determine an acoustic feature based on the audio to be synthesized, the linguistic feature, and the target timbre; and a synthesizing unit configured to synthesize the target timbre audio based on the acoustic features and output.
According to still another aspect of the present disclosure, there is provided an audio synthesis electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio synthesis method as described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the audio synthesis method as described above.
The technology of the application solves the problem that audio synthesis cannot be performed accurately and quickly. Using the linguistic features determined from the audio to be synthesized and the target timbre determined from the target audio, audio that corresponds to the audio to be synthesized and carries the target timbre can be synthesized accurately and quickly, which simplifies the audio synthesis process and improves the accuracy of synthesizing audio with a specific timbre.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an audio synthesis method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of an audio synthesis method according to the present application;
FIG. 4 is a flow diagram of another embodiment of an audio synthesis method according to the present application;
FIG. 5 is a schematic block diagram of an embodiment of an audio synthesis apparatus according to the present application;
fig. 6 is a block diagram of an electronic device for implementing an audio synthesis method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the audio synthesis method or audio synthesis apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a speech synthesis application, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, in-vehicle computers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server that provides various services, such as a background server that processes audio to be synthesized and target audio collected by the terminal devices 101, 102, 103. The background server can acquire the audio to be synthesized and the target audio, and determine corresponding linguistic characteristics based on the audio to be synthesized; determining a target tone color based on the target audio; determining acoustic features based on the audio to be synthesized, the linguistic features and the target timbre; and synthesizing the target tone audio based on the acoustic features and outputting the target tone audio.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed cluster composed of multiple servers or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the audio synthesis method provided by the embodiment of the present application is generally executed by the server 105. Accordingly, the audio synthesizing apparatus is generally provided in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an audio synthesis method according to the present application is shown. The audio synthesis method of the embodiment comprises the following steps:
step 201, acquiring the audio to be synthesized and the target audio.
In this embodiment, the execution subject of the audio synthesis method (for example, the server 105 in fig. 1) may obtain the audio to be synthesized locally, or may obtain, through a wired or wireless connection, audio to be synthesized that a terminal device collected by recording. Specifically, the audio to be synthesized may be any sentence spoken or any song sung by the user; the present application does not limit its content. The audio to be synthesized may be stored as MP3, MP4, or another format; the present application does not limit its storage format. The target audio is the audio that carries the timbre to be converted to. For example, the target audio may be a recording of classmate A's voice or of classmate B's voice; the present application does not limit the timbre of the target audio. The target audio may likewise be in MP3, MP4, or another format, and its storage format is not limited here. It should be understood that the target audio may be human audio or the audio of another living creature; the present application does not limit the source of the target audio. Both the audio to be synthesized and the target audio may be dry vocals, i.e., pure voice without accompanying music and without post-processing, whether from a person or from another creature. The audio to be synthesized may be spoken audio (a language statement) or music/singing audio.
Step 202, determining corresponding linguistic features based on the audio to be synthesized.
After obtaining the audio to be synthesized, the execution subject may determine the corresponding linguistic features based on it. Specifically, the linguistic features may include prosodic features, syntax, discourse structure, information structure, and the like. Prosodic features are suprasegmental features, part of the sound system of a language, and can be divided into three main aspects: intonation, temporal distribution, and stress, all realized through suprasegmental characteristics. Suprasegmental features include pitch, intensity, and temporal characteristics, carried by a phoneme or group of phonemes. Prosody is a typical feature of human natural language and has many characteristics common across languages; for example, pitch declination, stress, and pauses are common to different languages. Prosodic features are one of the important vehicles of linguistic and emotional expression. Specifically, the execution subject may obtain historical synthesized audio together with the linguistic features corresponding to it. The execution subject may compare the audio to be synthesized with the historical synthesized audio, and take the linguistic features of any historical synthesized audio whose similarity to the audio to be synthesized exceeds a preset value as the linguistic features of the audio to be synthesized. When computing this similarity, the execution subject may compare the audio to be synthesized with the phonemes of each historical synthesized audio; in response to determining that the proportion of matching phonemes exceeds a preset value, i.e., that the similarity between the two exceeds the preset value, the linguistic features of that historical synthesized audio are taken as the linguistic features of the audio to be synthesized. A minimal sketch of this lookup is given below.
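The following sketch assumes phoneme sequences have already been extracted and that historical synthesized audio is stored as (phoneme sequence, linguistic features) pairs; the function names, the overlap measure, and the 0.8 threshold are illustrative assumptions rather than the patent's concrete procedure.

    # Hypothetical lookup: reuse the linguistic features of the historical audio
    # whose phoneme overlap with the audio to be synthesized exceeds a preset value.
    def phoneme_overlap(phonemes_a, phonemes_b):
        """Fraction of aligned positions whose phonemes match (a simple similarity proxy)."""
        if not phonemes_a or not phonemes_b:
            return 0.0
        matches = sum(a == b for a, b in zip(phonemes_a, phonemes_b))
        return matches / max(len(phonemes_a), len(phonemes_b))

    def lookup_linguistic_features(target_phonemes, history, threshold=0.8):
        """history: iterable of (phoneme_sequence, linguistic_features) pairs."""
        best_features, best_score = None, threshold
        for phonemes, features in history:
            score = phoneme_overlap(target_phonemes, phonemes)
            if score >= best_score:
                best_features, best_score = features, score
        return best_features  # None when no historical clip is similar enough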
Step 203, determining the target tone color based on the target audio.
After acquiring the target audio, the execution subject may determine the target timbre based on it. Audio comprises a number of phonemes. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes fall into two major categories, vowels and consonants. For example, the Chinese syllable 啊 (ā) has one phoneme, 爱 (ài) has two, and 代 (dài) has three. In [ma-mi], the two [m] sounds are articulated identically and are the same phoneme, while [a] and [i] differ and are different phonemes. Timbre refers to the characteristic that different sounds always differ in waveform: different vibrating bodies have different materials and structures, so the sounds they produce differ. A piano, a violin, and a human voice sound different, and every person sounds different from every other. Timbre can therefore be understood as the distinctive character of a sound; it is one of the attributes of sound (along with loudness and pitch) and is primarily determined by its overtones. The vibration of a sounding body is composed of a fundamental and multiple overtones, and the particular mix of overtones determines a specific timbre: besides the fundamental, a sound naturally mixes in many components at different frequencies (the number of vibrations of the vibrating object per second), which is why listeners can tell different sounds apart. Specifically, the execution subject may determine, from the target audio and a pre-trained classification model, the identifier corresponding to each phoneme in the target audio, where the pre-trained classification model represents the correspondence between phonemes and identifiers; determine the fundamental and overtones in the target audio from the identifiers; and determine the target timbre from the fundamental and overtones. Specifically, the execution subject may input the fundamental and overtones into a pre-trained timbre conversion model, which represents the correspondence between fundamental, overtones, and timbre, and obtain the corresponding timbre as output. For example, the timbre conversion model may be a pre-trained Convolutional Neural Network (CNN); a hedged sketch follows.
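As an illustration of the timbre path described above, the sketch below assumes the fundamental and an aggregated overtone track have already been extracted from the target audio (here they are random placeholders) and maps them to a timbre embedding with a small convolutional network; the layer sizes, the 64-dimensional timbre vector, and the network itself are assumptions, not the patent's pre-trained timbre conversion model.

    import torch
    import torch.nn as nn

    class TimbreConversionCNN(nn.Module):
        """Maps (batch, 2, n_frames) fundamental/overtone tracks to a timbre vector."""
        def __init__(self, timbre_dim=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(2, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.proj = nn.Linear(64, timbre_dim)

        def forward(self, pitch_tracks):
            h = self.conv(pitch_tracks).squeeze(-1)   # (batch, 64)
            return self.proj(h)                       # (batch, timbre_dim)

    # Placeholder fundamental-frequency and aggregated-overtone tracks from the target audio.
    f0 = torch.randn(1, 1, 200)
    overtones = torch.randn(1, 1, 200)
    target_timbre = TimbreConversionCNN()(torch.cat([f0, overtones], dim=1))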
Step 204, determining acoustic features based on the audio to be synthesized, the linguistic features and the target timbre.
After obtaining the linguistic features, the execution subject may determine the acoustic features based on the audio to be synthesized, the linguistic features, and the target timbre. Specifically, the execution subject may input the audio to be synthesized, the linguistic features, and the target timbre into a pre-trained conversion model and obtain the corresponding acoustic features as output. The pre-trained conversion model represents the correspondence between audio, linguistic features, timbre, and acoustic features; a minimal sketch is given below.
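The sketch below assumes the audio to be synthesized has already been encoded into frame-level linguistic features, which the conversion model combines with the timbre embedding (broadcast over time) to predict an 80-bin mel spectrogram; the architecture and dimensions are illustrative assumptions, since the patent does not fix a concrete network.

    import torch
    import torch.nn as nn

    class ConversionModel(nn.Module):
        """Predicts acoustic features (mel frames) from linguistic features and a timbre vector."""
        def __init__(self, linguistic_dim=256, timbre_dim=64, mel_bins=80):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(linguistic_dim + timbre_dim, 512), nn.ReLU(),
                nn.Linear(512, mel_bins),
            )

        def forward(self, linguistic_feats, timbre):
            # linguistic_feats: (batch, frames, linguistic_dim); timbre: (batch, timbre_dim)
            timbre = timbre.unsqueeze(1).expand(-1, linguistic_feats.size(1), -1)
            return self.net(torch.cat([linguistic_feats, timbre], dim=-1))

    # Example: 300 frames of linguistic features plus one timbre vector -> (1, 300, 80) mel frames.
    mel = ConversionModel()(torch.randn(1, 300, 256), torch.randn(1, 64))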
And step 205, synthesizing the target tone audio based on the acoustic features, and outputting.
After obtaining the acoustic features, the execution subject may synthesize the audio with the target timbre based on them and output it. Specifically, the execution subject may synthesize the target timbre audio from the acoustic features, combined with a preset correspondence between the acoustic features, the audio to be synthesized, and audio with the target timbre, and output it through an audio playback device.
With continued reference to fig. 3, a schematic diagram of one application scenario of the audio synthesis method according to the present application is shown. In the application scenario of fig. 3, a server 304 acquires the audio 301 to be synthesized and the target audio 302 through a network 303. The server 304 determines the corresponding linguistic features 305 based on the audio 301 to be synthesized, determines a target timbre 306 based on the target audio 302, determines acoustic features 307 based on the audio 301 to be synthesized, the linguistic features 305, and the target timbre 306, and finally synthesizes and outputs the target timbre audio 308 based on the acoustic features 307.
In this way, the audio that corresponds to the audio to be synthesized and carries the target timbre can be synthesized accurately and quickly from the linguistic features determined from the audio to be synthesized and the target timbre determined from the target audio, which simplifies the audio synthesis process and improves the accuracy of synthesizing audio with a specific timbre.
With continued reference to FIG. 4, a flow 400 of another embodiment of an audio synthesis method according to the present application is shown. As shown in fig. 4, the audio synthesizing method of the present embodiment may include the following steps:
step 401, acquiring an audio to be synthesized and a target audio.
Step 402, determining corresponding linguistic features based on the audio to be synthesized.
The principle of step 401 to step 402 is similar to that of step 201 to step 202, and is not described herein again.
Specifically, step 402 may be implemented by step 4021:
step 4021, determining the linguistic features corresponding to the audio to be synthesized according to the audio to be synthesized and the pre-trained recognition model.
In this embodiment, the pre-trained recognition model represents the correspondence between audio and linguistic features. After obtaining the audio to be synthesized, the execution subject may determine its corresponding linguistic features from the audio and the pre-trained recognition model: the audio to be synthesized is input into the recognition model, and the corresponding linguistic features are output. Linguistic features may include prosodic features, syntax, discourse structure, information structure, and the like. Prosodic features are suprasegmental features, part of the sound system of a language, and can be divided into three main aspects: intonation, temporal distribution, and stress, realized through suprasegmental characteristics. Suprasegmental features include pitch, intensity, and temporal characteristics, carried by a phoneme or group of phonemes. Prosody is a typical feature of human natural language and has many characteristics common across languages; for example, pitch declination, stress, and pauses are common to different languages. Prosodic features are one of the important vehicles of linguistic and emotional expression. To train the recognition model, an initial neural network model may first be obtained; a training sample set is acquired in which each training sample contains audio and the labelled linguistic features corresponding to that audio; the audio of the training samples is used as the input of the initial neural network model and the corresponding labelled linguistic features as the expected output, and the initial neural network model is trained; the trained initial neural network model is then used as the recognition model. A sketch of such a training loop follows.
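The sketch below assumes the audio has already been converted into fixed-size feature tensors and that the labelled linguistic features are likewise tensors; the network, loss, and shapes are illustrative assumptions rather than the patent's actual recognition model.

    import torch
    import torch.nn as nn

    def train_recognition_model(model, samples, epochs=10, lr=1e-3):
        """samples: iterable of (audio_features, linguistic_features) tensor pairs."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for audio_feats, linguistic_feats in samples:
                optimizer.zero_grad()
                loss = loss_fn(model(audio_feats), linguistic_feats)
                loss.backward()
                optimizer.step()
        return model  # the trained network serves as the recognition model

    # Toy training set: 4 mini-batches of 8 clips with 128-dim audio features and 256-dim labels.
    training_set = [(torch.randn(8, 128), torch.randn(8, 256)) for _ in range(4)]
    recognition_model = train_recognition_model(nn.Linear(128, 256), training_set, epochs=2)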
According to the embodiment, the linguistic features corresponding to the audio to be synthesized can be accurately obtained according to the audio to be synthesized and the pre-trained recognition model, so that the quality of the audio with the specific tone can be improved.
In some optional implementations of this embodiment, the execution subject may further determine, from the audio to be synthesized and the pre-trained recognition model, a category identifier for each phoneme in the audio to be synthesized, where the pre-trained recognition model in this implementation represents the correspondence between the phonemes in the audio and the category identifiers. The obtained category identifier characterizes the category of each phoneme in the audio to be synthesized; for example, phonemes may be categorized as intonation, time-domain distribution, accent, pitch, stress, or pause phonemes, each category represented by a numeric identifier such as 1, 2, 3, and so on. The execution subject may then determine the acoustic features used to synthesize the audio with the target timbre from the phonemes corresponding to each obtained identifier and a preset correspondence between identifiers, phonemes, and acoustic features. The acoustic feature may be the mel spectrum corresponding to each phoneme required to generate the target timbre. The execution subject may determine, based on the acoustic features, the audio with the target timbre corresponding to the audio to be synthesized, and output it. This implementation enriches the mel spectra available for synthesizing the target timbre audio and improves the accuracy of synthesizing it. A small lookup sketch follows.
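The sketch below is an assumption about how such a correspondence could be organised, not the patent's data structure: each phoneme's category identifier indexes a preset table mapping (identifier, phoneme) pairs to mel-spectrum fragments, which are concatenated into the acoustic features used for synthesis.

    import numpy as np

    def build_acoustic_features(phoneme_keys, mel_table, n_mels=80):
        """phoneme_keys: list of (category_id, phoneme) pairs;
        mel_table: dict mapping (category_id, phoneme) -> (n_mels, n_frames) mel fragment."""
        fragments = []
        for key in phoneme_keys:
            fragment = mel_table.get(key)
            if fragment is None:
                fragment = np.zeros((n_mels, 1))  # silent fallback when no preset entry exists
            fragments.append(fragment)
        return np.concatenate(fragments, axis=1)  # (n_mels, total_frames)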
Step 403, determining a target tone color based on the target audio.
The principle of step 403 is similar to that of step 203, and is not described in detail here.
Specifically, step 403 can be implemented by steps 4031 to 4032:
step 4031, according to the target audio and the pre-trained identity verification model, an identity vector corresponding to the target audio is determined.
The pre-trained identity verification model represents the correspondence between audio and identity vectors. After obtaining the target audio, the execution subject may determine the identity vector corresponding to the target audio from the target audio and the pre-trained identity verification model: the target audio is input into the model, and the corresponding identity vector is output. Specifically, the identity vector may be a set of multidimensional data used to identify the timbre information of the speaker of the target audio; for example, it may be a vector corresponding to a data sequence such as [0.3, 0.3, 0.5, 0.6, ...], where a combination of one or more values in the sequence characterizes a unique timbre.
Step 4032, the target tone is determined according to the identity vector.
After determining the identity vector corresponding to the target audio, the execution subject may determine the target timbre from it. Specifically, the execution subject may determine the target timbre from the identity vector and a preset correspondence between identity vectors and timbres. In another implementation, the execution subject may determine the timbre vector corresponding to the identity vector according to the similarity between the identity vector and each existing timbre vector, and take the timbre corresponding to that timbre vector as the target timbre. Specifically, in response to determining that the similarity between the identity vector and a timbre vector exceeds a preset threshold, the execution subject takes the timbre corresponding to that timbre vector as the target timbre; the preset threshold is not limited in the present application. A small matching sketch is given below.
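The matching sketch below illustrates the second implementation, assuming the identity vector and the stored timbre vectors are plain numeric arrays; the cosine measure, the 0.85 threshold, and the timbre-bank layout are illustrative assumptions, not the patent's concrete matching rule.

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def match_target_timbre(identity_vector, timbre_bank, threshold=0.85):
        """timbre_bank: dict mapping timbre name -> reference timbre vector."""
        best_name, best_score = None, threshold
        for name, vector in timbre_bank.items():
            score = cosine_similarity(identity_vector, vector)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name  # None when no stored timbre is similar enough

    # Example: the identity vector is closest to speaker_b, so speaker_b's timbre is chosen.
    bank = {"speaker_a": np.array([0.1, 0.9, 0.2]), "speaker_b": np.array([0.3, 0.3, 0.5])}
    print(match_target_timbre(np.array([0.3, 0.31, 0.52]), bank))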
In this embodiment, the target timbre is determined from the identity vector produced by the identity verification model, so the audio to be synthesized can be converted into any desired timbre without being limited by the training sample set of the timbre conversion model; that is, timbres not present in that training set can also be converted to freely and accurately, which improves the flexibility of timbre conversion for the audio to be synthesized and improves the user experience. Note that the present application converts arbitrary audio (or a dry vocal) into audio (a dry vocal) with an arbitrary timbre, rather than converting text into audio with an arbitrary timbre.
Step 404, determining acoustic features based on the audio to be synthesized, the linguistic features, and the target timbre.
The principle of step 404 is similar to that of step 204, and is not described here again.
Specifically, step 404 can be implemented by steps 4041 to 4042:
step 4041, determining a text corresponding to the audio to be synthesized according to the audio to be synthesized and the pre-trained recognition model.
In this embodiment, the pre-trained recognition model may also represent the correspondence between audio and text. After obtaining the linguistic features, the execution subject may determine the text corresponding to the audio to be synthesized from the audio and the pre-trained recognition model: the audio to be synthesized is input into the recognition model, and the corresponding text is output. The recognition model can also be trained as follows: obtain an initial neural network model; acquire a training sample set in which each training sample contains audio and the labelled text corresponding to that audio; use the audio as the input of the initial neural network model and the corresponding text as the expected output, and train the initial neural network model; and use the trained initial neural network model as the recognition model.
Step 4042, determining the acoustic features corresponding to the audio to be synthesized according to the text, the linguistic features, the target timbre and the pre-trained conversion model.
The pre-trained conversion model represents the correspondence between text, linguistic features, timbre, and acoustic features. After obtaining the text corresponding to the audio to be synthesized, the execution subject may determine the corresponding acoustic features from the text, the linguistic features, the target timbre, and the pre-trained conversion model. Specifically, the execution subject may input the text, the linguistic features, and the target timbre into the pre-trained conversion model to obtain the acoustic features corresponding to the audio to be synthesized. The obtained acoustic features may be the mel spectrum corresponding to each phoneme required to synthesize the target timbre audio.
By obtaining the acoustic features needed to synthesize the target timbre audio from the text, the linguistic features, the target timbre, and the pre-trained conversion model, the mel-spectrum features needed to generate the target timbre audio are made more complete, and the accuracy of generating audio with the target timbre is improved.
And step 405, synthesizing the target tone color audio based on the acoustic features, and outputting the target tone color audio.
The principle of step 405 is similar to that of step 205, and is not described here again.
In particular, step 405 may be implemented by the following step 4051, not shown in fig. 4:
step 4051, synthesizing a target timbre audio according to the acoustic features and the corresponding relationship between the preset acoustic features and the audio.
After obtaining the acoustic features, the execution subject may synthesize the target timbre audio from the acoustic features and a preset correspondence between acoustic features and audio. Specifically, the execution subject may input the acoustic features to a vocoder in which a correspondence between acoustic features and synthesized audio is provided; the vocoder converts the acoustic features into audio with the target timbre. At its transmitting end, the vocoder encodes and encrypts the received acoustic features to match the channel and transmits them over the channel to its receiving end, where the received features are analyzed in the frequency domain: unvoiced and voiced sounds are distinguished, the fundamental frequency of voiced sounds is determined, and the voiced/unvoiced decision, the voiced fundamental frequency, and the spectral envelope are selected as the feature parameters to transmit. The analysis may of course also be performed in the time domain, periodically extracting some acoustic features for linear prediction, so as to generate the audio with the target timbre corresponding to the acoustic features. The vocoder may be a channel vocoder, a formant vocoder, a pattern vocoder, a linear prediction vocoder, a correlation vocoder, or an orthogonal function vocoder; the type of vocoder is not specifically limited in the present application. A placeholder sketch of this final conversion step follows.
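A vocoder of one of the kinds listed above would normally perform the conversion; in the sketch below, Griffin-Lim inversion of a mel spectrogram via librosa stands in purely as a runnable placeholder, and the sampling rate, FFT size, and random placeholder spectrogram are assumptions.

    import numpy as np
    import librosa
    import soundfile as sf

    def vocode(mel_spectrogram, sr=22050, n_fft=1024, hop_length=256):
        """mel_spectrogram: (n_mels, n_frames) power mel spectrogram -> waveform."""
        return librosa.feature.inverse.mel_to_audio(
            mel_spectrogram, sr=sr, n_fft=n_fft, hop_length=hop_length)

    # Write the synthesized target-timbre audio to disk (placeholder acoustic features).
    mel = np.abs(np.random.randn(80, 400))
    sf.write("target_timbre.wav", vocode(mel), 22050)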
Synthesizing the target timbre audio from the acoustic features and the preset correspondence between acoustic features and audio improves the accuracy of audio synthesis, makes it convenient to synthesize audio with any timbre the user requires, makes audio synthesis more engaging, and improves the user experience.
In some optional implementations of the present embodiment, the audio synthesis method further comprises the following model training steps not shown in fig. 4: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise texts, linguistic features, timbres and labeled acoustic features corresponding to the texts, the linguistic features and the timbres; taking the text, the linguistic features and the tone of the training samples in the training sample set as the input of an initial neural network model, taking the acoustic features corresponding to the input text, the linguistic features and the tone as the expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In this embodiment, the execution subject may obtain the initial neural network model through a wired connection manner or a wireless connection manner. The initial Neural Network model may include various Artificial Neural Networks (ANN) including hidden layers. In this embodiment, the execution main body may also obtain a pre-stored initial model from a local place, or may also obtain the initial model from a communication-connected electronic device, which is not limited herein.
In this embodiment, the execution subject may acquire the training sample set in various ways. Specifically, the training samples in the training sample set may include text, linguistic features, timbre, and labeled acoustic features corresponding to the text, the linguistic features, and the timbre. The acoustic features corresponding to the text, the linguistic features, and the timbre marked in the training sample may be obtained from a local or communicatively connected electronic device in a wired or wireless connection manner, may also be manually marked in real time, or may be obtained by first performing automatic marking and then manually performing supplementary modification to correct a marking error, which is not specifically limited in this application. The text in the training sample may be obtained from a local or communicatively connected electronic device. The linguistic features in the training samples may be extracted in real-time or may be obtained from a local or communicatively coupled electronic device via a wired or wireless connection. The timbre in the training sample may be extracted in real time or may be obtained from a local or communicatively connected electronic device via a wired or wireless connection.
In this embodiment, it can be understood that the target timbre obtained from the target audio and the pre-trained identity verification model may not appear in the training sample set of the initial neural network model, so the trained conversion model alone cannot synthesize an arbitrary timbre. For this reason, the target timbre in the acquired target audio is extracted by the identity verification model and then fed into the trained conversion model, so that any timbre not present in the training sample set of the initial neural network model can also be synthesized, making it convenient and accurate to synthesize audio with an arbitrary timbre. In this embodiment, the training sample set used to train the conversion model does not include the target timbre.
By acquiring a training sample set and training the initial neural network model, this embodiment obtains a conversion model able to generate the corresponding acoustic features from text, linguistic features, and timbre. With this trained conversion model, arbitrary singing audio can be converted into singing audio with an arbitrary target timbre, which improves the quality of the synthesized audio with the target timbre, makes audio synthesis more engaging, and improves the user experience. A training sketch is given below.
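The sketch below assumes each training sample bundles a text embedding, linguistic features, a timbre vector, and the labelled mel-spectrum acoustic features as tensors; the concatenation-based network, the dimensions, and the loss are illustrative assumptions rather than the patent's conversion model.

    import torch
    import torch.nn as nn

    class SimpleConversionModel(nn.Module):
        """Predicts acoustic features from text, linguistic features, and timbre."""
        def __init__(self, text_dim=128, linguistic_dim=256, timbre_dim=64, mel_bins=80):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(text_dim + linguistic_dim + timbre_dim, 512), nn.ReLU(),
                nn.Linear(512, mel_bins),
            )

        def forward(self, text, linguistic, timbre):
            return self.net(torch.cat([text, linguistic, timbre], dim=-1))

    def train_conversion_model(model, samples, epochs=5, lr=1e-3):
        """samples: iterable of (text, linguistic, timbre, acoustic) tensor tuples."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for text, linguistic, timbre, acoustic in samples:
                optimizer.zero_grad()
                loss = loss_fn(model(text, linguistic, timbre), acoustic)
                loss.backward()
                optimizer.step()
        return model  # used as the conversion model after training

    # Toy training set: 4 mini-batches of 8 frame-level samples.
    samples = [(torch.randn(8, 128), torch.randn(8, 256), torch.randn(8, 64), torch.randn(8, 80))
               for _ in range(4)]
    conversion_model = train_conversion_model(SimpleConversionModel(), samples, epochs=2)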
In some alternative implementations of the embodiment, the audio to be synthesized includes singing audio, and the target timbre audio includes singing audio having a target timbre corresponding to the singing audio.
Specifically, singing conversion can be achieved based on this implementation. When the execution subject performs singing conversion with a target timbre, the acquired audio to be synthesized may be singing audio, for example a passage of a song sung by any person: "fifty-six ethnic groups, fifty-six flowers, fifty-six siblings are one family". The execution subject then determines the corresponding linguistic features based on the singing audio; determines the target timbre based on the target audio; and determines the acoustic features based on the singing audio, the linguistic features, and the target timbre. Finally, the target timbre audio synthesized from the determined acoustic features may be singing audio with the target timbre corresponding to the original singing audio ("fifty-six ethnic groups, fifty-six flowers, fifty-six siblings are one family"). The target timbre may be the timbre of any star; the target timbre is not specifically limited in the present application.
This implementation realizes the conversion of any speaker's singing into singing with any target timbre, enriches the forms of audio synthesis, adds interest, and improves the user experience.
In some alternative implementations of the embodiment, the audio to be synthesized includes a singing audio in a first language, and the target timbre audio includes a singing audio in a second language having a target timbre corresponding to the singing audio in the first language, wherein the second language includes the first language.
Specifically, singing by any speaker in multiple languages with a target timbre can be realized based on this implementation. When the execution subject performs singing conversion across language and target timbre, the acquired audio to be synthesized may be singing audio in a first language, which may be any language such as Chinese, English, or French. Assuming the first language is Chinese, the singing audio in the first language may be a passage of a Chinese song sung by any person: "fifty-six ethnic groups, fifty-six flowers, fifty-six siblings are one family". The execution subject may then determine the corresponding singing audio in a second language based on the singing audio in the first language and pre-installed translation software; determine the corresponding linguistic features based on the singing audio in the second language; determine the target timbre based on the target audio; and determine the acoustic features based on the singing audio in the second language, the linguistic features, and the target timbre. Finally, the target timbre audio synthesized from the determined acoustic features may be singing audio in the second language with the target timbre corresponding to the Chinese singing audio. The target timbre may be the timbre of any star or animal; the target timbre is not specifically limited in this application. The second language may be a user-specified language different from the first language, or the same language as the first language. For example, when the first language is Chinese, the second language may be Chinese, English, or French; the types of the first and second languages are not specifically limited in the present application.
It can of course be understood that, in this implementation, the acoustic features may instead be determined based on the singing audio in the first language, the linguistic features, and the target timbre, after which the acoustic features carrying the target timbre are translated into the corresponding second language and the singing audio in the second language with the target timbre is synthesized from those translated acoustic features. The timing of the conversion from the first language to the second language is not specifically limited in the present application.
This implementation realizes the conversion of any speaker's singing in a first language into singing in a second language with any target timbre (including timbres not present in the training set of the conversion model), enriches the forms of audio synthesis, adds interest, and improves the user experience.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an audio synthesis apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the audio synthesizing apparatus 500 of the present embodiment includes: an acquisition unit 501, a linguistic feature determination unit 502, a target tone color determination unit 503, an acoustic feature determination unit 504, and a synthesis unit 505.
An acquisition unit 501 configured to acquire audio to be synthesized and target audio.
A linguistic feature determination unit 502 configured to determine a corresponding linguistic feature based on the audio to be synthesized.
A target tone color determination unit 503 configured to determine a target tone color based on the target audio.
An acoustic feature determination unit 504 configured to determine an acoustic feature based on the audio to be synthesized, the linguistic feature, and the target timbre.
And a synthesizing unit 505 configured to synthesize the target timbre audio based on the acoustic features and output.
In some optional implementations of the present embodiment, the linguistic feature determination unit 502 is further configured to: and determining the linguistic features corresponding to the audio to be synthesized according to the audio to be synthesized and the pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the linguistic features.
In some optional implementations of this embodiment, the target tone determination unit 503 is further configured to: determining an identity vector corresponding to the target audio according to the target audio and a pre-trained identity verification model, wherein the pre-trained identity verification model is used for representing the corresponding relation between the audio and the identity vector; and determining the target tone according to the identity vector.
In some optional implementations of the present embodiment, the acoustic feature determination unit 504 is further configured to: determine the text corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between audio and text; and determine the acoustic features according to the text, the linguistic features, the target timbre, and a pre-trained conversion model, wherein the pre-trained conversion model is used for representing the corresponding relation among text, linguistic features, timbre, and acoustic features.
In some optional implementations of this embodiment, the synthesis unit 505 is further configured to: synthesize the target timbre audio according to the acoustic features and the preset correspondence between acoustic features and audio.
In some optional implementations of this embodiment, the apparatus further comprises a training unit, not shown in fig. 5, configured to: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise texts, linguistic features, timbres and labeled acoustic features corresponding to the texts, the linguistic features and the timbres; taking the text, the linguistic features and the tone of the training samples in the training sample set as the input of an initial neural network model, taking the acoustic features corresponding to the input text, the linguistic features and the tone as the expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In some alternative implementations of the embodiment, the audio to be synthesized includes singing audio, and the target timbre audio includes singing audio having a target timbre corresponding to the singing audio.
In some alternative implementations of the embodiment, the audio to be synthesized includes a singing audio in a first language, and the target timbre audio includes a singing audio in a second language having a target timbre corresponding to the singing audio in the first language, wherein the second language includes the first language.
It should be understood that units 501 to 505 recited in the audio synthesis apparatus 500 correspond to respective steps in the method described with reference to fig. 2. Thus, the operations and features described above for the audio synthesis method are equally applicable to the apparatus 500 and the units included therein, and are not described in detail here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the audio synthesis method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the audio synthesis method provided herein.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and units, such as program instructions/units corresponding to the audio synthesis method in the embodiment of the present application (for example, the acquisition unit 501, the linguistic feature determination unit 502, the target timbre determination unit 503, the acoustic feature determination unit 504, and the synthesis unit 505 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the audio synthesis method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the audio synthesizing electronic apparatus, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected to the audio synthesis electronics through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The audio synthesis electronic device may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus 605 or other means, and are exemplified by the bus 605 in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the audio synthesizing electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present disclosure, acoustic features determined from the obtained audio to be synthesized serve as the basis for audio synthesis, and audio with the target timbre is synthesized from these acoustic features, which simplifies the audio synthesis process and improves the accuracy of synthesizing audio with a specific timbre.
In accordance with one or more embodiments of the present disclosure, there is provided an audio synthesis method including: acquiring audio to be synthesized and target audio; determining corresponding linguistic features based on the audio to be synthesized; determining a target timbre based on the target audio; determining acoustic features based on the audio to be synthesized, the linguistic features, and the target timbre; and synthesizing target timbre audio based on the acoustic features and outputting the synthesized audio.
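For orientation only, the following Python sketch arranges the five steps above into a single pipeline function. The component callables (linguistic_fn, timbre_fn, acoustic_fn, vocoder_fn) and the toy stand-ins used to exercise the pipeline are hypothetical placeholders introduced here for clarity; they are not interfaces defined by this disclosure.

```python
from typing import Callable
import numpy as np

def synthesize_with_target_timbre(
    audio_to_synthesize: np.ndarray,
    target_audio: np.ndarray,
    linguistic_fn: Callable[[np.ndarray], np.ndarray],
    timbre_fn: Callable[[np.ndarray], np.ndarray],
    acoustic_fn: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray],
    vocoder_fn: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Return audio carrying the content of `audio_to_synthesize` in the timbre of `target_audio`."""
    linguistic = linguistic_fn(audio_to_synthesize)                   # determine linguistic features
    timbre = timbre_fn(target_audio)                                  # determine target timbre
    acoustic = acoustic_fn(audio_to_synthesize, linguistic, timbre)   # determine acoustic features
    return vocoder_fn(acoustic)                                       # synthesize target timbre audio

# Toy stand-ins so the sketch runs end to end; a real system would plug in trained models.
wav_src = np.random.randn(16000).astype(np.float32)
wav_tgt = np.random.randn(16000).astype(np.float32)
out = synthesize_with_target_timbre(
    wav_src, wav_tgt,
    linguistic_fn=lambda a: a.reshape(-1, 160).mean(axis=1, keepdims=True),
    timbre_fn=lambda a: a[:256],
    acoustic_fn=lambda a, l, t: np.tile(l, (1, 80)),
    vocoder_fn=lambda m: m.ravel(),
)
```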
According to one or more embodiments of the present disclosure, determining the corresponding linguistic features based on the audio to be synthesized comprises: determining the linguistic features corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the linguistic features.
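As one possible illustration of obtaining linguistic features from a pre-trained recognition model, the sketch below reuses the wav2vec 2.0 ASR bundle shipped with torchaudio and treats its frame-level encoder outputs as the linguistic (content) features. This specific model and feature choice are assumptions made for the example; the disclosure does not name a particular recognition model.

```python
import torch
import torchaudio

# Assumed stand-in for the "pre-trained recognition model": a wav2vec 2.0 ASR bundle.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr_model = bundle.get_model().eval()

def linguistic_features(path: str) -> torch.Tensor:
    """Frame-level content features for the audio to be synthesized."""
    waveform, sr = torchaudio.load(path)                         # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)                # mix down to mono
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        layer_outputs, _ = asr_model.extract_features(waveform)  # list of per-layer features
    return layer_outputs[-1].squeeze(0)                          # (frames, feature_dim)
```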
According to one or more embodiments of the present disclosure, determining the target timbre based on the target audio comprises: determining an identity vector corresponding to the target audio according to the target audio and a pre-trained identity verification model, wherein the pre-trained identity verification model is used for representing the corresponding relation between the audio and the identity vector; and determining the target timbre according to the identity vector.
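The identity vector is typically a fixed-length speaker embedding. As a crude, self-contained stand-in for a trained identity verification model, the sketch below time-averages log-mel statistics and L2-normalizes them; an actual system would instead use a verification network (for example a d-vector or x-vector extractor), which this passage leaves unspecified.

```python
import numpy as np
import librosa

def identity_vector(wav: np.ndarray, sr: int = 16000, dim: int = 80) -> np.ndarray:
    """Toy identity vector: time-averaged, L2-normalized log-mel statistics of the target audio."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=dim)   # (dim, frames)
    log_mel = librosa.power_to_db(mel)
    vec = log_mel.mean(axis=1)                                       # collapse the time axis
    return vec / (np.linalg.norm(vec) + 1e-8)

# The resulting vector stands in for the target timbre that is conditioned on downstream.
target_timbre = identity_vector(np.random.randn(16000).astype(np.float32))
```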
According to one or more embodiments of the present disclosure, determining the acoustic features based on the audio to be synthesized, the linguistic features, and the target timbre comprises: determining a text corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the text; and determining the acoustic features according to the text, the linguistic features, the target timbre, and a pre-trained conversion model, wherein the pre-trained conversion model is used for representing the corresponding relation among the text, the linguistic features, the timbre, and the acoustic features.
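A minimal sketch of what such a conversion model could look like, assuming the text has already been embedded per frame and the acoustic features are mel-spectrogram frames; the frame-wise multilayer perceptron and all layer sizes are illustrative assumptions, not the architecture disclosed here.

```python
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    """Maps (text embedding, linguistic features, timbre vector) to acoustic features per frame."""
    def __init__(self, text_dim=256, ling_dim=768, timbre_dim=80, mel_dim=80, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + ling_dim + timbre_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, text_emb, linguistic, timbre):
        # text_emb: (frames, text_dim); linguistic: (frames, ling_dim); timbre: (timbre_dim,)
        frames = linguistic.size(0)
        timbre = timbre.unsqueeze(0).expand(frames, -1)   # repeat the speaker identity per frame
        x = torch.cat([text_emb, linguistic, timbre], dim=-1)
        return self.net(x)                                # (frames, mel_dim) acoustic features
```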
According to one or more embodiments of the present disclosure, the audio synthesis method further includes: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise texts, linguistic features, timbres, and labeled acoustic features corresponding to the texts, linguistic features, and timbres; training the initial neural network model by taking the text, linguistic features, and timbre of the training samples in the training sample set as its input and taking the acoustic features corresponding to the input text, linguistic features, and timbre as its expected output; and determining the trained initial neural network model as the conversion model.
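A compact training sketch matching that description, reusing the ConversionModel sketch above and synthetic tensors as the training sample set; the mean-squared-error loss, Adam optimizer, and tensor shapes are assumptions made for illustration.

```python
import torch

model = ConversionModel()                                  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

# Synthetic stand-in for a training sample set of (text, linguistic features, timbre, labeled acoustic features).
samples = [(torch.randn(120, 256), torch.randn(120, 768), torch.randn(80), torch.randn(120, 80))
           for _ in range(32)]

for epoch in range(5):
    for text_emb, linguistic, timbre, labeled_mel in samples:
        pred_mel = model(text_emb, linguistic, timbre)     # input: text, linguistic features, timbre
        loss = loss_fn(pred_mel, labeled_mel)              # expected output: labeled acoustic features
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# After training, the model serves as the conversion model used at synthesis time.
```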
According to one or more embodiments of the present disclosure, the audio to be synthesized includes singing audio, and the target timbre audio includes singing audio that corresponds to the singing audio and has the target timbre.
According to one or more embodiments of the present disclosure, the audio to be synthesized includes singing audio in a first language, and the target timbre audio includes singing audio in a second language that corresponds to the singing audio in the first language and has the target timbre, wherein the second language includes the first language.
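Tying the pieces together for the singing embodiment: a hypothetical end-to-end call might look like the sketch below, where a singing clip to be synthesized and a short reference clip of the target singer are read from disk and passed through the pipeline. The file names and the four component callables are placeholders to be replaced by trained models such as those sketched earlier; none of them are named by the disclosure.

```python
import soundfile as sf

# Placeholder inputs: a singing clip to convert and any short clip of the target singer's voice.
src, sr = sf.read("source_song.wav")
tgt, _ = sf.read("target_voice.wav")

converted = synthesize_with_target_timbre(
    src, tgt,
    linguistic_fn=linguistic_fn,   # e.g. ASR-encoder content features (see the earlier sketch)
    timbre_fn=timbre_fn,           # e.g. identity vector / speaker embedding (see the earlier sketch)
    acoustic_fn=acoustic_fn,       # e.g. a trained conversion model
    vocoder_fn=vocoder_fn,         # any vocoder that renders acoustic features to a waveform
)
sf.write("converted_song.wav", converted, sr)
```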
According to one or more embodiments of the present disclosure, there is provided an audio synthesis apparatus including: an acquisition unit configured to acquire audio to be synthesized and target audio; a linguistic feature determination unit configured to determine a corresponding linguistic feature based on the audio to be synthesized; a target timbre determination unit configured to determine a target timbre based on the target audio; an acoustic feature determination unit configured to determine an acoustic feature based on the audio to be synthesized, the linguistic feature, and the target timbre; and a synthesizing unit configured to synthesize target timbre audio based on the acoustic features and output the target timbre audio.
According to one or more embodiments of the present disclosure, the linguistic feature determination unit is further configured to: determine the linguistic features corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the linguistic features.
According to one or more embodiments of the present disclosure, the target timbre determination unit is further configured to: determine an identity vector corresponding to the target audio according to the target audio and a pre-trained identity verification model, wherein the pre-trained identity verification model is used for representing the corresponding relation between the audio and the identity vector; and determine the target timbre according to the identity vector.
According to one or more embodiments of the present disclosure, the acoustic feature determination unit is further configured to: determine a text corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the text; and determine the acoustic features according to the text, the linguistic features, the target timbre, and a pre-trained conversion model, wherein the pre-trained conversion model is used for representing the corresponding relation among the text, the linguistic features, the timbre, and the acoustic features.
According to one or more embodiments of the present disclosure, the audio synthesis apparatus further comprises a training unit configured to: acquire an initial neural network model; acquire a training sample set, wherein training samples in the training sample set comprise texts, linguistic features, timbres, and labeled acoustic features corresponding to the texts, linguistic features, and timbres; train the initial neural network model by taking the text, linguistic features, and timbre of the training samples in the training sample set as its input and taking the acoustic features corresponding to the input text, linguistic features, and timbre as its expected output; and determine the trained initial neural network model as the conversion model.
According to one or more embodiments of the present disclosure, the audio to be synthesized includes singing audio, and the target timbre audio includes singing audio that corresponds to the singing audio and has the target timbre.
According to one or more embodiments of the present disclosure, the audio to be synthesized includes singing audio in a first language, and the target timbre audio includes singing audio in a second language that corresponds to the singing audio in the first language and has the target timbre, wherein the second language includes the first language.
It should be understood that the above embodiments are merely exemplary; the disclosure is not limited thereto and encompasses other methods known in the art that can implement audio synthesis. Steps may be reordered, added, or deleted within the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. An audio synthesis method, comprising:
acquiring audio to be synthesized and target audio;
determining corresponding linguistic features based on the audio to be synthesized;
determining a target timbre based on the target audio;
determining an acoustic feature based on the audio to be synthesized, the linguistic feature, and the target timbre;
and synthesizing target timbre audio based on the acoustic features and outputting the target timbre audio.
2. The method of claim 1, wherein the determining a corresponding linguistic feature based on the audio to be synthesized comprises:
determining the linguistic features corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the linguistic features.
3. The method of claim 1, wherein the determining a target timbre based on the target audio comprises:
determining an identity vector corresponding to the target audio according to the target audio and a pre-trained identity verification model, wherein the pre-trained identity verification model is used for representing the corresponding relation between the audio and the identity vector;
and determining the target timbre according to the identity vector.
4. The method of claim 1, wherein the determining acoustic features based on the audio to be synthesized, the linguistic features, and the target timbre comprises:
determining a text corresponding to the audio to be synthesized according to the audio to be synthesized and a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the audio and the text;
and determining the acoustic features according to the text, the linguistic features, the target timbre, and a pre-trained conversion model, wherein the pre-trained conversion model is used for representing the corresponding relation among the text, the linguistic features, the timbre, and the acoustic features.
5. The method of claim 4, wherein the method further comprises:
acquiring an initial neural network model;
acquiring a training sample set, wherein training samples in the training sample set comprise texts, linguistic features, timbres and labeled acoustic features corresponding to the texts, the linguistic features and the timbres;
taking the text, the linguistic features, and the timbre of the training samples in the training sample set as the input of the initial neural network model, taking the acoustic features corresponding to the input text, linguistic features, and timbre as the expected output, and training the initial neural network model;
and determining the trained initial neural network model as the conversion model.
6. The method of any of claims 1-5, wherein the audio to be synthesized comprises singing audio, and the target timbre audio comprises singing audio that corresponds to the singing audio and has the target timbre.
7. The method of any of claims 1-5, wherein the audio to be synthesized comprises singing audio in a first language, and the target timbre audio comprises singing audio in a second language that corresponds to the singing audio in the first language and has the target timbre, wherein the second language comprises the first language.
8. An audio synthesis apparatus comprising:
an acquisition unit configured to acquire audio to be synthesized and target audio;
a linguistic feature determination unit configured to determine a corresponding linguistic feature based on the audio to be synthesized;
a target timbre determination unit configured to determine a target timbre based on the target audio;
an acoustic feature determination unit configured to determine an acoustic feature based on the audio to be synthesized, the linguistic feature, and the target timbre;
and a synthesizing unit configured to synthesize target timbre audio based on the acoustic features and output the target timbre audio.
9. An audio synthesis electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202011270710.8A 2020-11-13 2020-11-13 Audio synthesis method, device, equipment and storage medium Pending CN112382274A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011270710.8A CN112382274A (en) 2020-11-13 2020-11-13 Audio synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011270710.8A CN112382274A (en) 2020-11-13 2020-11-13 Audio synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112382274A (en) 2021-02-19

Family

ID=74582551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011270710.8A Pending CN112382274A (en) 2020-11-13 2020-11-13 Audio synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112382274A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002132281A (en) * 2000-10-26 2002-05-09 Nippon Telegr & Teleph Corp <Ntt> Method of forming and delivering singing voice message and system for the same
KR20020049061A (en) * 2000-12-19 2002-06-26 전영권 A method for voice conversion
US20090037179A1 (en) * 2007-07-30 2009-02-05 International Business Machines Corporation Method and Apparatus for Automatically Converting Voice
CN104427294A (en) * 2013-08-29 2015-03-18 中兴通讯股份有限公司 Method for supporting video conference simultaneous interpretation and cloud-terminal server thereof
KR20160056104A (en) * 2014-11-11 2016-05-19 주식회사 보쿠 Analyzing Device and Method for User's Voice Tone
CN107731240A (en) * 2016-08-12 2018-02-23 黑莓有限公司 System and method for Compositing Engine sound
CN111161695A (en) * 2019-12-26 2020-05-15 北京百度网讯科技有限公司 Song generation method and device
CN111583894A (en) * 2020-04-29 2020-08-25 长沙市回音科技有限公司 Method, device, terminal equipment and computer storage medium for correcting tone in real time
CN111667814A (en) * 2020-05-26 2020-09-15 北京声智科技有限公司 Multi-language voice synthesis method and device
CN111798821A (en) * 2020-06-29 2020-10-20 北京字节跳动网络技术有限公司 Sound conversion method, device, readable storage medium and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539239A (en) * 2021-07-12 2021-10-22 网易(杭州)网络有限公司 Voice conversion method, device, storage medium and electronic equipment
CN113539239B (en) * 2021-07-12 2024-05-28 网易(杭州)网络有限公司 Voice conversion method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
KR102581346B1 (en) Multilingual speech synthesis and cross-language speech replication
US9754580B2 (en) System and method for extracting and using prosody features
Holmes Speech synthesis and recognition
US11881210B2 (en) Speech synthesis prosody using a BERT model
JP2023535230A (en) Two-level phonetic prosodic transcription
CN112382270A (en) Speech synthesis method, apparatus, device and storage medium
CN112382267A (en) Method, apparatus, device and storage medium for converting accents
KR102594081B1 (en) Predicting parametric vocoder parameters from prosodic features
US11475874B2 (en) Generating diverse and natural text-to-speech samples
Astrinaki et al. Reactive and continuous control of HMM-based speech synthesis
CN111477210A (en) Speech synthesis method and device
CN112382274A (en) Audio synthesis method, device, equipment and storage medium
Stan et al. Generating the Voice of the Interactive Virtual Assistant
CN112382269A (en) Audio synthesis method, device, equipment and storage medium
US20070055524A1 (en) Speech dialog method and device
KR20240035548A (en) Two-level text-to-speech conversion system using synthetic training data
CN117597728A (en) Personalized and dynamic text-to-speech sound cloning using a text-to-speech model that is not fully trained
Ajayi et al. Systematic review on speech recognition tools and techniques needed for speech application development
Louw et al. The Speect text-to-speech entry for the Blizzard Challenge 2016
JP2021148942A (en) Voice quality conversion system and voice quality conversion method
Evrard et al. Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis.
Chowdhury et al. A review-based study on different Text-to-Speech technologies
Raitio Voice source modelling techniques for statistical parametric speech synthesis
Astrinaki et al. sHTS: A streaming architecture for statistical parametric speech synthesis
Mishra et al. Emotion Detection From Speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination