CN115083386A - Audio synthesis method, electronic device, and storage medium - Google Patents

Audio synthesis method, electronic device, and storage medium

Info

Publication number
CN115083386A
CN115083386A (application CN202210656027.0A)
Authority
CN
China
Prior art keywords
audio
codebook
text
target
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210656027.0A
Other languages
Chinese (zh)
Inventor
吴梦玥
俞凯
李光伟
徐薛楠
戴凌峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202210656027.0A priority Critical patent/CN115083386A/en
Publication of CN115083386A publication Critical patent/CN115083386A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio synthesis method, an electronic device, and a storage medium. In the method, a text feature vector corresponding to a target sentence to be synthesized into audio is obtained; target codebook information corresponding to the text feature vector is determined; a target spectrogram corresponding to the target codebook information is determined by a preset codebook decoder; and synthetic audio corresponding to the target sentence is generated from the target spectrogram. Because the spectrogram is reconstructed from codebook information corresponding to the sentence's text feature vector, spectrogram construction is more lightweight and efficient while preserving reliable, high-quality synthesis. In addition, because audio is generated directly from unconstrained text input, the method can produce natural, vivid audio and achieves better quantitative results.

Description

Audio synthesis method, electronic device, and storage medium
Technical Field
The present invention relates to audio processing technologies, and in particular, to an audio synthesizing method, an electronic device, and a storage medium.
Background
With the increasing popularity of audio technology, audio synthesis requirements continue to permeate many aspects of daily life and work, such as voice navigation and news TTS broadcasting.
Currently, the sound generation techniques commonly available on the market mainly focus on generating a specific kind of sound, such as text-based speech generation or score-based music generation.
However, conventional audio generation technology mainly targets a specific sound class such as speech or music, and its form and content are therefore greatly limited. Recent research has used visual cues as a condition for synthesizing general audio.
In addition, there is little research in the industry on generating diverse natural sounds directly from natural language descriptions as cues. Unlike visual information, textual descriptions are compact in nature yet hide rich meaning, which places higher demands on the possibilities and complexity of audio generation.
In view of the above problems, the industry has yet to provide an effective solution.
Disclosure of Invention
An embodiment of the present invention provides an audio synthesis method, an electronic device, and a storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides an audio synthesis method, including: acquiring a text characteristic vector corresponding to a target sentence to be subjected to audio synthesis; determining target codebook information corresponding to the text feature vector; determining a target spectrogram corresponding to the target codebook information based on a preset codebook decoder; and generating a synthetic audio corresponding to the target sentence according to the target spectrogram.
In a second aspect, an embodiment of the present invention provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the above-described method.
In a third aspect, the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform the steps of the above-mentioned method of the present invention.
In a fourth aspect, the present invention also provides a computer program product comprising a computer program stored on a storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the above method.
The embodiment of the invention has the beneficial effects that:
when audio synthesis is performed on a sentence, the spectrogram is reconstructed from the codebook information corresponding to the sentence's text feature vector, so that spectrogram construction is more lightweight and efficient while ensuring reliable, high-quality synthesis; in addition, audio is generated directly from unconstrained text input, which can produce natural, vivid audio and achieves better quantitative results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 shows a flow diagram of an example of an audio synthesis method according to an embodiment of the application;
FIG. 2 shows a flow chart of an example of an implementation of step 110 in FIG. 1;
FIG. 3 shows a flowchart of an example of an implementation of step 230 in FIG. 2;
FIG. 4 shows a flow diagram of an example of an audio synthesis method according to an embodiment of the application;
FIG. 5 illustrates an architectural diagram of an example of a natural audio synthesis system according to an embodiment of the present application;
FIG. 6 shows a flow diagram of an example of extracting text features according to an embodiment of the application;
FIG. 7 illustrates a detailed information chart of the AudioCaps dataset used in accordance with an embodiment of the present application;
FIG. 8 shows a spectral simulation diagram of an audio sample generated from text according to an embodiment of the application;
FIG. 9 shows a comparison table of model performance indicators in different configurations according to an embodiment of the present application;
FIG. 10 shows a manual evaluation table for different generated audio;
FIG. 11 shows a comparison of a spectrogram generated under different input conditions with the spectrum of an actual live scene;
fig. 12 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used herein, a "module," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should be further noted that the terms "comprises" and "comprising," when used herein, cover not only the listed elements but also other elements not expressly listed or inherent to such processes, methods, articles, or devices. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It should be noted that, for a given sound category such as speech, the common practice is a text-to-speech (TTS) method, including current end-to-end deep learning approaches such as Tacotron and the non-autoregressive FastSpeech 2. For music generation, a commonly used technique is MusicVAE or the like, which generates the corresponding music from a musical score.
However, these techniques are directed to generating audio that strongly corresponds to the input, resulting in at least the following technical drawbacks:
On the one hand, such audio, whether speech or music, contains a single type of sound event and cannot escape that limitation, whereas real-life audio often contains multiple sound events, such as speech, noise, and background music, occurring simultaneously or alternately; the above technologies cannot generate such rich audio.
On the other hand, the generated audio is greatly limited in content. In speech generation, for example, even though the generated audio may show some stylistic variation, its content corresponds exactly to the input text, with no randomness.
Generally, to generate audio containing multiple sound types, the most direct approach is to construct an end-to-end correspondence between text and audio: for paired texts and audios in the training set, a model is used to learn the correspondence. In practice, however, this approach performs very poorly, because for long audio (10 s or more) the model parameters and computation needed to construct the spectrogram directly are very large, and because the earlier and later parts of an audio clip are usually strongly correlated. If audio is generated directly from text, the synthesized audio sounds stilted and differs greatly from natural sound.
Currently, generating vivid audio with fewer input restrictions has been achieved using visual cues (images or video) or onomatopoeic words. However, there is little research on how to generate diverse audio from free-text descriptions.
Fig. 1 shows a flow chart of an example of an audio synthesis method according to an embodiment of the application.
As shown in fig. 1, in step 110, a text feature vector corresponding to a target sentence to be subjected to audio synthesis is obtained. It should be understood that the sentence may be any text composed of arbitrary words and is not limited herein.
Specifically, the target sentence may be input to a preset language model to extract a text feature vector or a feature sequence corresponding to the target sentence.
In step 120, target codebook information corresponding to the text feature vector is determined.
Specifically, the text feature vector is a text feature sequence composed of a plurality of text feature points, and the codebook information corresponding to the text feature vector can be quickly obtained by inputting the text feature sequence into a pre-trained model.
In step 130, a target spectrogram corresponding to the target codebook information is determined based on a preset codebook decoder. Illustratively, the codebook decoder is preconfigured or pre-trained, and can quickly and accurately resolve the spectrum points corresponding to the respective codebook segments.
It should be noted that, due to the diversity of different sound classes, one of the biggest challenges is to effectively reduce the sampling space of the generation algorithm, and this problem can be effectively solved by using the codebook.
In step 140, a synthetic audio corresponding to the target sentence is generated according to the target spectrogram. Here, various non-limiting types of vocoders may be employed to encode the spectrum to obtain corresponding synthesized audio.
In the text-to-audio synthesis process, the spectrogram is produced via the codebook as an intermediate representation, and the global internal structure of the spectrogram is captured with a shorter representation. Compared with generating the spectrogram directly from text, this reduces the feature dimension needed to construct the spectrogram and improves the efficiency and quality of audio synthesis.
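For illustration only, the four steps above can be sketched in Python-style pseudocode as follows; the component names (text_encoder, index_transformer, codebook, codebook_decoder, vocoder) are hypothetical placeholders for the modules described in this application, not an actual API.

```python
import torch

def synthesize_audio(sentence: str, text_encoder, index_transformer,
                     codebook, codebook_decoder, vocoder) -> torch.Tensor:
    """Hypothetical end-to-end sketch of steps 110-140."""
    # Step 110: obtain the text feature vector for the target sentence.
    text_features = text_encoder(sentence)              # (num_tokens, feat_dim)

    # Step 120: determine codebook information (indices) from the text features.
    indices = index_transformer.sample(text_features)   # (m * t,) discrete codes

    # Step 130: decode the codebook entries into a target spectrogram.
    entries = codebook[indices]                          # (m * t, n_z)
    mel = codebook_decoder(entries)                      # (n_mels, T)

    # Step 140: convert the spectrogram into the synthesized waveform.
    return vocoder(mel)                                  # (num_samples,)
```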
It should be noted that the audio synthesis method provided by the embodiment of the present application differs from conventional TTS technology (i.e., synthesizing speech or music). Specifically, synthesized speech or synthesized music has a strong correspondence with its input; for example, speech TTS synthesizes speech with a specific timbre corresponding to the text. Such synthesized audio lacks randomness, whereas the audio corresponding to the same sentence in different contexts should vary to better fit real scenes. In contrast, in the embodiment of the present application, the text feature vector of the sentence is considered as a whole rather than computing each feature point individually, so that the generated audio can be more natural and vivid.
Hearing is one of the most important senses in human perception of the world. We can easily imagine the specific sound of an object simply from its name. However, because of the myriad sound categories in nature and their complex co-occurrence relationships, it is currently hard to imagine a piece of audio corresponding to a long sentence containing multiple sound events. For example, for the phrase "food is frying and a woman is speaking," the uncertainty in the temporal order and duration of the two sounds greatly affects the possible audio content.
In some service scenarios of the embodiment of the present application, diversified audios can be generated for the same text in different contexts; for example, the synthetic audios corresponding to "violin sounds" differ across contexts. In addition, the "audio" in the embodiment of the present application may be content other than "voice": for example, the synthesized audio corresponding to the text "sound of violin" may be a natural violin sound, and the violin sounds generated for different context scenes differ from each other and exhibit stronger randomness, which enables generation of an audio clip with multiple sound sources based on natural language input.
Furthermore, an emerging field is music generation, the most straightforward approach to which is to read note combinations from a musical score. However, similar to text-to-speech, generating music from a score greatly limits the content of the results, which is closely tied to the input conditions, with no flexibility or randomness.
With the development of general sound research, an emerging line of study focuses on audio generation with a given text as input, using onomatopoeic words to synthesize ambient sound. However, the diversity of sound types is limited, because onomatopoeia is limited in scope and sometimes has poor expressive power. In the embodiment of the present application, diverse and comprehensive audio content is generated under different input conditions using a cross-modal method, which expands the flexibility and diversity of audio generation.
Thus, producing flexible sounds is a trend in audio generation research. The present application aims to generate more diverse audio content, with different sound categories and complex relationships, from free text descriptions.
Fig. 2 shows a flowchart of an example of the implementation process of step 110 in fig. 1.
As shown in fig. 2, in step 210, the target sentence is segmented and tokenized according to a preset word list to determine the corresponding marked word sequence.
In step 220, position indication tokens (or position indicator symbols) are placed at the head and the tail of the marked word sequence, respectively, to generate a token sequence.
In step 230, a text feature vector corresponding to the token sequence is extracted based on a language model.
In this way, through the token identifiers, the language model can quickly identify the head and tail positions of the target sentence and characterize the sentence as a whole.
It should be noted that the lengths of different sentences vary, and the lengths of the feature sequences input to the model need to be consistent. In view of this, fig. 3 shows a flowchart of an example of the implementation process of step 230 in fig. 2.
As shown in fig. 3, in step 310, the token sequence is padded into a maximum number sequence having a fixed length, such that the maximum number sequence contains the token sequence and at least one placeholder. It should be understood that the fixed length of the maximum number sequence should be set large enough to exceed the length of the token sequence of any sentence, and the vacant positions can be represented by placeholders, so that the consistency of the feature length input to the language model is guaranteed.
In step 320, the maximum number sequence and the attention mask are input to the language model, so that the language model extracts the text feature vector corresponding to the token sequence. Here, the attention mask causes the language model to ignore the placeholders, e.g., by marking them as invalid characters. This ensures that the number of input features is constant while still allowing the language model to extract the valid information in the input features.
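As a concrete illustration of steps 210 to 320, the sketch below uses the Hugging Face transformers library to tokenize a sentence, pad it to a fixed length, and extract features under an attention mask. The choice of bert-base-uncased and a MaxNum of 30 are assumptions consistent with the configuration reported later in this application, not requirements of the method.

```python
import torch
from transformers import BertTokenizer, BertModel

MAX_NUM = 30  # assumed fixed token length (MaxNum)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
language_model = BertModel.from_pretrained("bert-base-uncased")

def extract_text_features(sentence: str) -> torch.Tensor:
    # Segment the sentence, add [CLS]/[SEP], pad with [PAD] placeholders
    # up to MAX_NUM, and build the 0/1 attention mask in one call.
    encoded = tokenizer(sentence, padding="max_length", truncation=True,
                        max_length=MAX_NUM, return_tensors="pt")
    with torch.no_grad():
        # The attention mask tells the model to ignore the [PAD] positions.
        outputs = language_model(input_ids=encoded["input_ids"],
                                 attention_mask=encoded["attention_mask"])
    return outputs.last_hidden_state.squeeze(0)  # shape (MAX_NUM, 768)
```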
Fig. 4 shows a flow chart of an example of an audio synthesis method according to an embodiment of the application.
In the embodiment of the present application, a system for generating audio from text is provided, which can generate high-quality, text-related audio. To efficiently reconstruct spectrograms from a fixed-dimension representation, a Vector-Quantized Generative Adversarial Network (VQGAN) is used to train a reference codebook. Specifically, a spectrogram can be constructed by a Transformer trained on text features extracted from a pre-trained language model. The generated spectrogram is finally converted into a waveform through a MelGAN vocoder.
As shown in fig. 4, in step 410, a text feature vector corresponding to a target sentence to be subjected to audio synthesis is obtained.
In step 420, a codebook index corresponding to the text feature vector is determined.
Specifically, the text feature vector may be parsed by a machine learning model with a self-attention mechanism to obtain a codebook index (or a codebook number) corresponding to the text feature vector.
In some examples of embodiments of the present application, a Transformer may be employed to resolve the relationship between text feature vectors and codebook indices.
In step 430, the matching target codebook representation is queried from a preset reference codebook according to the codebook index. Here, the reference codebook comprises a plurality of codebook indices and corresponding codebook representations.
In step 440, the target spectrogram corresponding to the target codebook representation is determined based on a vector-quantized generative adversarial network (VQGAN).
Here, the vector-quantized generative adversarial network is used to train the codebook to learn the spectral features of real audio, which effectively reduces the feature dimension used to construct the spectrogram.
Specifically, each codebook representation in the reference codebook is determined by the vector-quantized generative adversarial network based on natural audio samples. In addition, the natural audio samples contain annotated audio-text pairs, where the annotations provide the textual descriptions corresponding to the audio in each pair. Through such annotations, the task of the sentence can be better defined and complex audio can be generated from free text descriptions, so that the generated audio maintains high quality and naturalness, comprehensively reflects the sound categories described in the sentence and their relationships, and enables diversified synthetic audio for the same text in different contexts.
It should be noted that, besides the difficulty of obtaining useful information from text to generate audio, training also requires text that describes the audio. Unlike audio-video or audio-image pairs, which can be obtained naturally from an original video source, audio-text pairs require additional annotation. Annotations describing the same audio may vary greatly in word choice, overall style, and writing order.
In some implementations, the audio-text pairs in the natural audio samples are determined based on audio captioning data. Audio captioning is a task that aims to summarize audio in words; it likewise requires audio-text pairs for training, and various third-party audio captioning datasets, such as AudioCaps, Clotho, and MACS, may be used. For example, each audio clip in AudioCaps is 10 seconds long and has at least one text annotation, which makes it suitable as an existing dataset for audio generation.
In step 450, a synthesized audio waveform signal corresponding to the target spectrogram is generated based on a preset mel-spectrogram generative adversarial network (MelGAN).
In the embodiment of the present application, a generative adversarial network with vector quantization is used to train the codebook so that it learns the spectral characteristics of real audio, reducing the feature dimension of the constructed spectrogram. A Transformer-based method is then used to build the relationship between the original text and the spectrogram, and finally MelGAN generates the corresponding audio waveform signal from the spectrogram.
Fig. 5 shows an architectural schematic diagram of an example of a natural audio synthesis system according to an embodiment of the application.
As shown in fig. 5, the natural audio synthesis system is composed of a plurality of parts, and specifically includes a feature extraction part a, an index prediction part B, a reference codebook part C, and an audio waveform generation part D.
Specifically, in the feature extraction part a, the text is parsed into corresponding feature vectors using a pre-trained language model.
It should be noted that, in audio spectrum generation, a simple approach is to predict the optimal value of each pixel in the spectrogram. Specifically, the spectrogram can be flattened from a two-dimensional array into a one-dimensional array, and a Transformer model can autoregressively generate a value for each position. However, the number of pixels in the original spectrogram can be very large (well over 10,000 for a 10-second audio clip), making it infeasible for the Transformer to attend over them directly.
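The following back-of-the-envelope calculation, assuming the spectrogram shape (80, 800) and the encoded codebook grid (5, 50) reported later in this application, illustrates why autoregressing over raw spectrogram pixels is impractical compared with autoregressing over codebook indices.

```python
# Raw spectrogram: 80 mel bins x 800 frames (about 10 s of audio).
raw_positions = 80 * 800        # 64000 autoregressive steps over pixels

# VQGAN-encoded grid: 5 x 50 codebook indices.
code_positions = 5 * 50         # 250 autoregressive steps over indices

print(raw_positions, code_positions, raw_positions // code_positions)
# -> 64000 250 256: roughly a 256x shorter sequence for the Transformer
```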
In addition, before training the Transformer model to sample codebook indices, features of the textual description need to be extracted to provide the information required to generate the audio piece. For a given sentence, the text feature extraction process is shown in FIG. 6: the sentence "A man speaks as birds chirp and a dog barks" is first input to a tokenizer and segmented into tokens according to a prepared word list. The [CLS] token and the [SEP] token are placed at the beginning and end of the token sequence, respectively, before it is sent to the language model.
To fix the length of the token sequence input to the language model, the sequence is padded to a preset length MaxNum. Along with the token sequence, an attention mask of length MaxNum with values in {0, 1} is input to the language model: if the corresponding token is not [PAD], the attention-mask value is 1; otherwise it is 0, which causes the language model to ignore the placeholders. The token sequence is then sent to the pre-trained language model to extract the features F_text that are subsequently used for index prediction.
In index prediction part B, the codebook indices representing the spectral features are generated step by step from the text vector obtained in part A using a Transformer.
In reference codebook part C, the features of natural audio must first be summarized in a reference codebook, so that the larger spectral features can be represented and analyzed with smaller ones.
In the embodiment of the present application, to express the mel spectrogram more efficiently, the spectrogram is represented with a discrete codebook. VQGAN, an autoencoder with a learned codebook, is conventionally used to reconstruct images; here, VQGAN is adopted to reconstruct a given spectrogram $x \in \mathbb{R}^{M \times T}$ into $\hat{x} \in \mathbb{R}^{M \times T}$, where $M$ is the number of mel bins and $T$ is the time dimension.
Specifically, during training, the spectrogram is encoded into a smaller representation $\hat{z} \in \mathbb{R}^{m \times t \times n_z}$, with $n_z$ the encoding dimension. This small-scale representation is then flattened to $m \times t$ positions, which determines the order in which entries are selected from the learned codebook $Z = \{z_l\}_{l=1}^{|Z|} \subset \mathbb{R}^{n_z}$.
Here, the quantization operation $q$ maps each individual $\hat{z}_{ij}$ in $\hat{z}$ to its nearest codebook entry $z_l$, which can be described as:

$$z_q = q(\hat{z}) := \Big(\arg\min_{z_l \in Z} \|\hat{z}_{ij} - z_l\|\Big)_{ij} \qquad (1)$$

$z_q$ is the quantized representation, which can be further decoded by the decoder $G$ into a spectrogram $\hat{x}$:

$$\hat{x} = G(z_q) = G\big(q(E(x))\big) \qquad (2)$$
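The nearest-neighbor lookup of equations (1) and (2) can be written compactly as below. This is a generic sketch of vector quantization as used in VQ-VAE/VQGAN-style models, with tensor shapes chosen to match the notation above rather than taken from this application.

```python
import torch

def quantize(z_hat: torch.Tensor, codebook: torch.Tensor):
    """z_hat: (m, t, n_z) encoder output; codebook: (|Z|, n_z) entries."""
    m, t, n_z = z_hat.shape
    flat = z_hat.reshape(-1, n_z)                  # (m*t, n_z)
    # Euclidean distance from every position to every codebook entry z_l.
    dists = torch.cdist(flat, codebook)            # (m*t, |Z|)
    indices = dists.argmin(dim=1)                  # s_ij = argmin_l ||z_hat_ij - z_l||
    z_q = codebook[indices].reshape(m, t, n_z)     # equation (1)
    return z_q, indices.reshape(m, t)

# Equation (2): the spectrogram x_hat = G(z_q) is then produced by the decoder G.
```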
In audio waveform generation part D, the feature indices generated in B are rearranged and restored to spectral features using the reference codebook, and the final audio is generated with the audio generator of MelGAN (a mel-spectrogram generative adversarial network).
Specifically, the spectrogram is converted into a waveform by a vocoder; to maintain the fidelity and quality of the final audio, the vocoder must close the gap between the predicted spectrogram and the waveform. In the embodiment of the present application, a vocoder with the MelGAN architecture and a hinge-loss objective is adopted; in addition to the discriminator signal, a feature-matching objective is used for generator training.
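A minimal sketch of the hinge-loss and feature-matching objectives used for MelGAN-style vocoder training as described above. It assumes the discriminator returns both its final score and a list of intermediate feature maps, which is a common MelGAN convention rather than something specified here; the feature-matching weight is likewise an assumed value.

```python
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(real_score, fake_score):
    # Hinge loss for the discriminator on real vs. generated waveforms.
    return (F.relu(1.0 - real_score).mean()
            + F.relu(1.0 + fake_score).mean())

def generator_loss(fake_score, real_feats, fake_feats, lambda_fm=10.0):
    # Hinge-style adversarial term plus feature matching between the
    # discriminator's intermediate activations on real and generated audio.
    adv = -fake_score.mean()
    fm = sum(F.l1_loss(f_fake, f_real.detach())
             for f_real, f_fake in zip(real_feats, fake_feats))
    return adv + lambda_fm * fm
```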
Here, in order to minimize the difference between x and x ^ the counter-propagating gradient in equation (2) can be copied from the encoder to the decoder through the gradient estimator, although the quantization process in equation (1) is not differentiable. Specifically, the loss function of equation (2) can be described as:
Figure BDA0003687787460000102
wherein the content of the first and second substances,
Figure BDA0003687787460000103
and sg [ ·]Representing a stop gradient operation with zero gradient during back propagation and back propagation, and β is the weighting factor.
Further, VQGAN extends the loss function in equation (3) with a discriminator and a perceptual loss. Here, VQGAN is adapted to spectrograms, and the original perceptual loss is replaced with a learned perceptual audio-patch similarity. The final loss function of VQGAN on the spectrogram is:

$$\mathcal{L} = \mathcal{L}_{VQ}(E, G, Z) + \lambda\,\mathcal{L}_{GAN}\big(\{E, G, Z\}, C\big) + \|x_s - \hat{x}_s\|_2^2 \qquad (4)$$

where $C$ is a patch-based discriminator, and $x_s$ and $\hat{x}_s$ are the features of the ground-truth and generated spectrograms extracted by a VGGish model trained at scale on audio data.
In particular, by text feature F text A spectrum with potential converters can be generated. In view of text feature F text The transform model predicts the distribution of the next index si and the index s < i. The probability of a sequence p(s) is:
Figure BDA0003687787460000106
the Transformer model is trained by maximizing the log-likelihood of the conditional sequence:
Figure BDA0003687787460000111
During inference, codebook indices $\hat{s}_i$ are sampled from the predicted distribution $p$ until $i = m \times t$. These indices are reshaped to $(m, t)$, and the codebook lookup replaces each index with the corresponding entry of the codebook $Z$:

$$(z_q)_{ij} = z_l \quad \text{if } s_{ij} = l \qquad (7)$$

With $i \in \{1, 2, \dots, m\}$ and $j \in \{1, 2, \dots, t\}$, mapping the indices back to codebook entries yields $\hat{z}_q \in \mathbb{R}^{m \times t \times n_z}$, and the generated spectrogram is recovered by the decoder:

$$\hat{x} = G(\hat{z}_q) \qquad (8)$$
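The inference procedure of equations (5) to (8) can be sketched as the following autoregressive loop. The Transformer interface (a next_token_logits method taking the text features and previously sampled indices) is an assumed convention for the model described above, and the grid size and top-K value follow the configuration reported later in this application.

```python
import torch

@torch.no_grad()
def sample_spectrogram(text_feats, transformer, codebook, decoder,
                       m=5, t=50, top_k=512):
    """Sample m*t codebook indices conditioned on text, then decode (eqs. 7-8)."""
    indices = []
    for _ in range(m * t):
        prev = torch.tensor(indices, dtype=torch.long).unsqueeze(0)   # (1, i)
        logits = transformer.next_token_logits(text_feats, prev)      # (1, |Z|), assumed interface
        # Truncate to the top-K most likely codebook tokens (K = 512 in the experiments).
        top_vals, top_idx = logits.topk(top_k, dim=-1)
        probs = torch.softmax(top_vals, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)               # sample within top-K
        indices.append(int(top_idx.gather(-1, choice)))
    grid = torch.tensor(indices).reshape(m, t)                         # reshape to (m, t)
    z_q = codebook[grid]                                               # equation (7): index -> entry
    return decoder(z_q.unsqueeze(0))                                   # equation (8): x_hat = G(z_q)
```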
in the embodiment of the present application, by adjusting the data set of MelGAN, better encoded voice can be realized, and text-audio pairs can be used as training data. Compared with the WaveNet and other vocoders, MelGAN accelerates the reconstruction process to a great extent and maintains good quality.
In some examples of the embodiment of the present application, AudioCaps, a human-captioned subset of the complete AudioSet, may be employed as the main dataset. In practice, 49,501 audio clips can be downloaded for the training set, with 495 for validation and 964 for testing. Each audio clip is approximately 10 seconds long, and each training clip has a corresponding textual description. Since each audio clip in the validation and test sets has five reference descriptions, evaluation may be performed per sentence for convenience in assessing the performance of the generative model.
In addition to its text caption, each audio clip has a corresponding sound-event label in AudioSet. Although these reference sound labels are not used during training, they can serve as a reference to determine whether the generated spectrum is similar to the original spectrum at the classification level. In the embodiment of the present application, the detailed information of the AudioCaps dataset used is shown in the chart of fig. 7.
Statistically, the average number of words per caption is 8.71. To balance token length against model size, a final maximum token length MaxNum is set to cover most cases. The reference sound tags of AudioCaps audio clips average 2.73, indicating that most clips contain two or more sound events within a 10-second clip. Unlike sound generation for a single sound category, such as speech or music synthesis, generating diverse and vivid sounds from conditioning text requires a model familiar with various sound types and able to parse natural language into an operable form. The task of text-to-audio generation is therefore more challenging and calls for finer-grained generation criteria.
It should be noted that sound generation in the embodiment of the present application is conditioned on the textual description; the representativeness of the text features therefore mainly affects synthesis quality. To observe the influence of the input text features on generation performance, several representation methods are compared comprehensively. Specifically, a pre-trained BERT-base (uncased) model may be employed as the standard text feature extractor. In addition, a smaller BERT-medium model may be used in its place to weaken the expressiveness of the input features and ultimately reduce the number of input features.
In particular, the output embedding of the first hidden state of the language model represents the feature of the [CLS] token. Owing to BERT's pretraining tasks, this feature has shown validity and operability in classification tasks.
Thus, with MaxNum clipped to 1 for the single-feature settings, there are five sets of comparison results:
(a) Featureless: the text features are replaced with random embeddings of the same shape, i.e., unconditional sampling; the generated spectrogram thus bears no relation to the input text.
(b) A single feature extracted by the smaller language model.
(c) A single feature extracted by the standard language model.
(d) The complete features extracted by the smaller language model.
(e) The complete features extracted by the standard language model.
To objectively assess the performance of the text-to-audio generation system, both the generated spectrograms and the reconstructed audio segments (waveforms) need to be analyzed. Specifically, three metrics are used to assess the quality of the generated spectrograms: FID, PSNR, and MMKL, described below.
Since a spectrogram is a single-channel image, the FID metric can be used to measure how closely the generated spectrogram resembles the original. FID requires a classifier to compute the Fréchet distance between features of an intermediate layer. However, while FID measures the overall quality of the generated spectrograms well by assuming the features follow a multivariate normal distribution, it says little about how an individual spectrogram matches the original spectrogram and the conditioning text.
The peak signal-to-noise ratio (PSNR) represents the ratio between the maximum possible power of the signal and the power of the corrupting noise that affects its representation fidelity. By introducing PSNR, the average mathematical difference between the generated spectrum and the original spectrum can be calculated.
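For concreteness, PSNR between a reference and a generated spectrogram can be computed as follows; treating the reference spectrogram's maximum value as the peak signal is an assumption made for illustration.

```python
import torch

def psnr(reference: torch.Tensor, generated: torch.Tensor) -> torch.Tensor:
    """Peak signal-to-noise ratio between two spectrograms of equal shape."""
    mse = torch.mean((reference - generated) ** 2)
    peak = reference.max()                 # maximum possible signal value (assumed)
    return 10.0 * torch.log10(peak ** 2 / mse)
```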
MMKL is a modification of MKL. MKL computes the output-distribution distance between the spectrogram generated under a given input condition and the real spectrogram. Since the original Melception was trained on VGGSound, with only one sound tag per segment, while most sound segments in AudioCaps have two or more sound classes, the Softmax activation layer is replaced with Sigmoid and Melception is trained on AudioSet for evaluation. MKL is therefore extended here to MMKL (multi-class MKL).
On the other hand, the embodiment of the present application uses a vocoder to generate the waveform for a given spectrogram, and the quality of the waveform can be further evaluated subjectively, i.e., with manual evaluation of the quality and relevance of the generated waveform. For each given textual description, the scorer rates the "relevance" and the "quality" of the generated audio. Here, "relevance" measures how well the audio matches the text description, including whether the audio covers all sound events in the given text and whether these sounds occur in a reasonable way (for example, if the text states that a dog barks after a man speaks, but the two sounds are reversed in the generated audio, the "relevance" score is reduced). In addition, "quality" represents how similar the sounds in the generated audio are to real-life sounds and how difficult they are to identify (for example, if the text contains a description of noise, the noise in the audio does not directly result in a low "quality" score).
For data pre-processing, each audio segment is approximately 10 seconds long at 32,000 Hz; frames are extracted every 12.5 ms with a window size of 50 ms. The number of mel bins is 80. The mel spectrogram is finally cropped to a shape of (80, 800). The pre-trained language model is BERT, which encodes the textual description as features taken from all hidden states, of shape (30, 768).
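The pre-processing above corresponds roughly to the following torchaudio sketch. The FFT size and the log compression are assumptions, since this application specifies only the sampling rate, hop size, window size, and number of mel bins.

```python
import torch
import torchaudio

SAMPLE_RATE = 32000
HOP = int(0.0125 * SAMPLE_RATE)     # 12.5 ms hop -> 400 samples
WIN = int(0.050 * SAMPLE_RATE)      # 50 ms window -> 1600 samples

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=2048,  # n_fft >= window length (assumed)
    win_length=WIN, hop_length=HOP, n_mels=80)

def preprocess(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: mono 1-D tensor sampled at 32 kHz (about 10 s long)."""
    mel = mel_transform(waveform)           # (80, frames), roughly 800 frames for 10 s
    mel = torch.log(mel.clamp(min=1e-5))    # log compression (assumed)
    return mel[:, :800]                     # crop to the (80, 800) shape used above
```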
When training the codebook, the codebook length |Z| is set to 1024. The codebook encoder consists of two residual blocks with stacked 2D convolutions and converts the spectrogram to (5, 50), i.e., the shape of the encoded representation ẑ and the codebook representation ẑ_q; the decoder mirrors the encoder. The adversarial part of the loss in equation (4) is zeroed during the initial phase of codebook training to stabilize the model. For the Transformer part, a GPT-2-based architecture is used with a hidden dimension of 1024. The codebook-index embeddings and the text features are projected to 1024 dimensions through a fully connected (FC) layer. The output of the Transformer passes through a 1024-way classifier ending in Softmax, which learns the distribution of the next codebook token. During inference, the distribution of the next codebook token is truncated to the top-K probabilities to control sample diversity; K ranges from K = |Z| = 1024 down to K = 1, and K = 512 was selected in the experiments.
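For readability, the hyperparameters reported in this paragraph can be collected into a small configuration object such as the one below; the field names are chosen here, and the maximum token length is an assumption inferred from the feature shape (30, 768) given above.

```python
from dataclasses import dataclass

@dataclass
class TextToAudioConfig:
    # Codebook / VQGAN settings reported above.
    codebook_size: int = 1024        # |Z|
    code_grid: tuple = (5, 50)       # shape of the encoded representation
    # Transformer settings.
    hidden_dim: int = 1024           # GPT-2-style hidden size
    vocab_size: int = 1024           # classifier over codebook tokens
    top_k: int = 512                 # truncation used at inference time
    # Text-feature settings.
    max_tokens: int = 30             # MaxNum (assumed from the (30, 768) feature shape)
    text_feat_dim: int = 768         # BERT hidden size
```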
The Melception model used for evaluation is trained on AudioSet with a binary cross-entropy loss and achieves an audio tagging accuracy of 33.4%.
Further, the inventors have also evaluated the performance of the text audio generation system proposed in the present application both quantitatively and qualitatively.
Fig. 8 shows a spectral simulation diagram of audio samples generated from text according to an embodiment of the application. The samples, selected from textual descriptions in the test set, are above average in both quality and fidelity.
In terms of quantitative results, we examine how the conditioning features affect the performance of audio generation. In particular, a final score can be computed to evaluate the similarity between the generated spectrogram and the ground-truth spectrogram. Since there was no previous work on text-to-audio generation, the best model (e) is compared with the models in the different configurations mentioned above. As shown in the chart of fig. 9, (a) lists the results without any input condition, which are taken as the lower bound of PSNR and the upper bound of MMKL; (b, c) compare the effect of feature size under the single-feature condition; similarly, (d, e) mainly illustrate the overall behavior with 30 features under the two model sizes.
From the data in fig. 9, the following conclusions can be drawn: (1) If sampling is performed without an input text condition, the PSNR lower bound and the MMKL upper bound in (a) are 13.69 and 15.80, respectively. (2) Adding even a single text feature shifts the two metrics markedly, to 14.78 and 10.43, illustrating the importance of the input text information for audio generation. (3) From (c, e), increasing the number of text features improves the relevance (higher PSNR and lower MMKL) and the overall generation quality (lower FID). It can therefore be concluded that, as the input information increases, the audio generation model provided by the embodiment of the present application can better distinguish the sound sources in the input sentence and the relationship between them (earlier, later, or simultaneous), resulting in more reasonable audio output. (4) As seen from (d, e) and (b, c), using text features of small feature size extracted from a small language model has a slight negative effect on generation quality. However, this is not a critical factor, since the audio generated from a sentence mainly depends on the mentioned sound classes and their temporal relations; subtle changes in the representation typically do not cause significant changes at the level of the spectrogram (audio).
It should also be noted that the FID score in the chart shown in fig. 9 is abnormally low, which may be caused by two factors: (1) The Melception classifier was trained on AudioSet data. In the embodiment of the present application, the last layer of the Melception model is replaced from Softmax with Sigmoid to accommodate the multi-label classification task in AudioSet (audio samples have one or more tags). This may cause the penultimate-layer embeddings to violate the strong assumption that they follow a multivariate normal distribution. (2) FID is not a well-suited metric for datasets whose samples vary widely over different time periods. Nevertheless, the change in FID across the different configurations still indicates the effectiveness of the text-conditional generation model provided by the embodiment of the present application.
Therefore, when generating corresponding audio from text, the audio generated by the model provided by the embodiment of the present application is better than audio generated without text features or with only partial text features.
In addition to the above quantitative analysis results to demonstrate the superiority of the system provided by the embodiments of the present application, the inventors have also organized multiple scorers to score and evaluate the performance of the generated audio to better determine whether the audio generation has high quality and relevance.
Specifically, 200 samples were randomly drawn from the 964 samples generated for the AudioCaps test set. For each textual description, four audios are provided to the scorer: (1) the original audio from the AudioCaps test set; (2) audio generated without text features; (3) audio conditioned on a single feature; and (4) audio conditioned on the complete features (with the best quantitative results). Each sample (one text description and four audios) was scored by four scorers and the scores were averaged. Fig. 10 shows the manual evaluation table for the different generated audios, where 10 is the full score.
As can be seen from the manual evaluation results, the relevance and quality of the generated audio increase with the increase of the number of the conditional text features. Text feature settings significantly affect relevance, but contribute less to quality, consistent with quantitative results. Based on the graphical illustration of fig. 10, the audio generated by the model provided by the embodiments of the present application exceeds the baseline in both text Relevance (Relevance) and audio Quality (Quality).
Furthermore, when sampling is performed using only one text feature, the quality of the generated audio is differentiated if only one sound is mentioned in the conditional text.
On the other hand, when the input text involves different sound classes, sampling with more text features yields higher relevance scores, and two samples can be selected to visualize the difference in the generated spectrograms. Specifically, fig. 11 compares spectrograms generated under different input conditions with the real spectrogram: for sentence (a), the spectrogram from 30 features (30Feats) corresponds to lower sound quality than that from one feature (1Feat), whereas for sentence (b), the 30Feats spectrogram has better sound quality than the 1Feat one.
In addition, the expected sounds, snoring in (a) and sewing in (b), are manually located in the two spectrograms. When the text description is simple and contains only one sound category, as in (a), conditioning on only one feature improves the clarity of the generated sound clip. When the sentence is more complex and involves more sound classes, as in (b), using the full set of features produces more comprehensive audio (the sewing sound in the audio is no longer confined as it is with only one feature). This means that adding more features helps the model obtain different information from the text, but the attention paid to each individual sound event during sampling is reduced.
The present embodiment focuses on a completely new and challenging task: generating high-quality and diverse audio content given a textual description. Using a reference codebook trained with VQGAN, the audio model in the embodiment of the present application can efficiently reconstruct a mel spectrogram from a smaller representation, and a Transformer-based method closes the gap between the input text features and the predicted representation during sampling. The quantitative metrics PSNR, FID, and MMKL of the best model average 14.92, 0.851, and 10.05, respectively. In addition, human evaluation confirms the overall quality of the generated audio pieces and their relevance to the given text. The results in both respects show that the method provided by the embodiment of the present application can generate natural and diverse audio corresponding to the sound events in the text.
In the embodiment of the present application, the limitation to generating specific kinds of audio is broken, and diverse sounds are produced using natural language descriptions as cues. Unlike visual information, textual descriptions, while seemingly brief, hide rich meaning, making the corresponding audio more varied and complex. Specifically, a codebook is trained with VQGAN to learn a discrete representation of the spectrogram. For a given textual description, the extracted features are fed to a trained Transformer to generate codebook indices. The indices are then decoded into a spectrogram and further transformed into a waveform by a MelGAN-based vocoder. The generated waveform thus has high quality and fidelity while describing the given text well.
By the embodiment of the application, the audio is directly generated from the unconstrained text input for the first time. Various experimental data show that the scheme of the embodiment of the application can generate natural and vivid audio, and better quantitative results are realized.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above audio synthesis methods of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the audio synthesis methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform an audio synthesis method.
Fig. 12 is a schematic diagram of a hardware structure of an electronic device for executing an audio synthesis method according to another embodiment of the present invention, and as shown in fig. 12, the electronic device includes:
one or more processors 1210 and a memory 1220, with one processor 1210 being an example in fig. 12.
The apparatus for performing the audio synthesizing method may further include: an input device 1230 and an output device 1240.
The processor 1210, memory 1220, input device 1230, and output device 1240 may be connected by a bus or other means, such as by a bus connection in fig. 12.
The memory 1220, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the audio synthesis method in the embodiments of the present invention. The processor 1210 executes various functional applications of the server and data processing by executing nonvolatile software programs, instructions and modules stored in the memory 1220, namely, implements the audio synthesis method of the above-described method embodiment.
The memory 1220 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice interactive apparatus, and the like. Further, the memory 1220 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 1220 may optionally include memory located remotely from the processor 1210, and such remote memory may be connected to the audio synthesis device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1230 may receive input numerical or character information and generate signals related to user settings and function control of the audio synthesizing apparatus. The output device 1240 may include a display device such as a display screen.
The one or more modules are stored in the memory 1220 and, when executed by the one or more processors 1210, perform the audio synthesis method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
The electronic device of embodiments of the present invention exists in a variety of forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, among others.
(3) Portable entertainment devices: these devices can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data interaction functions, such as a vehicle-mounted device installed on a vehicle.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions, in essence or in the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An audio synthesis method, comprising:
acquiring a text feature vector corresponding to a target sentence to be subjected to audio synthesis;
determining target codebook information corresponding to the text feature vector;
determining a target spectrogram corresponding to the target codebook information based on a preset codebook decoder;
and generating a synthetic audio corresponding to the target sentence according to the target spectrogram.
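A minimal sketch of the overall flow of claim 1, in Python, assuming hypothetical text_encoder, codebook_predictor, codebook_decoder, and vocoder components (these names are illustrative and do not come from the patent):

    def synthesize(sentence, text_encoder, codebook_predictor, codebook_decoder, vocoder):
        """Illustrative end-to-end pipeline for the claimed audio synthesis method."""
        # 1. Acquire the text feature vector for the target sentence (claims 2-3).
        text_features = text_encoder(sentence)              # shape (T_text, D)
        # 2. Determine the target codebook information (claims 5-6).
        codebook_info = codebook_predictor(text_features)   # shape (T_code, D_code)
        # 3. Decode the codebook information into the target spectrogram.
        spectrogram = codebook_decoder(codebook_info)       # shape (n_mels, T_frames)
        # 4. Generate the synthetic audio from the spectrogram (claim 4).
        return vocoder(spectrogram)                         # waveform samples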
2. The method of claim 1, wherein the acquiring the text feature vector corresponding to the target sentence to be subjected to audio synthesis comprises:
segmenting and marking the target sentence according to a preset word list to determine a corresponding marked word sequence;
respectively setting position indication tokens at the head and the tail of the marked word sequence to generate a token sequence;
and extracting the text feature vector corresponding to the token sequence based on a language model.
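As an illustration of claim 2, a BERT-style tokenizer from the Hugging Face transformers library can play the role of the preset word list, with its [CLS] and [SEP] tokens acting as the head and tail position indication tokens; the specific tokenizer name is an assumption, not part of the patent:

    from transformers import BertTokenizerFast

    # Assumed tokenizer; the patent only requires a preset word list plus
    # head/tail position indication tokens.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    sentence = "a dog barks in the distance"
    token_ids = tokenizer.encode(sentence, add_special_tokens=True)
    # token_ids = [CLS_id, w_1, ..., w_N, SEP_id]: the marked word sequence
    # wrapped with head and tail position indication tokens.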
3. The method of claim 2, wherein the extracting the text feature vector corresponding to the token sequence based on the language model comprises:
padding the token sequence into a maximum-length sequence with a fixed length, such that the maximum-length sequence contains the token sequence and at least one placeholder;
and inputting the maximum-length sequence and an attention mask into the language model, so that the language model extracts the text feature vector corresponding to the token sequence; wherein the attention mask causes the language model to ignore each placeholder.
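A sketch of the padding and attention-mask step of claim 3; the maximum length and placeholder id are assumptions, and language_model stands for any Transformer language model that accepts an attention mask:

    import torch

    MAX_LEN = 128   # assumed fixed length
    PAD_ID = 0      # assumed placeholder id

    def pad_with_mask(token_ids):
        """Pad to MAX_LEN and build a mask so the placeholders are ignored."""
        ids = token_ids[:MAX_LEN]
        attention_mask = [1] * len(ids) + [0] * (MAX_LEN - len(ids))
        ids = ids + [PAD_ID] * (MAX_LEN - len(ids))
        return torch.tensor([ids]), torch.tensor([attention_mask])

    # input_ids, attention_mask = pad_with_mask(token_ids)
    # text_features = language_model(input_ids, attention_mask=attention_mask).last_hidden_state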
4. The method of claim 1, wherein the target spectrogram is a Mel spectrogram,
and wherein the generating the synthetic audio corresponding to the target sentence according to the target spectrogram comprises:
generating the synthetic audio corresponding to the Mel spectrogram based on a preset Mel-spectrogram generative adversarial neural network.
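The vocoding step of claim 4 can be sketched with a stand-in generator module; a real Mel-spectrogram GAN vocoder would be pretrained adversarially, and the layer sizes and hop length below are assumptions for illustration only:

    import torch
    import torch.nn as nn

    # Stand-in for a pretrained Mel-spectrogram GAN generator; in practice its
    # weights would come from adversarial training against real audio.
    class DummyMelGanGenerator(nn.Module):
        def __init__(self, n_mels=80, hop_length=256):
            super().__init__()
            self.net = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_length, stride=hop_length)

        def forward(self, mel):                  # (batch, n_mels, frames)
            return self.net(mel).squeeze(1)      # (batch, samples)

    vocoder = DummyMelGanGenerator().eval()
    with torch.no_grad():
        mel = torch.randn(1, 80, 200)            # dummy Mel spectrogram
        waveform = vocoder(mel)                  # (1, 200 * 256) synthetic audio samples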
5. The method of claim 1, wherein determining target codebook information corresponding to the text feature vector comprises:
determining a codebook index corresponding to the text feature vector;
querying a matched target codebook representation from a preset reference codebook according to the codebook index, wherein the reference codebook comprises a plurality of codebook indices and corresponding codebook representations;
and determining the target codebook information based on the target codebook representation.
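A minimal sketch of the lookup in claim 5, with an embedding table standing in for the preset reference codebook; the codebook size and dimension are assumptions:

    import torch

    NUM_CODES, CODE_DIM = 1024, 256
    reference_codebook = torch.nn.Embedding(NUM_CODES, CODE_DIM)    # rows = codebook representations

    codebook_indices = torch.tensor([[17, 503, 44, 987]])           # indices predicted from the text
    target_codebook_repr = reference_codebook(codebook_indices)     # (1, 4, CODE_DIM)
    # the target codebook information is then derived from these representations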
6. The method of claim 5, wherein the determining the codebook index corresponding to the text feature vector comprises:
and analyzing the text feature vector by using a machine learning model with a self-attention mechanism to obtain a codebook index corresponding to the text feature vector.
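One way to realize the self-attention model of claim 6 is a Transformer encoder over the text feature vectors followed by a per-position classifier over codebook indices; the layer sizes below are assumptions, and this is only a sketch of the idea rather than the patent's exact model:

    import torch
    import torch.nn as nn

    class CodebookIndexPredictor(nn.Module):
        """Self-attention model mapping text feature vectors to codebook indices."""
        def __init__(self, d_model=768, num_codes=1024, num_layers=4, nhead=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, num_codes)     # one logit per codebook entry

        def forward(self, text_features):                 # (B, T, d_model)
            hidden = self.encoder(text_features)
            return self.head(hidden).argmax(dim=-1)       # predicted codebook indices

    # indices = CodebookIndexPredictor()(torch.randn(1, 12, 768))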
7. The method of claim 5, wherein the codebook decoder is a vector-quantized generative adversarial network, and each codebook representation in the reference codebook is determined by the vector-quantized generative adversarial network based on natural audio samples; the natural audio samples comprise annotated audio-text pairs, and the annotations are used to describe differentiation information of the text corresponding to the audio in the audio-text pairs.
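The vector quantization underlying the codebook of claim 7 can be sketched as a nearest-neighbour lookup with a straight-through gradient, as commonly used when training VQ models; the commitment weight and tensor shapes are assumptions, not values from the patent:

    import torch
    import torch.nn.functional as F

    def vector_quantize(z, codebook, beta=0.25):
        """Quantize encoder outputs z (B, T, D) to their nearest codebook entries (K, D)."""
        dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, T, K)
        indices = dists.argmin(dim=-1)                     # codebook indices
        z_q = codebook[indices]                            # quantized representations
        # codebook and commitment losses typically used to learn the codebook
        loss = F.mse_loss(z_q, z.detach()) + beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                       # straight-through estimator
        return z_q, indices, loss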
8. The method of claim 7, wherein the audio-text pairs in the natural audio samples are determined based on audio caption data.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-8.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210656027.0A 2022-06-10 2022-06-10 Audio synthesis method, electronic device, and storage medium Pending CN115083386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210656027.0A CN115083386A (en) 2022-06-10 2022-06-10 Audio synthesis method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210656027.0A CN115083386A (en) 2022-06-10 2022-06-10 Audio synthesis method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN115083386A true CN115083386A (en) 2022-09-20

Family

ID=83251212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210656027.0A Pending CN115083386A (en) 2022-06-10 2022-06-10 Audio synthesis method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN115083386A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model
CN117789680B (en) * 2024-02-23 2024-05-24 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Similar Documents

Publication Publication Date Title
Valle et al. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis
CN112687259B (en) Speech synthesis method, device and readable storage medium
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN112837669B (en) Speech synthesis method, device and server
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
KR102272554B1 (en) Method and system of text to multiple speech
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Evans et al. Fast timing-conditioned latent audio diffusion
CN115083386A (en) Audio synthesis method, electronic device, and storage medium
Liu et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
Xue et al. Auffusion: Leveraging the power of diffusion and large language models for text-to-audio generation
CN114125506B (en) Voice auditing method and device
CN117275498A (en) Voice conversion method, training method of voice conversion model, electronic device and storage medium
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
CN116958343A (en) Facial animation generation method, device, equipment, medium and program product
CN115273856A (en) Voice recognition method and device, electronic equipment and storage medium
CN114495896A (en) Voice playing method and computer equipment
CN114724540A (en) Model processing method and device, emotion voice synthesis method and device
Huang et al. Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech
Li et al. Audio-Journey: Efficient Visual+LLM-aided Audio Encodec Diffusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination