CN116665636A - Audio data processing method, model training method, electronic device, and storage medium


Info

Publication number
CN116665636A
Authority
CN
China
Prior art keywords
text
audio
language
acoustic
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211145922.2A
Other languages
Chinese (zh)
Other versions
CN116665636B (en)
Inventor
龚雪飞 (Gong Xuefei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202211145922.2A priority Critical patent/CN116665636B/en
Publication of CN116665636A publication Critical patent/CN116665636A/en
Application granted granted Critical
Publication of CN116665636B publication Critical patent/CN116665636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application relate to speech synthesis in the field of artificial intelligence, and provide an audio data processing method, a model training method, an electronic device, and a storage medium. The method is applied to an electronic device and includes: acquiring an audio category corresponding to an acoustic feature to be processed; acquiring a clustering center vector corresponding to the audio category; inputting the text, duration, and clustering center vector corresponding to the acoustic feature to be processed into an audio encoder of an acoustic model, where the acoustic model is obtained by model training based on audio category samples and the clustering center vector samples corresponding to the audio category samples; and acquiring an output of the acoustic model. In the technical solution provided by the embodiments of the present application, the acoustic model is trained based on the audio categories and clustering center vectors, which reduces the complexity of the model while ensuring the audio effect.

Description

Audio data processing method, model training method, electronic device, and storage medium
[Technical Field]
The present application relates to the field of artificial intelligence, and in particular, to an audio data processing method, a model training method, an electronic device, and a storage medium.
[Background Art]
Speech synthesis technology converts input text information into audible sound information and can voice the content to be expressed in different timbres.
End-to-end text-to-speech (TTS) systems are the mainstream speech synthesis framework. Terminal devices need to obtain TTS capability from multiple suppliers through pre-installation and procurement, so timbres are not unified: the voice broadcast effect of different services on the same terminal device is inconsistent, the voice broadcast effect on different terminal devices is inconsistent, and the speech synthesized by the TTS engines installed on terminal devices is of poor quality and sounds strongly mechanical.
[Summary]
In view of this, the embodiments of the present application provide an audio data processing method, a model training method, an electronic device, and a storage medium that ensure the audio effect while reducing the complexity of the model.
In a first aspect, an embodiment of the present application provides an audio data processing method, where the method is applied to an electronic device, and the method includes:
acquiring an audio category corresponding to the acoustic feature to be processed;
acquiring a clustering center vector corresponding to the audio category;
inputting the text, duration, and clustering center vector corresponding to the acoustic feature to be processed into an audio encoder of an acoustic model, wherein the acoustic model is a model obtained by model training based on audio class samples and the clustering center vector samples corresponding to the audio class samples;
acquiring an output of the acoustic model.
According to the implementation mode provided by the embodiment of the application, the acoustic characteristics are classified based on the audio categories, so that the complexity of an acoustic model can be reduced and the audio effect can be ensured.
In an implementation manner of the first aspect, acquiring an audio class corresponding to an acoustic feature to be processed includes:
acquiring the acoustic features to be processed and texts and duration corresponding to the acoustic features to be processed;
averaging the acoustic features to be processed by combining the text and the duration time to generate an acoustic feature mean value;
based on a clustering algorithm, generating an audio category corresponding to the acoustic feature to be processed according to the acoustic feature mean value.
According to the implementation mode provided by the embodiment of the application, the expressive force of speech synthesis can be improved, and the rhythm sense of the synthesized audio can be increased, so that the speech quality of speech synthesis is improved.
In one implementation of the first aspect, the training process of the acoustic model includes:
acquiring a first audio class sample;
acquiring a first clustering center vector sample corresponding to a first audio class sample;
inputting the first cluster center vector samples to an audio encoder of an acoustic model, obtaining a first encoded output for the first cluster center vector samples;
Predicting a first predicted audio class corresponding to the first encoded output based on the first encoded output;
a decoder of the acoustic model is trained from the first predicted audio class and the first audio class samples.
In one implementation of the first aspect, a decoder for training an acoustic model from a first predicted audio class and a first audio class sample, comprises:
performing loss calculation aiming at the first audio class sample and the first predicted audio class to acquire a loss value;
the decoder of the acoustic model is trained on the loss values.
According to the implementation mode provided by the embodiment of the application, the acoustic model is trained based on the loss value, so that the accuracy of the acoustic model is improved.
In one implementation manner of the first aspect, the method further includes:
calculating a first predicted clustering center vector corresponding to the first predicted audio class according to the first encoded output;
and performing inference with the decoder of the acoustic model based on the first encoded output and the first predicted clustering center vector.
According to the implementation mode provided by the embodiment of the application, the acoustic characteristics are classified based on the phoneme category, so that the complexity of the acoustic model can be reduced.
In one implementation manner of the first aspect, acquiring a first audio class sample includes:
Acquiring a text and duration corresponding to the acoustic characteristics of the sample;
averaging the acoustic features of the sample by combining the text and the duration to generate an acoustic feature average value;
based on a clustering algorithm, a first audio class sample and a first clustering center vector sample are generated according to the acoustic feature mean.
In an implementation manner of the first aspect, before acquiring the audio category corresponding to the acoustic feature to be processed, the method further includes:
acquiring a text, wherein the content of the text comprises a first language and a second language;
splitting the text to generate a plurality of text words and parts of speech corresponding to each text word, wherein the method comprises the following steps: splitting the text based on the language difference between the first language and the second language;
vectorization is carried out on text words and parts of speech, and vectorization results are obtained;
aiming at the vectorization result corresponding to the first language, performing prosody prediction calculation to obtain a first prosody prediction result;
aiming at the vectorization result corresponding to the second language, performing prosody prediction calculation to obtain a second prosody prediction result;
and mixing the first prosody prediction result and the second prosody prediction result to output a multilingual mixed prosody prediction result.
According to the implementation mode provided by the embodiment of the application, miniaturization of the prosody prediction model can be realized, and the multilingual mixed prosody prediction result can be output.
In an implementation manner of the first aspect, splitting the text to generate a plurality of text words and parts of speech corresponding to each text word, and further includes:
and adding corresponding language marks for the text words.
In one implementation manner of the first aspect, vectorizing is performed for a text word and a part of speech to obtain a vectorized result, including:
performing text vectorization on the text words to generate text vectors;
carrying out language identification code vectorization on the text words to generate language vectors;
and performing part-of-speech vectorization on the parts of speech to generate part-of-speech vectors.
In an implementation manner of the first aspect, performing prosody prediction calculation for the vectorization result corresponding to the first language to obtain a first prosody prediction result, including:
aiming at the vectorization result corresponding to the first language, calculating through bidirectional long-short-time memory to generate a first predictive vector;
calculating a first prediction vector through a fuzzy neural network to generate a first training vector;
and calculating the first training vector through a normalized exponential function to generate a first prosody prediction result.
In an implementation manner of the first aspect, the content of the text further includes a third language;
splitting the text to generate a plurality of text words and parts of speech corresponding to each text word, wherein the method comprises the following steps: splitting the text based on the language difference among the first language, the second language and the third language;
the method further comprises the steps of:
aiming at the vectorization result corresponding to the third language, performing prosody prediction calculation to obtain a third prosody prediction result;
and mixing the first prosody prediction result, the second prosody prediction result, and the third prosody prediction result to output a multilingual mixed prosody prediction result.
According to the implementation mode provided by the embodiment of the application, miniaturization of the prosody prediction model can be realized, and the multilingual mixed prosody prediction result can be output.
In a second aspect, an embodiment of the present application provides a model training method, where the method is applied to an electronic device, and the method includes:
acquiring a first audio class sample;
acquiring a first clustering center vector sample corresponding to a first audio class sample;
inputting the first cluster center vector samples to an audio encoder of an acoustic model, obtaining a first encoded output for the first cluster center vector samples;
Predicting a first predicted audio class corresponding to the first encoded output based on the first encoded output;
a decoder of the acoustic model is trained from the first predicted audio class and the first audio class samples.
In one implementation of the second aspect, a decoder for training an acoustic model from a first predicted audio class and a first audio class sample, comprises:
performing loss calculation aiming at the first audio class sample and the first predicted audio class to acquire a loss value;
the decoder of the acoustic model is trained on the loss values.
In one implementation manner of the second aspect, the method further includes:
calculating a first predicted clustering center vector corresponding to the first predicted audio class according to the first encoded output;
and performing inference with the decoder of the acoustic model based on the first encoded output and the first predicted clustering center vector.
In one implementation manner of the second aspect, acquiring the first audio class sample includes:
acquiring a text and duration corresponding to the acoustic characteristics of the sample;
averaging the acoustic features of the sample by combining the text and the duration to generate an acoustic feature average value;
based on a clustering algorithm, a first audio class sample and a first clustering center vector sample are generated according to the acoustic feature mean.
In a third aspect, an embodiment of the application provides an electronic device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps as described in the first aspect.
In a fourth aspect, an embodiment of the application provides an electronic device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps as described in the second aspect.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the methods according to the first and second aspects.
[Brief Description of the Drawings]
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present application;
fig. 2 is a flow chart of an audio data processing method according to an embodiment of the application;
FIG. 3 is a flowchart illustrating another audio data processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio data processing device according to an embodiment of the present application;
FIG. 5 is a flow chart of model training according to one embodiment of the present application;
FIG. 6 is a flow chart of a model training sample acquisition according to an embodiment of the present application;
FIG. 7 is a flowchart of model inference according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an acoustic model structure according to an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the application.
[Detailed Description]
For a better understanding of the technical solution of the present application, the following detailed description of the embodiments of the present application refers to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "a and/or b" may represent three cases: a exists alone, both a and b exist, and b exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Running TTS systems on end-side devices requires full consideration of power consumption, latency, read-only memory (ROM), and random access memory (RAM) metrics.
Fig. 1 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present application, where, as shown in fig. 1, the speech synthesis system includes: a text regularization module 11, a prosody prediction module 12, a phonetic notation module 13, an acoustic model 14 and a vocoder 15. The text regularization module 11, the prosody prediction module 12 and the phonetic notation module 13 are front-end modules, and the acoustic model 14 and the vocoder 15 are back-end modules.
The text regularization module 11 is configured to use regular expressions to convert telephone numbers, times, amounts of money, units, symbols, e-mail addresses, dates, and the like, i.e., abbreviated or shorthand text in the input text, into standardized text. For example, "Sep.11th" needs to be standardized as "September Eleventh".
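As a minimal illustration of this step, the sketch below expands an abbreviated date with a regular expression. The month and ordinal tables, the function name normalize_date, and the regex pattern are illustrative assumptions, not the patent's actual rule set.

```python
import re

# Illustrative month / ordinal tables; a real text-regularization module uses much larger rule sets.
MONTHS = {"Jan": "January", "Feb": "February", "Sep": "September", "Oct": "October"}
ORDINALS = {"1st": "First", "2nd": "Second", "3rd": "Third", "11th": "Eleventh"}

def normalize_date(text: str) -> str:
    """Expand abbreviated dates such as 'Sep.11th' into spoken-form text."""
    def repl(m):
        month = MONTHS.get(m.group(1), m.group(1))
        day = ORDINALS.get(m.group(2), m.group(2))
        return f"{month} {day}"
    # Matches an abbreviated month, an optional dot and spaces, and an ordinal day.
    return re.sub(r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?\s*(\d{1,2}(?:st|nd|rd|th))\b",
                  repl, text)

print(normalize_date("Sep.11th"))  # September Eleventh
```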
The prosody prediction module 12 is configured to perform front-end prosody prediction using a deep network to predict pauses and/or accents between words in sentences of text.
Wherein prosodic text data is required to train the prosodic predictive model.
The phonetic notation module 13 is configured to convert characters into pinyin by using a deep network and to resolve polyphonic characters. Polyphone data is required to train the phonetic notation model, and phonetic notation can also be performed by using polyphone word segmentation plus a set of rules.
For example, "model" and "pattern" where the "model" word is a polyphonic word, the "model" word is a different tone. Therefore, when a sentence is input, the front-end module needs to accurately judge the pronunciation of the word to generate pronunciation information.
For another example, speech is a glyph of this text that needs to be converted into phonemes s p iy ch first to obtain linguistic information.
The acoustic model 14 is used to convert pinyin into audio acoustic features through a parallel computing network; timbre training can be performed on an audio corpus recorded by a professional voice talent, and Chinese-English mixed coding experiments can be performed. That is, acoustic features are generated based on the pronunciation information or linguistic information generated by the front-end modules, and the acoustic features include a mel spectrogram.
The vocoder 15 is used to synthesize the acoustic features into a sound waveform through a generative adversarial network (GAN) to output audio.
For example, the input text is "六合区现在4℃" (Luhe District is now 4°C). The text regularization module 11 processes it into "六合区现在四摄氏度" (Luhe District is now four degrees Celsius). The prosody prediction module 12 processes this into: 六合区 #2 现在 #1 四 #1 摄氏度 #3. The phonetic notation module 13 processes this into: lu4 he2 qu1 #2 xian4 zai4 #1 si4 #1 she4 shi4 du4 #3. The acoustic model 14 converts this into acoustic features, and the vocoder 15 synthesizes the acoustic features into a sound waveform to output audio.
For another example, the input text is "I deposited 300 kuai at the bank today". Text regularization: "I deposited three hundred kuai at the bank today". Word segmentation: I / today / go / bank / deposit / three hundred / kuai / money. Part-of-speech analysis: I (noun) / today (date, d) / go (verb, v) / bank (noun) / deposit (verb, v) / three hundred (number, num) / kuai (measure word, q) / money (noun). Prosody prediction: I #2 today #1 go to the #1 bank #2 deposit #1 three hundred kuai #4.
In the speech synthesis system shown in fig. 1, the prosody prediction implemented by the prosody prediction module 12 is a problem in the field of natural language processing (NLP), and one possible solution to this problem is to use a pre-training model followed by a multi-classification network.
In one embodiment of the present application, the pre-training model includes: a pre-trained language representation model (Bidirectional Encoder Representations from Transformers, BERT for short), a Chinese pre-training model (ALBERT), a model for generating word vectors (Word to vector, Word2Vec for short), an ultra-small Chinese pre-training model (Tiny ALBERT), or the like.
Fig. 2 is a flowchart of an audio data processing method according to an embodiment of the application. The data processing method is a front-end prosody prediction method: after the text (Input Text) is input into the prosody prediction module 12, the prosody prediction module 12 performs the following steps as shown in fig. 2.
S11, word segmentation is carried out on the text, and a plurality of text words and parts of speech corresponding to the text words are generated.
For example, the text is "I am very happy today" (我今天很开心). The text is segmented to generate the text words "I" (我), "today" (今天), "very" (很), and "happy" (开心). The part of speech corresponding to each text word is: "I" is a personal pronoun, "today" is a noun, "very" is an adverb, and "happy" is an adjective.
S121, training a text vectorization model according to the pre-training model.
S122, text vectorization (Char embedding) is carried out on the text words generated in S11 based on the text vectorization model in S121, and a text vector is generated.
S123, part-of-speech vectorization (Pos embedding) is performed on the parts of speech generated in S11 to generate a part-of-speech vector.
S131, the text vector generated in S122 and the part-of-speech vector generated in S123 are calculated through a bidirectional long short-term memory (BLSTM) network and a fuzzy neural network (Factorisation Machine supported Neural Network, FNN for short) to generate a training vector.
Specifically, the BLSTM trains and predicts the text vector generated in S122 and the part-of-speech vector generated in S123, generating a prediction vector.
The FNN computes the hidden vectors of a factorization machine (FM) on the prediction vector to generate a training vector; the training vector is the vector trained based on the FNN.
S141, calculating the training vector generated in S131 through a normalized exponential function (Softmax) to generate a prosody prediction result.
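To make the figure-2 pipeline concrete, the following PyTorch sketch chains character and part-of-speech embeddings, a BLSTM, a small feed-forward stage standing in for the FNN, and a softmax over prosody-break classes. The vocabulary sizes, hidden dimensions, the four-class output (#1/#2/#3/#4), and the plain feed-forward stand-in for the FNN stage are all assumptions for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Char embedding + POS embedding -> BLSTM -> feed-forward stage -> Softmax (illustrative sizes)."""
    def __init__(self, vocab_size=5000, pos_size=60, emb_dim=128, hidden=128, num_classes=4):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(pos_size, emb_dim)
        self.blstm = nn.LSTM(2 * emb_dim, hidden, batch_first=True, bidirectional=True)
        # Plain feed-forward network standing in for the FNN stage of the patent.
        self.fnn = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, char_ids, pos_ids):
        x = torch.cat([self.char_emb(char_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.blstm(x)                     # (batch, seq, 2 * hidden)
        probs = self.fnn(h).softmax(dim=-1)      # per-token probabilities over break classes
        return probs

model = ProsodyPredictor()
probs = model(torch.randint(0, 5000, (1, 6)), torch.randint(0, 60, (1, 6)))
print(probs.shape)  # torch.Size([1, 6, 4])
```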
Based on the flow shown in fig. 2, the pre-training model is followed by a multi-classification network to implement front-end prosody prediction. However, the data volume of this multi-classification network is particularly large, so it cannot be deployed on end-side devices.
In view of the above, an embodiment of the present application provides a prosody prediction method.
Fig. 3 is a flowchart of another audio data processing method according to an embodiment of the application.
After the text (Input Text) is input into the prosody prediction module 12, the prosody prediction module 12 performs the following steps as shown in fig. 3.
S31, splitting the text to generate a plurality of text words (subwords) and parts of speech corresponding to each text word. In the process of splitting the text, splitting is performed based on different languages, and corresponding language marks are added.
Specifically, the content of the text includes a first language and a second language; splitting the text to generate a plurality of text words and parts of speech corresponding to each text word, wherein the method comprises the following steps: the text is split based on the language difference between the first language and the second language.
The following description will be given by taking Chinese (first language)/English (second language) as an example.
It should be noted that the present application does not limit which languages are involved when text is split based on language differences. In specific implementations, a person skilled in the art may split the text based on any feasible language distinction, for example, splitting text for Chinese/German, or splitting text based on Chinese/English/German.
For example, in one embodiment, split labels (language labels) are added for different languages (languages), 0 representing Chinese and 1 representing English. Table one is a text split table.
For the input text 再也不愿意向邻国轻易低头了。Hello's (roughly, "no longer willing to easily bow to the neighboring country. Hello's"), the splitting results are shown in Table 1.

Table 1
Text word (subword): 再 / 也 / 不 / 愿意 / 向 / 邻国 / 轻易 / 低头 / 了 / 。 / Hello / ' / s
Part of speech: d / d / d / v / p / n / d / v / y / w / n / n / n
Language label: 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 1 / 1 / 1
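A hedged sketch of this language-aware splitting is shown below: each token is tagged with a language label (0 for Chinese, 1 for English) as in Table 1. Splitting Chinese character by character and keeping Latin runs whole is a simplification of whatever segmenter is actually used, and the regular expression is an assumption.

```python
import re

def split_with_lang_ids(text):
    """Split mixed Chinese/English text into tokens tagged 0 (Chinese/other) or 1 (English)."""
    tokens = []
    # CJK characters are taken one by one; Latin runs are kept as whole subwords; anything else is a single token.
    for match in re.finditer(r"[\u4e00-\u9fff]|[A-Za-z']+|[^\s]", text):
        tok = match.group()
        lang_id = 1 if re.search(r"[A-Za-z]", tok) else 0
        tokens.append((tok, lang_id))
    return tokens

print(split_with_lang_ids("低头了。Hello"))
# [('低', 0), ('头', 0), ('了', 0), ('。', 0), ('Hello', 1)]
```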
S322, performing text vectorization (Char embedding) on the text words generated in S31 to generate text vectors.
S323, performing language identification code vectorization (LangID embedding) on the text words generated in S31 to generate language vectors.
S324, performing part-of-speech vectorization (Pos embedding) on the parts of speech generated in S31 to generate part-of-speech vectors.
Then, based on the result of S323, the vectors of different languages are distinguished, and prosody prediction calculation is performed for the vectors of different languages, respectively.
Specifically, a Chinese/English text division is taken as an example.
S331, the vectors corresponding to the Chinese text among the text vector generated in S322, the language vector generated in S323, and the part-of-speech vector generated in S324 are calculated through a bidirectional long short-term memory (BLSTM) to generate a first prediction vector.
S341, the first prediction vector generated in S331 is calculated through the fuzzy neural network (Factorisation Machine supported Neural Network, FNN for short) to generate a first training vector.
S351, the first training vector generated in S341 is calculated through a normalized exponential function (Softmax) to generate and output the Chinese prosody prediction result (CN Output) (the first prosody prediction result).
S332, the vectors corresponding to the English text among the text vector generated in S322, the language vector generated in S323, and the part-of-speech vector generated in S324 are calculated through a BLSTM to generate a second prediction vector.
S342, the second prediction vector generated in S332 is calculated through the FNN to generate a second training vector.
S352, the second training vector generated in S342 is calculated through a Softmax to generate and output the English prosody prediction result (EN Output) (the second prosody prediction result).
After S351 and S352, the Chinese prosody prediction result (S351) and the English prosody prediction result (S352) are merged and output as a mixed Chinese-English (multilingual mixed) prosody prediction result.
In the embodiment shown in fig. 2, because TTS needs to support multilingual (e.g., Chinese-English) mixed reading, the vocabulary data is huge (e.g., the English vocabulary needs to include a large number of whole English words).
In the embodiment shown in fig. 3, splitting the input text into subwords based on language can reduce the data amount of the vocabulary (for example, the vocabulary stores English etymons, roots, and affixes instead of whole words, so the vocabulary data amount may be reduced by about 90%), thereby reducing the parameter count of the vectorization (embedding) models (the one or more models used in S322, S323, S324, S331, S332, S341, S342, S351, and S352).
Further, the vectorization (embedding) model supports a multilingual hybrid model, which can further reduce the amount of model data. For example, a shared vocabulary embedding may be trained, a multilingual multi-task prosody prediction model may be constructed, and the results corresponding to the subwords may be combined.
In the technical scheme of the data processing method provided by the embodiment of the application, the text is acquired, and the content of the text comprises a first language and a second language; splitting the text to generate a plurality of text words and parts of speech corresponding to each text word, wherein the method comprises the following steps: splitting the text based on the language difference between the first language and the second language; vectorization is carried out on text words and parts of speech, and vectorization results are obtained; aiming at the vectorization result corresponding to the first language, performing prosody prediction calculation to obtain a first prosody prediction result; aiming at the vectorization result corresponding to the second language, performing prosody prediction calculation to obtain a second prosody prediction result; the first prosody prediction result and the second prosody prediction result are output in a mixed mode, the mixed multilingual prosody prediction result is output, and miniaturization of the prosody prediction model can be achieved.
Further, in the speech synthesis system shown in fig. 1, the acoustic model 14 learns fine-grained prosody information of the audio by learning phoneme-level feature information of the audio, in one-to-one correspondence with the input audio, and a prosody prediction module may be added during inference.
Specifically, fig. 4 is a schematic structural diagram of an audio data processing device according to an embodiment of the present application. The data processing device is a prosody prediction module, and the acoustic model 14 implements fine-grained prosody prediction based on the prosody prediction module shown in fig. 4.
As shown in fig. 4, an input text (Input Text) is acquired and input to a language identification code (LangID) module 211, and the LangID module 211 performs language identification on the input text to generate a language identification code.
The language identification code module 211 inputs the generated language identification code into a language vectorization module (Lang embedding) 212, and the language vectorization module 212 performs language vectorization on the language identification code to generate a language vector.
The text is input to a text vectorization module (Text embedding) 213, and the text vectorization module 213 vectorizes the text to generate a text vector.
The language vectorization module 212 inputs the language vector into an attention (Transformer) model 214, and the text vectorization module 213 inputs the text vector into the attention model 214.
The Transformer model 214 extracts global features from the language vector generated by the language vectorization module (Lang embedding) 212 and the text vector generated by the text vectorization module (Text embedding) 213.
The acoustic features to be processed are obtained, averaged (AVG) together with the duration, and input to an audio encoder (Phone encoder) to generate an encoding result. The audio encoder includes two stacked one-dimensional convolution (Conv1D) + linear rectification function (ReLU) modules 215, a normalization (LN) + dropout (Dropout) module 216, and a linear layer (Linear Layer) 217.
The normalization (LN) + dropout (Dropout) module 216 includes an LN operation and a Dropout operation, where the LN operation addresses the gradient vanishing problem and the Dropout operation addresses the overfitting problem.
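A hedged PyTorch sketch of an audio encoder of this shape (two Conv1D + ReLU blocks, each followed by LayerNorm and Dropout, then a linear layer) is given below. The channel sizes, kernel width, and dropout rate are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class PhoneEncoder(nn.Module):
    """Two Conv1D+ReLU blocks with LayerNorm+Dropout, then a linear layer (sizes assumed)."""
    def __init__(self, in_dim=256, hidden=256, out_dim=256, kernel=3, dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(hidden), nn.LayerNorm(hidden)])
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):                                     # x: (batch, seq, in_dim)
        for conv, norm in zip(self.convs, self.norms):
            x = conv(x.transpose(1, 2)).transpose(1, 2)       # Conv1d expects (batch, channels, seq)
            x = self.dropout(norm(torch.relu(x)))
        return self.proj(x)

enc = PhoneEncoder()
print(enc(torch.randn(2, 10, 256)).shape)                     # torch.Size([2, 10, 256])
```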
The duration, the global features generated by the attention (Transformer) model 214, and the encoding result generated by the audio encoder (Phone encoder) are input to an upsampling (LR) module 218 to generate a speech signal.
The upsampling (LR) module 218 sends the speech signal to the Transformer model 219, and the Transformer model 219 processes the speech signal to generate speech features.
The Transformer model 219 sends the speech features to a linear projection module (Linear projection) 220, and the linear projection module 220 processes the speech features to generate a linear prediction result.
The linear projection module 220 sends the linear prediction result to a 5-layer convolutional post-processing network (5-conv-layer post-net) 221, and the post-processing network 221 processes the linear prediction result to generate reconstructed speech.
The linear prediction result generated by the linear projection module 220 is added (residual connection) to the reconstructed speech generated by the 5-layer convolutional post-processing network 221 to generate the audio acoustic features.
Based on the structure shown in fig. 4, the acoustic model 14 learns fine-grained prosody information as a regression problem, so fitting is difficult during inference and the model complexity must be increased for learning, which is not conducive to deployment on end-side devices.
In view of the foregoing, an embodiment of the present application provides a fine-grained prosody prediction method for the acoustic model: the acoustic features are clustered to obtain audio categories (Phone category) and the corresponding clustering center vectors (Central vector).
The acoustic model is then trained based on the audio categories and the clustering center vectors, which converts the learning of fine-grained prosody information by the acoustic model from a regression problem into a classification problem, so that the complexity of the model is reduced while the model effect is ensured.
Fig. 5 is a flow chart of model training according to an embodiment of the present application.
S500, acquiring an audio category for training a model and a corresponding clustering center vector.
Specifically, in S500, the acoustic feature mean values corresponding to the phoneme level are obtained, the acoustic feature mean values are clustered into N categories (N is a hyperparameter) by using a clustering algorithm, and the clustering center vectors of the N categories are obtained.
Fig. 6 is a flowchart of model training sample acquisition according to an embodiment of the present application. The electronic device performs the following steps as shown in fig. 6 to implement S500.
The input text (Input Text), duration (duration), and acoustic features (sample acoustic features) are acquired, and the acoustic features are averaged (AVG) in combination with the input text and the duration to generate an acoustic feature mean.
Specifically, the acoustic features include mel spectrograms (mel), linear prediction coefficients (lpc), linear spectra, and the like.
The acoustic feature mean is input to a clustering algorithm module 411, which generates N audio categories (Phone category) and the clustering center vectors of the audio categories (Central vector 1, Central vector 2, ..., Central vector N).
Specifically, in one embodiment, the clustering algorithm module 411 is configured with a clustering algorithm that includes: mean-shift clustering, the K-means clustering algorithm, a self-organizing map (SOM), the fuzzy C-means (FCM) clustering algorithm, and the like.
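The following sketch illustrates this sample-preparation step under stated assumptions: frame-level mel features are averaged per phoneme using the duration alignment, and the phoneme-level means are clustered with K-means into N audio categories with their clustering center vectors. The feature dimension, the value of N, and the use of scikit-learn's KMeans are illustrative choices, not requirements of the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def phoneme_level_means(mel, durations):
    """Average frame-level features over each phoneme's duration (the AVG step in Fig. 6)."""
    means, start = [], 0
    for d in durations:
        means.append(mel[start:start + d].mean(axis=0))
        start += d
    return np.stack(means)                          # (num_phonemes, feat_dim)

# Toy data: 100 mel frames of dimension 80, aligned to 5 phonemes.
mel = np.random.randn(100, 80)
durations = [25, 15, 30, 20, 10]
phone_means = phoneme_level_means(mel, durations)

N = 3                                               # number of audio categories (hyperparameter)
kmeans = KMeans(n_clusters=N, n_init=10, random_state=0).fit(phone_means)
audio_categories = kmeans.labels_                   # Phone category per phoneme
cluster_centers = kmeans.cluster_centers_           # Central vector 1 ... Central vector N
print(audio_categories, cluster_centers.shape)      # e.g. [0 2 1 0 2] (3, 80)
```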
S510, selecting a first real audio category (a first audio category sample), wherein the first real audio category is one real audio category generated in S500.
S520, inputting the input text (Input Text) to an audio encoder (Phone encoder) of the acoustic model, and outputting a first encoded output.
S530, predicting a first predicted audio category corresponding to the first encoded output according to the first encoded output.
Specifically, a prosody-level prediction module, the audio class predictor (Phone category predict), is trained to generate a corresponding audio category from the output of the audio encoder. S530 is implemented based on the audio class predictor (Phone category predict).
S540, performing loss calculation (loss) on the first real audio category and the first predicted audio category to obtain a loss value.
S550, training a decoder of the acoustic model according to the loss value calculated in S540.
The decoder of the acoustic model includes, among other things, the Transformer model 521 and the linear projection module 522 in fig. 8.
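A minimal sketch of the loss step in S540/S550 is shown below: the audio class predictor's logits are compared with the true audio categories using cross-entropy and the loss is backpropagated. The linear classifier standing in for the predictor, the Adam optimizer, and all shapes are assumptions; in the full model the gradients would also reach the decoder.

```python
import torch
import torch.nn as nn

encoder_dim, num_classes = 256, 8                   # illustrative sizes
category_predictor = nn.Linear(encoder_dim, num_classes)   # stand-in audio class predictor
optimizer = torch.optim.Adam(category_predictor.parameters(), lr=1e-3)

encoded = torch.randn(4, 12, encoder_dim)           # first encoded output (batch, phonemes, dim)
true_category = torch.randint(0, num_classes, (4, 12))      # first (real) audio class samples

logits = category_predictor(encoded)                # predicted audio class logits
loss = nn.functional.cross_entropy(logits.reshape(-1, num_classes), true_category.reshape(-1))
loss.backward()                                     # in the full model, gradients also flow into the decoder
optimizer.step()
print(float(loss))
```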
Further, on the basis of training the decoder of the acoustic model, inference (optimization) is also performed with the decoder of the acoustic model.
Fig. 7 is a flowchart of model inference according to an embodiment of the present application.
As shown in fig. 7, the method includes:
S400, acquiring an input text (Input Text), inputting the input text into an audio encoder (Phone encoder) of the acoustic model, and outputting a first encoded output.
S410, inputting the first encoded output from S400 into the audio class predictor (Phone category predict) to generate a first predicted clustering center vector.
Specifically, in S410, the first predicted clustering center vector is calculated based on the following formula:

Central_V = Σ_i (probability_i × Central_vector_i)    (Formula 1)

where the sum runs over the N audio categories, and in Formula 1:
Central_V is the predicted clustering center vector;
Central_vector_i is the clustering center vector corresponding to the i-th audio category;
probability_i is the probability value corresponding to the i-th audio category.
The audio class predictor (Phone category predict) is a multi-classification network: the first encoded output is input into the network, which outputs the audio category, and the probability value corresponding to each audio category is taken from the layer preceding the output of the network.
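A short sketch of Formula 1 follows: the class probabilities produced by the audio class predictor weight the clustering center vectors to give the predicted clustering center vector. The number of categories and the feature dimension are illustrative assumptions.

```python
import torch

# Illustrative shapes: N audio categories, feature dimension 80.
N, feat_dim = 8, 80
cluster_centers = torch.randn(N, feat_dim)          # Central_vector_1 ... Central_vector_N
class_logits = torch.randn(N)                       # audio class predictor output for one phoneme

probabilities = class_logits.softmax(dim=-1)        # probability_i for each audio category
# Formula 1: Central_V = sum_i probability_i * Central_vector_i
predicted_center = probabilities @ cluster_centers
print(predicted_center.shape)                       # torch.Size([80])
```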
S420, performing inference with the decoder of the acoustic model according to the first encoded output from S400 and the first predicted clustering center vector calculated in S410.
Fig. 8 is a schematic structural diagram of an acoustic model according to an embodiment of the present application.
As shown in fig. 8, the variable information predictor (variance predictor) 524 is configured to predict auxiliary acoustic features (e.g., fundamental frequency f0, energy, audio category, and/or duration) from the global features sent by the Transformer model 521 through a pre-trained network model.
The variable information predictor 524 under the non-autoregressive speech synthesis model (FastSpeech) framework includes: a one-dimensional convolution + linear rectification function module 525, a normalization + dropout module 526, a one-dimensional convolution + linear rectification function module 527, a normalization + dropout module 528, and a linear layer 529.
The audio class predictor is implemented based on the variable information predictor (variance predictor) 524. That is, S530 and S410 are implemented based on the variable information predictor (variance predictor) 524.
Further, after the acoustic model is trained, as shown in fig. 8, the operation process of the acoustic model includes:
and acquiring acoustic characteristics to be processed.
And acquiring an audio category (Phone category) corresponding to the acoustic feature to be processed according to the acoustic feature to be processed.
The clustering center vectors (Central vector 1, Central vector 2, ..., Central vector N) corresponding to the audio category are looked up.
The clustering center vectors (Central vector 1, Central vector 2, ..., Central vector N) are input into the audio encoder (Phone encoder), and the audio encoder calculates the clustering center vectors corresponding to the N audio categories to generate an encoding result.
The audio encoder includes a one-dimensional convolution (Conv1D) + linear rectification function (ReLU) module 517, a normalization (LN) + dropout (Dropout) module 518, and a linear layer (Linear Layer) 519.
The normalization (LN) + dropout (Dropout) module 518 includes an LN operation and a Dropout operation, where the LN operation addresses the gradient vanishing problem and the Dropout operation addresses the overfitting problem.
The text is acquired, and the input text (Input Text) is fed to a language identification code (LangID) module 513, which performs language identification on the input text to generate a language identification code.
The language identification code module 513 inputs the generated language identification code into a language vectorization module (Lang embedding) 514, and the language vectorization module 514 performs language vectorization on the language identification code to generate a language vector.
The text is input to a text vectorization module (Text embedding) 515, and the text vectorization module 515 vectorizes the text to generate a text vector.
The language vectorization module 514 inputs the language vector into the Transformer model 516, the text vectorization module 515 inputs the text vector into the Transformer model 516, and the Transformer model 516 computes the language vector and the text vector to generate global features.
The duration (duration), the global features generated by the attention (Transformer) model 516, and the encoding result generated by the audio encoder (Phone encoder) are input to an upsampling (LR) module 520 to generate a speech signal.
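The upsampling (LR) step can be sketched as a FastSpeech-style length regulator: each phoneme-level vector is repeated according to its duration to produce a frame-level sequence. The shapes, the simple additive fusion of global features and encoder output, and the integer durations below are assumptions for illustration.

```python
import torch

def length_regulate(phone_feats, durations):
    """Repeat each phoneme-level vector `duration` times to get a frame-level sequence."""
    # phone_feats: (num_phonemes, dim); durations: (num_phonemes,) integer frame counts
    return torch.repeat_interleave(phone_feats, durations, dim=0)

global_feats = torch.randn(5, 256)                  # Transformer output per phoneme (assumed dim)
encoder_out = torch.randn(5, 256)                   # phone-encoder output per phoneme
durations = torch.tensor([25, 15, 30, 20, 10])      # durations in frames

# Additive fusion of the two phoneme-level streams is an assumption of this sketch.
frame_feats = length_regulate(global_feats + encoder_out, durations)
print(frame_feats.shape)                            # torch.Size([100, 256])
```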
The upsampling (LR) module 520 sends the speech signal to the Transformer model 521, and the Transformer model 521 processes the speech signal to generate speech features.
The Transformer model 521 sends the speech features to a linear projection module (Linear projection) 522, and the linear projection module 522 processes the speech features to generate a linear prediction result.
The linear projection module 522 sends the linear prediction result to a 5-layer convolutional post-processing network (5-conv-layer post-net) 523, and the post-processing network 523 processes the linear prediction result to generate reconstructed speech.
The linear prediction result generated by the linear projection module 522 is added (residual connection) to the reconstructed speech generated by the 5-layer convolutional post-processing network 523 to generate the audio acoustic features.
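A hedged sketch of this final residual connection follows: a coarse mel prediction from the linear projection is refined by a 5-layer convolutional post-net and added back to itself, as in Tacotron/FastSpeech-style decoders. The plain stack of convolutions standing in for the post-net and the mel dimension are assumptions.

```python
import torch
import torch.nn as nn

mel_dim, frames = 80, 100
# Stand-in 5-layer post-net: five Conv1d layers that keep the mel dimension unchanged.
postnet = nn.Sequential(*[nn.Conv1d(mel_dim, mel_dim, kernel_size=5, padding=2) for _ in range(5)])

mel_before = torch.randn(1, frames, mel_dim)        # output of the linear projection module
mel_after = mel_before + postnet(mel_before.transpose(1, 2)).transpose(1, 2)  # residual connection
print(mel_after.shape)                              # torch.Size([1, 100, 80])
```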
With the acoustic model shown in fig. 8, the audio category (Phone category) and the corresponding clustering center vector (Central vector) are used as inputs, and the learning of fine-grained prosody information by the acoustic model is converted from a regression problem into a classification problem, so that the complexity of the model is reduced while the model effect is ensured.
An embodiment of the application also proposes a computer-readable storage medium in which a computer program is stored which, when run on a computer, causes the computer to perform the above-mentioned method.
An embodiment of the application also proposes an electronic device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps according to the embodiments of the application.
In particular, in one embodiment of the present application, the one or more computer programs are stored in the memory, where the one or more computer programs include instructions that, when executed by the apparatus, cause the apparatus to perform the method steps described in the embodiments of the present application.
Fig. 9 is a schematic diagram illustrating a hardware structure of an electronic device according to an embodiment of the application. As shown in fig. 9, the electronic device may include a processor 100, a communication module 120, a display 130, an indicator 140, an internal memory 150, an external memory interface 160, a universal serial bus (universal serial bus, USB) interface 170, a power management module 180, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device. In other embodiments of the application, the electronic device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 100 of the electronic device may be a system on chip (SoC), which may include a central processing unit (Central Processing Unit, CPU) and may further include other types of processors. For example, the processor 100 may be a PWM control chip.
The processor 100 may include, for example, a CPU, DSP, microcontroller, or digital signal processor, and may further include a GPU, an embedded Neural network processor (Neural-network Process Units, NPU), and an image signal processor (Image Signal Processing, ISP), and the processor 100 may further include a necessary hardware accelerator or logic processing hardware circuit, such as an ASIC, or one or more integrated circuits for controlling the execution of the program according to the present application, and the like. Further, the processor 100 may have a function of operating one or more software programs, which may be stored in a storage medium.
Processor 100 may include one or more processing units. For example: processor 100 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate components or may be integrated in one or more processors. In some embodiments, the electronic device may also include one or more processors 100. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
In some embodiments, processor 100 may include one or more interfaces. The interfaces may include inter-integrated circuit (inter-integrated circuit, I2C) interfaces, inter-integrated circuit audio (integrated circuit sound, I2S) interfaces, pulse code modulation (pulse code modulation, PCM) interfaces, universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interfaces, mobile industry processor interfaces (mobile industry processor interface, MIPI), general-purpose input/output (GPIO) interfaces, and/or USB interfaces, among others. The USB interface 170 is an interface conforming to the USB standard, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 170 may be used to transfer data between an electronic device and a peripheral device.
It should be understood that the connection relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The external memory interface 160 may be used to connect external memory, such as a removable hard disk, to enable expansion of the storage capabilities of the electronic device. The external memory card communicates with the processor 100 through an external memory interface 160 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 150 of the electronic device may be used to store one or more computer programs, including instructions. The processor 100 may cause the electronic device to perform the methods provided in some embodiments of the present application, as well as various applications, data processing, etc., by executing the above-described instructions stored in the internal memory 150. The internal memory 150 may include a code storage area and a data storage area. Wherein the code storage area may store an operating system. The data storage area may store data created during use of the electronic device, etc. In addition, the internal memory 150 may include high-speed random access memory, and may also include non-volatile memory, such as one or more disk storage units, flash memory units, universal flash memory (universal flash storage, UFS), and the like.
The internal memory 150 may be a read-only memory (ROM), other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc (compact disc read-only memory) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media, or other magnetic storage devices, or any computer readable medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The processor 100 and the internal memory 150 may be combined into a single processing device or, more commonly, be separate components, and the processor 100 is configured to execute the program code stored in the internal memory 150 to implement the methods described in the embodiments of the present application. In particular, the internal memory 150 may also be integrated into the processor or may be separate from the processor.
The power management module 180 is used to power the electronic device.
The power management module 180 is used to connect the battery with the processor 100. The power management module 180 receives battery input to power the processor 100, the internal memory 150, the external memory interface 160, the communication module 120, and the like. The power management module 180 may also be configured to monitor battery capacity, battery cycle times, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 180 may also be provided in the processor 100.
The communication functions of the electronic device may be implemented by the communication module 120, a modem processor, a baseband processor, and the like.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor displays via display 130. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the communication module 120 or other functional modules, independent of the processor 100.
The communication module 120 may provide solutions for wireless communication applied to the electronic device, including wireless local area networks (WLAN) such as wireless fidelity (Wi-Fi) networks, Bluetooth (BT), the global navigation satellite system (GNSS), and the like. The communication module 120 may be one or more devices integrating at least one communication processing module. The communication module 120 receives electromagnetic wave signals, frequency-modulates and filters them, and transmits the processed signals to the processor 100. The communication module 120 may also receive a signal to be transmitted from the processor 100, frequency-modulate and amplify it, and convert it into electromagnetic waves for radiation.
Further, the devices, apparatuses, and modules illustrated in the embodiments of the present application may be implemented by a computer chip or an entity, or by a product having a certain function.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein.
In the several embodiments provided by the present application, if any of the functions is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
Specifically, an embodiment of the present application further provides a computer-readable storage medium having a computer program stored therein which, when run on a computer, causes the computer to perform the method provided by the embodiments of the present application.
An embodiment of the application also provides a computer program product comprising a computer program which, when run on a computer, causes the computer to perform the method provided by the embodiment of the application.
The embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (means), and computer program products according to the embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed via the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the embodiments of the present application, the term "at least one" refers to one or more, and the term "a plurality" refers to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, A and B exist together, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" and similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, "at least one of a, b, and c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c; where a, b, and c may each be singular or plural.
In the embodiments of the present application, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present application are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant details, reference may be made to the description of the method embodiments.
Those of ordinary skill in the art will appreciate that the various units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of the two. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The foregoing is merely an exemplary embodiment of the present application; any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed by the present application shall be covered by the present application. The protection scope of the present application shall be subject to the protection scope of the claims.
The foregoing description of preferred embodiments is not intended to limit the present application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (18)

1. An audio data processing method, applied to an electronic device, the method comprising:
acquiring an audio category corresponding to the acoustic feature to be processed;
obtaining a clustering center vector corresponding to the audio category;
inputting the text and the duration corresponding to the acoustic feature to be processed, together with the clustering center vector, into an audio encoder of an acoustic model, wherein the acoustic model is obtained by model training based on an audio class sample and a clustering center vector sample corresponding to the audio class sample;
and obtaining an output of the acoustic model.
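For illustration only (this is not part of the claims and not the patentee's implementation), the following minimal Python/PyTorch sketch shows one way the flow of claim 1 could be wired up: a toy acoustic model whose audio encoder receives the text (as phoneme IDs), the duration, and the clustering center vector of the audio category. All module names, dimensions, and inputs are hypothetical.
```python
# Illustrative sketch only: a toy acoustic model conditioned on a clustering
# center vector, loosely following claim 1. Everything here is assumed.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    def __init__(self, n_phones=100, d_model=128, n_mels=80, d_cluster=128):
        super().__init__()
        self.text_emb = nn.Embedding(n_phones, d_model)    # text (phoneme) input
        self.cluster_proj = nn.Linear(d_cluster, d_model)  # clustering center vector input
        self.audio_encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.Linear(d_model, n_mels)          # predicts acoustic features (e.g. mel)

    def forward(self, phone_ids, durations, cluster_vec):
        # Expand each phoneme embedding by its duration (in frames).
        x = self.text_emb(phone_ids)                       # (T_phone, d_model)
        x = torch.repeat_interleave(x, durations, dim=0)   # (T_frame, d_model)
        # Condition every frame on the clustering center vector of the audio category.
        x = x + self.cluster_proj(cluster_vec)             # broadcast add
        h, _ = self.audio_encoder(x.unsqueeze(0))          # (1, T_frame, d_model)
        return self.decoder(h).squeeze(0)                  # (T_frame, n_mels)

phones = torch.tensor([3, 17, 42])   # "text" for the acoustic feature to be processed
durs = torch.tensor([4, 6, 5])       # duration of each phoneme, in frames
center = torch.randn(128)            # clustering center vector of the audio category
mel = ToyAcousticModel()(phones, durs, center)
print(mel.shape)                     # torch.Size([15, 80])
```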
2. The method according to claim 1, wherein the acquiring an audio category corresponding to the acoustic feature to be processed comprises:
acquiring the text and the duration corresponding to the acoustic feature to be processed;
averaging the acoustic feature to be processed in combination with the text and the duration to generate an acoustic feature mean;
and generating, based on a clustering algorithm, the audio category corresponding to the acoustic feature to be processed according to the acoustic feature mean.
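Purely as an illustration of claim 2, here is a sketch under assumptions the claim does not state: mel-spectrogram frames as the acoustic feature, per-phoneme durations from the text alignment, and scikit-learn's KMeans as the clustering algorithm. The utterance-level mean is clustered to yield the audio category and its clustering center vector.
```python
# Illustrative sketch only; feature type, dimensions, and clustering choice are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def utterance_mean(mel_frames, durations):
    """Average frames per phoneme (using the text/duration alignment), then
    average the phoneme-level means into one utterance-level mean vector."""
    bounds = np.cumsum(durations)[:-1]
    phone_means = [seg.mean(axis=0) for seg in np.split(mel_frames, bounds)]
    return np.stack(phone_means).mean(axis=0)

rng = np.random.default_rng(0)
# 200 hypothetical utterances: (n_frames, 80) mel frames plus per-phoneme durations.
feats = []
for _ in range(200):
    durs = rng.integers(3, 10, size=rng.integers(5, 20))
    mel = rng.normal(size=(durs.sum(), 80))
    feats.append(utterance_mean(mel, durs))

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(np.stack(feats))
audio_category = km.predict(feats[0][None, :])[0]    # audio category of one utterance
cluster_center_vector = km.cluster_centers_[audio_category]
print(audio_category, cluster_center_vector.shape)   # e.g. 3 (80,)
```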
3. The method of claim 1, wherein the training process of the acoustic model comprises:
acquiring a first audio class sample;
acquiring a first clustering center vector sample corresponding to the first audio class sample;
inputting the first clustering center vector sample to an audio encoder of the acoustic model to obtain a first encoded output for the first clustering center vector sample;
predicting a first predicted audio class corresponding to the first encoded output from the first encoded output;
and training a decoder of the acoustic model according to the first predicted audio class and the first audio class sample.
4. The method according to claim 3, wherein the training the decoder of the acoustic model according to the first predicted audio class and the first audio class sample comprises:
performing loss calculation on the first audio class sample and the first predicted audio class to obtain a loss value;
and training the decoder of the acoustic model according to the loss value.
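The training loop of claims 3 and 4 might be sketched as below, with several assumptions beyond the claims: linear layers stand in for the audio encoder, the class-prediction head, and the decoder; cross-entropy is used as the loss; and an auxiliary reconstruction term is added so the decoder actually receives gradient, since the claims do not specify how the decoder consumes the loss value.
```python
# Illustrative training sketch only; layer choices, loss terms, and sizes are assumed.
import torch
import torch.nn as nn

N_CLASSES, D_CLUSTER, D_ENC, N_MELS = 8, 80, 128, 80

audio_encoder = nn.Linear(D_CLUSTER, D_ENC)   # stands in for the audio encoder
class_head = nn.Linear(D_ENC, N_CLASSES)      # predicts the first predicted audio class
decoder = nn.Linear(D_ENC, N_MELS)            # stands in for the decoder of the acoustic model
params = (list(audio_encoder.parameters()) + list(class_head.parameters())
          + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

# One hypothetical training batch.
center_samples = torch.randn(16, D_CLUSTER)          # first clustering center vector samples
class_samples = torch.randint(0, N_CLASSES, (16,))   # first audio class samples
mel_targets = torch.randn(16, N_MELS)                # assumed acoustic targets (not in the claims)

encoded = audio_encoder(center_samples)                    # first encoded output
pred_logits = class_head(encoded)                          # first predicted audio class, as logits
loss = nn.CrossEntropyLoss()(pred_logits, class_samples)   # loss value of claim 4
loss = loss + nn.functional.mse_loss(decoder(encoded), mel_targets)  # assumed extra term
opt.zero_grad()
loss.backward()
opt.step()                                                 # updates encoder, head, and decoder
print(float(loss))
```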
5. The method according to claim 3, wherein the method further comprises:
calculating a first predicted clustering center vector corresponding to the first predicted audio class according to the first encoded output;
and performing inference with the decoder of the acoustic model according to the first encoded output and the first predicted clustering center vector.
6. The method according to claim 3, wherein the acquiring a first audio class sample comprises:
acquiring sample acoustic features, and a text and a duration corresponding to the sample acoustic features;
averaging the sample acoustic features in combination with the text and the duration to generate an acoustic feature mean;
and generating, based on a clustering algorithm, the first audio class sample and the first clustering center vector sample according to the acoustic feature mean.
7. The method according to any one of claims 1-6, wherein before the acquiring an audio category corresponding to the acoustic feature to be processed, the method further comprises:
acquiring a text, wherein the content of the text comprises a first language and a second language;
splitting the text to generate a plurality of text words and parts of speech corresponding to each text word, which comprises: splitting the text based on the language difference between the first language and the second language;
vectorizing the text words and the parts of speech to obtain vectorization results;
performing prosody prediction calculation on the vectorization result corresponding to the first language to obtain a first prosody prediction result;
performing prosody prediction calculation on the vectorization result corresponding to the second language to obtain a second prosody prediction result;
and mixing the first prosody prediction result and the second prosody prediction result, and outputting a multilingual mixed prosody prediction result.
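As a hedged illustration of claim 7 (not the patented splitter), the sketch below separates mixed Chinese/English text by script with a regular expression, runs a placeholder per-language prosody predictor on each part, and merges the results back in reading order. The regex, the character-level treatment of Chinese, and the dummy predictors are all assumptions.
```python
# Illustrative sketch only: language-aware splitting and per-language prosody
# prediction with made-up placeholder predictors.
import re

def split_by_language(text):
    """Return (word, language) pairs, splitting on the Chinese/English boundary."""
    pieces = []
    for tok in re.findall(r"[\u4e00-\u9fff]+|[A-Za-z']+", text):
        lang = "zh" if re.match(r"[\u4e00-\u9fff]", tok) else "en"
        words = list(tok) if lang == "zh" else [tok]   # treat each Chinese character as a word
        pieces += [(w, lang) for w in words]
    return pieces

def predict_prosody(word, lang):
    # Stand-in for the per-language prosody prediction; a real system would
    # run the language-specific prosody model here.
    return {"zh": "#1", "en": "#2"}[lang]

def mixed_prosody(text):
    out = []
    for word, lang in split_by_language(text):
        out.append((word, lang, predict_prosody(word, lang)))
    return out   # multilingual mixed prosody prediction result, in reading order

print(mixed_prosody("今天 weather 很好 indeed"))
```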
8. The method of claim 7, wherein the splitting the text to generate a plurality of text words and parts of speech corresponding to each text word further comprises:
adding a corresponding language identifier to each of the text words.
9. The method of claim 7, wherein the vectorizing the text words and the parts of speech to obtain vectorization results comprises:
performing text vectorization on the text words to generate text vectors;
performing language identifier vectorization on the text words to generate language vectors;
and performing part-of-speech vectorization on the parts of speech to generate part-of-speech vectors.
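Claim 9's three-way vectorization can be pictured, for example, as three embedding tables whose outputs are concatenated per text word; the vocabulary sizes, dimensions, and the concatenation choice below are illustrative assumptions only.
```python
# Illustrative sketch only: text, language-identifier, and part-of-speech embeddings.
import torch
import torch.nn as nn

class WordVectorizer(nn.Module):
    def __init__(self, n_words=5000, n_langs=3, n_pos=40, d=32):
        super().__init__()
        self.text_emb = nn.Embedding(n_words, d)   # text vector
        self.lang_emb = nn.Embedding(n_langs, d)   # language vector (from the language identifier)
        self.pos_emb = nn.Embedding(n_pos, d)      # part-of-speech vector

    def forward(self, word_ids, lang_ids, pos_ids):
        return torch.cat([self.text_emb(word_ids),
                          self.lang_emb(lang_ids),
                          self.pos_emb(pos_ids)], dim=-1)

vec = WordVectorizer()
word_ids = torch.tensor([10, 523, 7])   # hypothetical IDs for three text words
lang_ids = torch.tensor([0, 1, 0])      # 0 = first language, 1 = second language
pos_ids = torch.tensor([5, 12, 8])      # part-of-speech tag per text word
print(vec(word_ids, lang_ids, pos_ids).shape)   # torch.Size([3, 96])
```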
10. The method of claim 7, wherein the performing prosody prediction calculation on the vectorization result corresponding to the first language to obtain a first prosody prediction result comprises:
calculating the vectorization result corresponding to the first language through a bidirectional long short-term memory network to generate a first prediction vector;
calculating the first prediction vector through a fuzzy neural network to generate a first training vector;
and calculating the first training vector through a normalized exponential function to generate the first prosody prediction result.
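A minimal sketch of the claim 10 pipeline, under stated assumptions: a bidirectional LSTM produces the prediction vectors, a very small fuzzy-style layer (Gaussian membership functions followed by a linear rule layer, one common formulation and not necessarily the patentee's) produces the training vectors, and softmax (the normalized exponential function) yields the prosody prediction. All sizes are made up.
```python
# Illustrative sketch only: BiLSTM -> tiny fuzzy-style layer -> softmax.
import torch
import torch.nn as nn

class TinyFuzzyLayer(nn.Module):
    """Gaussian membership functions plus a linear 'rule' layer. A product
    T-norm is more typical for rule strength; the mean is used here only to
    keep this toy example numerically tame."""
    def __init__(self, d_in, n_rules, d_out):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_rules, d_in))
        self.log_sigma = nn.Parameter(torch.zeros(n_rules, d_in))
        self.out = nn.Linear(n_rules, d_out)

    def forward(self, x):                                 # x: (..., d_in)
        diff = x.unsqueeze(-2) - self.centers             # (..., n_rules, d_in)
        member = torch.exp(-0.5 * (diff / self.log_sigma.exp()) ** 2)
        return self.out(member.mean(dim=-1))              # rule strengths -> output

d_word, d_hid, n_prosody = 96, 64, 4                      # e.g. 4 prosodic break levels (assumed)
bilstm = nn.LSTM(d_word, d_hid, batch_first=True, bidirectional=True)
fuzzy = TinyFuzzyLayer(2 * d_hid, n_rules=16, d_out=n_prosody)

word_vectors = torch.randn(1, 7, d_word)                  # vectorization result for one sentence
prediction_vec, _ = bilstm(word_vectors)                  # first prediction vector
training_vec = fuzzy(prediction_vec)                      # first training vector
prosody = torch.softmax(training_vec, dim=-1)             # first prosody prediction result
print(prosody.shape)                                      # torch.Size([1, 7, 4])
```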
11. The method of claim 7, wherein the content of the text further comprises a third language;
the splitting the text to generate a plurality of text words and parts of speech corresponding to each text word comprises: splitting the text based on the language differences among the first language, the second language, and the third language;
and the method further comprises:
performing prosody prediction calculation on the vectorization result corresponding to the third language to obtain a third prosody prediction result;
and mixing the first prosody prediction result, the second prosody prediction result, and the third prosody prediction result, and outputting the multilingual mixed prosody prediction result.
12. A model training method, wherein the method is applied to an electronic device, the method comprising:
acquiring a first audio class sample;
acquiring a first clustering center vector sample corresponding to the first audio class sample;
inputting the first clustering center vector sample to an audio encoder of an acoustic model to obtain a first encoded output for the first clustering center vector sample;
predicting a first predicted audio class corresponding to the first encoded output from the first encoded output;
and training a decoder of the acoustic model according to the first predicted audio class and the first audio class sample.
13. The method of claim 12, wherein the training the decoder of the acoustic model according to the first predicted audio class and the first audio class sample comprises:
performing loss calculation on the first audio class sample and the first predicted audio class to obtain a loss value;
and training the decoder of the acoustic model according to the loss value.
14. The method according to claim 12, wherein the method further comprises:
calculating a first predicted clustering center vector corresponding to the first predicted audio class according to the first encoded output;
and performing inference with the decoder of the acoustic model according to the first encoded output and the first predicted clustering center vector.
15. The method of claim 12, wherein the acquiring a first audio class sample comprises:
acquiring sample acoustic features, and a text and a duration corresponding to the sample acoustic features;
averaging the sample acoustic features in combination with the text and the duration to generate an acoustic feature mean;
and generating, based on a clustering algorithm, the first audio class sample and the first clustering center vector sample according to the acoustic feature mean.
16. An electronic device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps of any of claims 1-11.
17. An electronic device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps of any of claims 12-15.
18. A computer-readable storage medium, wherein the computer-readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to perform the method according to any one of claims 1-15.
CN202211145922.2A 2022-09-20 2022-09-20 Audio data processing method, model training method, electronic device, and storage medium Active CN116665636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211145922.2A CN116665636B (en) 2022-09-20 2022-09-20 Audio data processing method, model training method, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN116665636A true CN116665636A (en) 2023-08-29
CN116665636B CN116665636B (en) 2024-03-12

Family

ID=87717687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211145922.2A Active CN116665636B (en) 2022-09-20 2022-09-20 Audio data processing method, model training method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN116665636B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7136816B1 (en) * 2002-04-05 2006-11-14 At&T Corp. System and method for predicting prosodic parameters
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
US20220165249A1 (en) * 2019-04-03 2022-05-26 Beijing Jingdong Shangke Inforation Technology Co., Ltd. Speech synthesis method, device and computer readable storage medium

Also Published As

Publication number Publication date
CN116665636B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
Tokuda et al. Speech synthesis based on hidden Markov models
Yoshimura Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
CN103165126A (en) Method for voice playing of mobile phone text short messages
Dines et al. Measuring the gap between HMM-based ASR and TTS
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN115620699B (en) Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
Shiga et al. Text-to-speech synthesis
Indumathi et al. Survey on speech synthesis
Zen et al. Recent development of the HMM-based speech synthesis system (HTS)
Pamisetty et al. Prosody-tts: An end-to-end speech synthesis system with prosody control
US20240161727A1 (en) Training method for speech synthesis model and speech synthesis method and related apparatuses
CN116601702A (en) End-to-end nervous system for multi-speaker and multi-language speech synthesis
Sheikhan Generation of suprasegmental information for speech using a recurrent neural network and binary gravitational search algorithm for feature selection
CN116665636B (en) Audio data processing method, model training method, electronic device, and storage medium
CN112242134A (en) Speech synthesis method and device
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
JP2022133447A (en) Speech processing method and device, electronic apparatus, and storage medium
Narendra et al. Parameterization of excitation signal for improving the quality of HMM-based speech synthesis system
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
Yin An overview of speech synthesis technology
Chen et al. A Mandarin Text-to-Speech System
Tsiakoulis et al. An overview of the ILSP unit selection text-to-speech synthesis system
Gujarathi et al. Review on unit selection-based concatenation approach in text to speech synthesis system
Yu Review of F0 modelling and generation in HMM based speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant