CN112800748B - Phoneme prediction method, device, equipment and storage medium suitable for polyphones - Google Patents


Info

Publication number
CN112800748B (application CN202110342957.4A)
Authority
CN
China
Prior art keywords
training
phoneme
polyphone
model
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110342957.4A
Other languages
Chinese (zh)
Other versions
CN112800748A (en)
Inventor
苏雪琦
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110342957.4A
Publication of CN112800748A
Application granted
Publication of CN112800748B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a phoneme prediction method, device, equipment and storage medium suitable for polyphones. The method comprises the following steps: performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on text data to be predicted to obtain preprocessed text data; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM model training method, and the MLM model training method comprises: a polyphone masking training method, a polyphone random-pinyin-substitution training method, and a training method with neither masking nor substitution; and obtaining the target phoneme prediction result output by the phoneme prediction model. The coverage and intelligence of phoneme prediction are improved, and maintenance cost is reduced.

Description

Phoneme prediction method, device, equipment and storage medium suitable for polyphones
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for predicting phonemes for polyphones.
Background
Grapheme-to-Phoneme (G2P) conversion is one of the key links in the text front end of a Chinese TTS (text-to-speech) system; its function is to convert Chinese characters, which are logographic, into phonemes that an acoustic model can recognize. Because Chinese contains a large number of polyphonic characters (characters with more than one pronunciation), the G2P link conventionally assigns the correct phonemes to polyphones using rule-based methods such as word-list lookup and regular-expression matching. However, rule-based approaches have the following problems: (1) insufficient coverage and lack of intelligence, since gaps can usually be fixed only by manually adding word lists or rules, making slow service response hard to avoid; (2) high maintenance cost, since rules can hardly cover pronunciations in every specific context comprehensively.
Disclosure of Invention
The main purpose of the application is to provide a phoneme prediction method, device, equipment and storage medium suitable for polyphones, aiming to solve the prior-art problems of insufficient coverage, lack of intelligence and high maintenance cost when accurately assigning correct phonemes to polyphones.
In order to achieve the above object, the present application proposes a phoneme prediction method suitable for polyphones, the method comprising:
Acquiring text data to be predicted;
performing sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM model training method, and the MLM model training method comprises: a polyphone masking training method, a polyphone random-pinyin-substitution training method, and a training method with neither masking nor substitution;
and obtaining a target phoneme prediction result output by the phoneme prediction model.
Further, before the step of inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, the method further includes:
obtaining the plurality of training samples, wherein each training sample of the plurality of training samples comprises: text sample data and phoneme calibration data, wherein the text sample data comprises at least one polyphone;
dividing the plurality of training samples by adopting a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set;
performing polyphone mask training on an initial model using the first training sample set, obtaining the mask-trained initial model when training is finished, wherein the initial model is a model obtained based on the Bert model;
performing polyphone random-pinyin-substitution training on the mask-trained initial model using the second training sample set, obtaining the substitution-trained initial model when training is finished;
and training the substitution-trained initial model using the third training sample set, obtaining the phoneme prediction model when training is finished.
Further, the step of obtaining the plurality of training samples includes:
acquiring a plurality of pieces of text data to be processed, wherein each piece of text data to be processed comprises at least one polyphone;
performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on each piece of text data to be processed, obtaining the text sample data corresponding to each piece of text data to be processed;
and performing pinyin calibration on each text sample data, obtaining the phoneme calibration data corresponding to each text sample data, wherein the phoneme calibration data carries polyphone marks.
Further, the step of dividing the plurality of training samples by using a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set includes:
and dividing the plurality of training samples by adopting a dividing rule of 8:1:1 to obtain the first training sample set, the second training sample set and the third training sample set.
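The 8:1:1 division into three disjoint sets can be sketched as follows; the function name, the shuffling, and the seed are illustrative assumptions, not details given by the application.

```python
import random

def split_samples(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle and split training samples by the 8:1:1 division rule.

    Each sample lands in exactly one of the three returned sets,
    matching the requirement that a sample belongs to only one set.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    total = sum(ratios)
    n = len(shuffled)
    first_end = n * ratios[0] // total
    second_end = first_end + n * ratios[1] // total
    return (shuffled[:first_end],
            shuffled[first_end:second_end],
            shuffled[second_end:])
```

With 100 samples this yields sets of 80, 10 and 10 samples respectively.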
Further, the step of performing polyphone mask training on the initial model using the first training sample set and obtaining the mask-trained initial model when training is finished includes:
obtaining a training sample from the first training sample set as a first training sample;
inputting the text sample data of the first training sample into the initial model, which performs per-character pinyin prediction with the pinyin of each polyphone masked (the polyphones being located via the polyphone marks of the phoneme calibration data of the first training sample) and predicts the polyphone phonemes, obtaining a first sample phoneme prediction value;
extracting the phoneme prediction values of the polyphones from the first sample phoneme prediction value based on the polyphone marks of the phoneme calibration data of the first training sample, obtaining a first polyphone phoneme prediction value;
extracting the phoneme calibration values of the polyphones from the phoneme calibration data of the first training sample based on the same polyphone marks, obtaining a first polyphone phoneme calibration value;
inputting the first polyphone phoneme prediction value and the first polyphone phoneme calibration value into a first loss function, obtaining a first loss value of the initial model, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sample phoneme prediction value;
repeating the step of obtaining a training sample from the first training sample set until the first loss value reaches a first convergence condition or the number of iterations reaches the number of training samples in the first training sample set, and determining the initial model at that point as the mask-trained initial model;
wherein the first loss function is a cross-entropy loss function.
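The extract-then-score step above (take predicted and calibrated phoneme values only at the polyphone marks, then apply cross entropy) can be sketched in pure Python; the argument shapes and names are illustrative assumptions, not the application's exact implementation.

```python
import math

def polyphone_cross_entropy(pred_probs, target_ids, polyphone_flags):
    """Cross-entropy loss computed only at polyphone positions.

    pred_probs:      list of per-character phoneme probability lists
    target_ids:      list of calibrated phoneme ids, one per character
    polyphone_flags: list of 0/1 flags marking polyphone positions
    """
    # Extract the predicted probability of the calibrated phoneme at
    # each polyphone position, then average the negative log-likelihoods.
    losses = [-math.log(pred_probs[i][target_ids[i]] + 1e-12)
              for i, flag in enumerate(polyphone_flags) if flag]
    return sum(losses) / len(losses)
```

Non-polyphone characters contribute nothing to the loss, which matches the text's focus on polyphone phoneme prediction.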
Further, the step of performing polyphone random-pinyin-substitution training on the mask-trained initial model using the second training sample set and obtaining the substitution-trained initial model when training is finished includes:
obtaining a training sample from the second training sample set as a second training sample;
inputting the text sample data of the second training sample into the mask-trained initial model, which performs per-character pinyin prediction with the pinyin of each polyphone replaced by randomly chosen pinyin (the polyphones being located via the polyphone marks of the phoneme calibration data of the second training sample) and predicts the polyphone phonemes, obtaining a second sample phoneme prediction value;
extracting the phoneme prediction values of the polyphones from the second sample phoneme prediction value based on the polyphone marks of the phoneme calibration data of the second training sample, obtaining a second polyphone phoneme prediction value;
extracting the phoneme calibration values of the polyphones from the phoneme calibration data of the second training sample based on the same polyphone marks, obtaining a second polyphone phoneme calibration value;
inputting the second polyphone phoneme prediction value and the second polyphone phoneme calibration value into a second loss function, obtaining a second loss value of the mask-trained initial model, updating the parameters of the mask-trained initial model according to the second loss value, and using the updated model for the next calculation of the second sample phoneme prediction value;
repeating the step of obtaining a training sample from the second training sample set until the second loss value reaches a second convergence condition or the number of iterations reaches the number of training samples in the second training sample set, and determining the model at that point as the substitution-trained initial model;
wherein the second loss function is a cross-entropy loss function.
Further, the step of training the substitution-trained initial model using the third training sample set and obtaining the phoneme prediction model when training is finished includes:
obtaining a training sample from the third training sample set as a third training sample;
inputting the text sample data of the third training sample into the substitution-trained initial model for phoneme prediction, obtaining a third sample phoneme prediction value;
extracting the phoneme prediction values of the polyphones from the third sample phoneme prediction value based on the polyphone marks of the phoneme calibration data of the third training sample, obtaining a third polyphone phoneme prediction value;
extracting the phoneme calibration values of the polyphones from the phoneme calibration data of the third training sample based on the same polyphone marks, obtaining a third polyphone phoneme calibration value;
inputting the third polyphone phoneme prediction value and the third polyphone phoneme calibration value into a third loss function, obtaining a third loss value of the substitution-trained initial model, updating the parameters of the substitution-trained initial model according to the third loss value, and using the updated model for the next calculation of the third sample phoneme prediction value;
repeating the step of obtaining a training sample from the third training sample set until the third loss value reaches a third convergence condition or the number of iterations reaches the number of training samples in the third training sample set, and determining the model at that point as the phoneme prediction model;
wherein the third loss function is a cross-entropy loss function.
The application also proposes a phoneme prediction device suitable for polyphones, said device comprising:
the data acquisition module is used for acquiring text data to be predicted;
the preprocessing module is used for carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
the phoneme prediction module is configured to input the preprocessed text data into a phoneme prediction model for phoneme prediction, where the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM model training method, and the MLM model training method comprises: a polyphone masking training method, a polyphone random-pinyin-substitution training method, and a training method with neither masking nor substitution;
And the target phoneme prediction result determining module is used for obtaining a target phoneme prediction result output by the phoneme prediction model.
The present application also proposes a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
The present application also proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method of any of the above.
According to the phoneme prediction method, device, equipment and storage medium suitable for polyphones, sentence structure analysis, text regularization, word segmentation and part-of-speech prediction are first performed on the text data to be predicted to obtain the preprocessed text data, which improves the accuracy of phoneme prediction. The preprocessed text data is then input into a phoneme prediction model for phoneme prediction, the model being obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM model training method that comprises a polyphone masking training method, a polyphone random-pinyin-substitution training method, and a training method with neither masking nor substitution. By virtue of the Bert model's strong contextual semantic recognition capability, no word lists or rules need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with polyphone-containing samples under the MLM training method improves the prediction of polyphone phonemes in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
Drawings
FIG. 1 is a flowchart illustrating a phoneme prediction method suitable for polyphones according to an embodiment of the present application;
FIG. 2 is a block diagram showing a structure of a phoneme prediction apparatus suitable for polyphones according to an embodiment of the present application;
Fig. 3 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
To solve the prior-art problems that rule-based methods cannot accurately assign correct phonemes to polyphones and suffer from insufficient coverage, lack of intelligence and high maintenance cost, the application provides a phoneme prediction method suitable for polyphones, applied in the technical field of artificial intelligence. In this method, sentence structure analysis, text regularization, word segmentation and part-of-speech prediction are performed in sequence on the text data to be predicted, which is then input into a phoneme prediction model for phoneme prediction. The phoneme prediction model is obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM model training method; thus, by virtue of the Bert model's strong contextual semantic recognition capability, no word lists or rules need to be added manually, the coverage and intelligence of polyphone phoneme prediction are improved, the prediction of polyphone phonemes in complex contexts is improved, and maintenance cost is reduced.
Referring to fig. 1, in an embodiment of the present application, a phoneme prediction method suitable for polyphones is provided, where the method includes:
S1: acquiring text data to be predicted;
S2: performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
S3: inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM model training method, and the MLM model training method comprises: a polyphone masking training method, a polyphone random-pinyin-substitution training method, and a training method with neither masking nor substitution;
S4: obtaining the target phoneme prediction result output by the phoneme prediction model.
In this embodiment, sentence structure analysis, text regularization, word segmentation and part-of-speech prediction are first performed on the text data to be predicted to obtain the preprocessed text data, improving the accuracy of phoneme prediction. The preprocessed text data is then input into the phoneme prediction model for phoneme prediction; by virtue of the Bert model's strong contextual semantic recognition capability, no word lists or rules need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction, while training with polyphone-containing samples under the MLM training method improves the prediction of polyphone phonemes in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
For S1, the text data to be predicted may be obtained from user input or from a third-party application system.
The text data to be predicted is text for which phoneme prediction is required; it contains Chinese characters.
For S2, sentence structure analysis is performed on the text data to be predicted to obtain text data to be regularized; text regularization is performed on the text data to be regularized to obtain text data to be segmented; word segmentation is performed on the text data to be segmented to obtain segmented text data; and part-of-speech prediction is performed on the segmented text data to obtain the preprocessed text data.
And sentence structure analysis, which is used for dividing the text data to be predicted into sentences. Alternatively, sentence structure analysis may be implemented using a neural network training based model.
Text regularization converts non-Chinese punctuation or numerals in the text data to be regularized into their Chinese expressions for the Chinese context. For example, regularizing the text data "6.5" yields the text data "six point five"; the example is not a specific limitation here. Optionally, text regularization may be implemented using a model obtained by neural network training.
Word segmentation splits the sentences in the text data to be segmented according to semantics, keeping the Chinese characters of one word together. Optionally, word segmentation may be implemented using a model obtained by neural network training.
Part-of-speech prediction predicts the part of speech of each word in the segmented text data. The parts of speech include: nouns, verbs, adjectives, numerals and quantifiers, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeia. Optionally, part-of-speech prediction may be implemented using a model obtained by neural network training.
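The four preprocessing steps above can be chained as a minimal sketch. The four callables are placeholder assumptions standing in for the neural-network-based components the text mentions, not APIs defined by the application.

```python
def preprocess(text, split_sentences, regularize, segment, tag_pos):
    """Run the preprocessing chain in the order given: sentence
    structure analysis -> text regularization -> word segmentation
    -> part-of-speech prediction."""
    result = []
    for sentence in split_sentences(text):  # sentence structure analysis
        regular = regularize(sentence)      # text regularization
        words = segment(regular)            # word segmentation
        result.append(tag_pos(words))       # part-of-speech prediction
    return result
```

A toy invocation with trivial stand-ins, mirroring the "6.5" regularization example:

```python
out = preprocess(
    "he is 6.5|she ran",
    lambda t: t.split("|"),
    lambda s: s.replace("6.5", "six point five"),
    str.split,
    lambda ws: [(w, "POS") for w in ws],
)
```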
For S3, phonemes are the minimum phonetic units divided according to the natural attributes of speech; analyzed by the pronunciation actions within a syllable, one action constitutes one phoneme.
The preprocessed text data is input into the phoneme prediction model for phoneme prediction, so that the phoneme of every Chinese character in the preprocessed text data is predicted. It will be appreciated that phoneme prediction amounts to predicting the pinyin of the Chinese text. For example, the Chinese text for "length, width, and height" is converted into the phonemes "chang2 kuan1 gao1"; the example is not a specific limitation here.
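To make the input/output format concrete, a context-free table lookup is sketched below using the "length, width, height" example (长宽高); the table entries are illustrative. The application's point is precisely that such lookup fails for polyphones, since 长 also reads zhang3 in other contexts, which is why a context-aware model is used instead.

```python
# Toy character-to-phoneme table; entries are illustrative assumptions.
TOY_G2P = {"长": "chang2", "宽": "kuan1", "高": "gao1"}

def naive_g2p(text):
    """Context-free lookup: map each Chinese character to one phoneme.

    This ignores context entirely, so a polyphone always gets the same
    pinyin regardless of its surrounding words.
    """
    return " ".join(TOY_G2P[ch] for ch in text)
```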
Optionally, the Bert (Bidirectional Encoder Representations from Transformers) model adopted in the application is Google's open-source Bert model, an end-to-end seq2seq model that outputs predicted text after receiving text and therefore has good contextual semantic understanding capability.
MLM is short for Masked Language Model.
An initial model is obtained based on the Bert model, a plurality of training samples containing polyphones are obtained, and the initial model is trained on those samples successively with the polyphone masking training method, the polyphone random-pinyin-substitution training method, and the training method with neither masking nor substitution; the phoneme prediction model is obtained when training is finished. The phoneme prediction model thus inherits the Bert model's strong contextual semantic recognition capability, so no word lists or rules need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with polyphone-containing samples under the MLM training method improves the prediction of polyphone phonemes in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
And S4, obtaining a phoneme prediction result output by the phoneme prediction model, and taking the obtained phoneme prediction result as a target phoneme prediction result corresponding to the text data to be predicted.
After the step of obtaining the target phoneme prediction result output by the phoneme prediction model, the method further comprises the following steps:
s5: performing prosody prediction on the target phoneme prediction result to obtain a prosody prediction result;
s6: and inputting the prosody prediction result into an acoustic model of the TTS system.
It will be appreciated that steps S1 to S6 may be implemented as a text front end system of a TTS system.
The method for prosody prediction of the target phoneme prediction result may be selected from the prior art, and will not be described herein.
In one embodiment, before the step of inputting the preprocessed text data into the phoneme prediction model for phoneme prediction, the method further includes:
S021: obtaining the plurality of training samples, wherein each training sample comprises: text sample data and phoneme calibration data, and the text sample data comprises at least one polyphone;
S022: dividing the plurality of training samples using a preset division rule to obtain a first training sample set, a second training sample set and a third training sample set;
S023: performing polyphone mask training on an initial model using the first training sample set, obtaining the mask-trained initial model when training is finished, wherein the initial model is a model obtained based on the Bert model;
S024: performing polyphone random-pinyin-substitution training on the mask-trained initial model using the second training sample set, obtaining the substitution-trained initial model when training is finished;
S025: training the substitution-trained initial model using the third training sample set, obtaining the phoneme prediction model when training is finished.
This embodiment realizes a phoneme prediction model obtained by training based on the Bert model, a plurality of training samples containing polyphones and the MLM model training method, which improves the prediction of polyphone phonemes in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
For S021, a plurality of training samples may be obtained from the database, a plurality of training samples input by the user may be obtained, or a plurality of training samples sent by the third party application system may be obtained.
Each training sample includes a text sample data and a phoneme calibration data.
Within one training sample, the phoneme calibration data is the calibrated phoneme of each Chinese character in the text sample data. For example, for the text sample data meaning "length, width, height", the phoneme calibration data is "chang2# kuan1 gao1"; the example is not a specific limitation here.
It will be appreciated that the phoneme calibration data carries polyphone marks. For example, in the phoneme calibration data "chang2# kuan1 gao1", the # is a polyphone mark; the example is not a specific limitation here.
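A training sample as described above (text sample data plus phoneme calibration data carrying polyphone marks) can be modeled as a small data structure. The '#'-suffix mark format is assumed from the example; the class and method names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One training sample: text sample data and phoneme calibration
    data, where a '#' suffix on a pinyin marks a polyphone."""
    text: str
    phoneme_calibration: str

    def polyphone_positions(self):
        # Indices of the calibrated pinyins carrying the polyphone mark.
        return [i for i, p in enumerate(self.phoneme_calibration.split())
                if p.endswith("#")]
```

These positions are what the training steps use to extract polyphone prediction and calibration values.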
For S022, each of the plurality of training samples is assigned by a preset division rule to exactly one of the first training sample set, the second training sample set and the third training sample set; that is, no training sample appears in more than one of the three sets.
The preset division rule includes, but is not limited to, a preset division ratio.
The preset division rule may be obtained from a database, obtained from a third-party application system, or written into the program file implementing the present application.
For S023, the training samples in the first training sample set are used in turn to perform polyphone mask training on the initial model, and the trained model serves as the initial model after polyphone mask training. Polyphone mask training masks the single-character pinyin corresponding to each polyphone during phoneme prediction.
For S024, the training samples in the second training sample set are used in turn to perform polyphone random pinyin substitution training on the initial model after polyphone mask training, and the trained model serves as the initial model after polyphone random pinyin substitution training. Polyphone random pinyin substitution training replaces the single-character pinyin corresponding to each polyphone with a random pinyin during phoneme prediction.
For S025, the training samples in the third training sample set are used in turn to train the initial model after polyphone random pinyin substitution training, and the trained model serves as the phoneme prediction model; in this stage the single-character pinyin corresponding to polyphones is neither masked nor substituted during phoneme prediction.
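The three-stage schedule of S023–S025 can be sketched as a simple chain. This is an illustrative outline only; `train_stage` is a hypothetical helper standing in for one full MLM training pass and is not defined by the patent:

```python
def train_phoneme_model(model, first_set, second_set, third_set, train_stage):
    """Run the three MLM stages in order: polyphone mask training,
    random pinyin substitution training, then plain training with
    neither masking nor substitution."""
    model = train_stage(model, first_set, mode="mask")
    model = train_stage(model, second_set, mode="substitute")
    model = train_stage(model, third_set, mode="plain")
    return model

# Record the order of stages with a stub train_stage.
stages = []
train_phoneme_model("init", [1], [2], [3],
                    lambda m, s, mode: stages.append(mode) or m)
# stages = ['mask', 'substitute', 'plain']
```

The point of the chain is that each stage receives the model produced by the previous stage, matching "the initial model after polyphone mask training" feeding S024 and so on.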
In one embodiment, the step of obtaining the plurality of training samples includes:
S0211: acquiring a plurality of pieces of text data to be processed, wherein each piece of text data to be processed includes at least one polyphone;
S0212: performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on each piece of text data to be processed to obtain the text sample data corresponding to each piece of text data to be processed;
S0213: performing pinyin calibration on each piece of text sample data to obtain the phoneme calibration data corresponding to it, wherein the phoneme calibration data carries polyphone labels.
This embodiment obtains text sample data by performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on the text data to be processed, which helps improve the accuracy of the trained phoneme prediction model.
For S0211, the plurality of pieces of text data to be processed may be obtained from a database, input by a user, or sent by a third-party application system.
For S0212, one piece of text data to be processed is extracted from the plurality of pieces of text data to be processed as the target text data to be processed; sentence structure analysis is performed on the target text data to be processed to obtain sample data to be regularized; text regularization is performed on the sample data to be regularized to obtain sample data to be segmented; word segmentation is performed on the sample data to be segmented to obtain segmented sample data; part-of-speech prediction is performed on the segmented sample data to obtain the text sample data corresponding to the target text data to be processed; and the extraction step is repeated until the text sample data corresponding to every piece of text data to be processed has been determined.
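The per-text chain in S0212 is four transformations applied in sequence. A minimal sketch follows, where the four callables are hypothetical stand-ins for whatever NLP components implement sentence structure analysis, text regularization, word segmentation and part-of-speech prediction:

```python
def preprocess(text, analyze, regularize, segment, tag_pos):
    """Apply the S0212 steps in order to one piece of raw text."""
    for step in (analyze, regularize, segment, tag_pos):
        text = step(text)
    return text

# Trace the order of application with labelling stubs.
trace = preprocess("raw",
                   lambda t: t + ">analyzed",
                   lambda t: t + ">regularized",
                   lambda t: t + ">segmented",
                   lambda t: t + ">tagged")
# trace = 'raw>analyzed>regularized>segmented>tagged'
```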
For S0213, one piece of text sample data is extracted from the text sample data corresponding to the plurality of pieces of text data to be processed as the text sample data to be calibrated; pinyin calibration is performed on the text sample data to be calibrated to obtain its corresponding phoneme calibration data; and the extraction step is repeated until the phoneme calibration data corresponding to every piece of text sample data has been determined.
In an embodiment, the step of dividing the plurality of training samples by using a preset division rule to obtain a first training sample set, a second training sample set and a third training sample set includes:
dividing the plurality of training samples by a division rule of 8:1:1 to obtain the first training sample set, the second training sample set and the third training sample set.
Dividing the training samples at a ratio of 8:1:1 means that polyphone mask training uses most of the samples, while polyphone random pinyin substitution training and non-mask, non-substitution training each use a small share, which improves the robustness of the trained model.
Specifically, 80% of the training samples are placed in the first training sample set, 10% in the second training sample set, and 10% in the third training sample set.
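The 8:1:1 rule can be sketched as follows. This is an assumed implementation (shuffling and the `seed` parameter are illustrative choices, not specified by the patent):

```python
import random

def split_samples(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split samples into three disjoint sets by the preset 8:1:1 rule."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n1 = int(len(shuffled) * ratios[0])
    n2 = int(len(shuffled) * ratios[1])
    return shuffled[:n1], shuffled[n1:n1 + n2], shuffled[n1 + n2:]

first, second, third = split_samples(list(range(100)))
# len(first) = 80, len(second) = 10, len(third) = 10
```

Because the slices are contiguous, every sample lands in exactly one of the three sets, matching the division rule in S022.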
In one embodiment, the step of performing polyphone mask training on the initial model by using the first training sample set to obtain the initial model after polyphone mask training includes:
S0231: obtaining a training sample from the first training sample set to obtain a first training sample;
s0232: inputting the text sample data of the first training sample into the initial model for single-word pinyin prediction based on the multi-word marks of the phoneme calibration data of the first training sample, masking single-word pinyin corresponding to multi-word and multi-word phoneme prediction to obtain a first sample phoneme prediction value;
s0233: extracting a phoneme predicted value of a polyphone from the first sample phoneme predicted value based on the polyphone mark of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme predicted value;
s0234: extracting a phoneme calibration value of a polyphone from the phoneme calibration data of the first training sample based on the polyphone marks of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme calibration value;
s0235: inputting the first polyphone phoneme predicted value and the first polyphone phoneme calibration value into a first loss function for calculation to obtain a first loss value of the initial model, updating parameters of the initial model according to the first loss value, and using the updated initial model for calculating the first sample phoneme predicted value next time;
S0236: repeating the step of obtaining a training sample from the first training sample set until the first loss value reaches a first convergence condition or the number of iterations reaches the number of training samples in the first training sample set, and determining the initial model of which the first loss value reaches the first convergence condition or the number of iterations reaches the number of training samples in the first training sample set as the initial model after the multi-tone word mask training;
wherein the first loss function employs a cross entropy loss function.
By performing polyphone mask training on the initial model, this embodiment improves the model's ability to recognize context semantics; no vocabulary or rules need to be maintained manually, which improves the coverage and intelligence of polyphone phoneme prediction.
For S0231, sequentially obtaining a training sample from the first training sample set, and taking the obtained training sample as the first training sample.
For S0232, the text sample data of the first training sample is input into the initial model, which performs phoneme prediction in the order of single-character pinyin prediction, masking of the single-character pinyin corresponding to polyphones, and polyphone phoneme prediction, where the masking is based on the polyphone labels of the phoneme calibration data of the first training sample. For example, the text sample data of the first training sample is "I have long hair"; single-character pinyin prediction yields "[CLS] wo3 you4 chang2 tou2 fa4 le5"; the phoneme calibration data of the first training sample is "wo3 you4 zhang3# tou2 fa4 le5"; and masking the single-character pinyin corresponding to the polyphone based on the polyphone label yields "[CLS] wo3 you4 [mask] tou2 fa4 le5", which is not specifically limited herein.
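The masking step in this example can be shown directly on a token list. A minimal sketch, assuming the polyphone positions are already known from the calibration labels; the function and mask token spelling follow the example above:

```python
def mask_polyphones(pinyin_tokens, polyphone_positions, mask_token="[mask]"):
    """Replace the pinyin at each polyphone position with a mask token,
    mirroring the masking described for S0232."""
    marked = set(polyphone_positions)
    return [mask_token if i in marked else t
            for i, t in enumerate(pinyin_tokens)]

masked = mask_polyphones(["wo3", "you4", "chang2", "tou2", "fa4", "le5"], [2])
# masked = ['wo3', 'you4', '[mask]', 'tou2', 'fa4', 'le5']
```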
For S0233, based on the position of the polyphone label in the phoneme calibration data of the first training sample, the phoneme predicted value of the polyphone is extracted from the first sample phoneme predicted value as the first polyphone phoneme predicted value. For example, the first sample phoneme predicted value is "[CLS] wo3 you4 zhang3 tou2 fa4 le5" and the phoneme calibration data of the first training sample is "wo3 you4 zhang3# tou2 fa4 le5"; the polyphone label sits at the 3rd character, so "zhang3" is extracted as the first polyphone phoneme predicted value, which is not specifically limited herein.
For S0234, the phoneme calibration value of the polyphone is extracted from the phoneme calibration data of the first training sample based on its polyphone labels, as the first polyphone phoneme calibration value. For example, in the phoneme calibration data "wo3 you4 zhang3# tou2 fa4 le5", "#" is the polyphone label, so the labeled phoneme "zhang3" serves as the first polyphone phoneme calibration value, which is not specifically limited herein.
For S0235, since the first loss function is a cross entropy loss function, the first polyphone phoneme predicted value and the first polyphone phoneme calibration value may be input into it using any calculation method known in the art, which is not described herein.
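For a single polyphone position, cross entropy reduces to the negative log probability the model assigns to the calibrated phoneme. A worked toy calculation (the candidate set and probabilities are invented for illustration):

```python
import math

def cross_entropy(pred_probs, target_index):
    """Cross-entropy loss at one polyphone position: the negative log
    probability assigned to the calibrated phoneme."""
    return -math.log(pred_probs[target_index])

# Toy distribution over two candidate phonemes, e.g. ['zhang3', 'chang2'];
# the calibration value is index 0 ('zhang3').
loss = cross_entropy([0.9, 0.1], 0)
# loss = -ln(0.9), roughly 0.105
```

A confident, correct prediction gives a small loss; putting the mass on the wrong reading (probability 0.1 on "zhang3") would give -ln(0.1), roughly 2.30.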
For S0236, steps S0231 to S0236 are repeatedly performed until the first loss value reaches a first convergence condition or the number of iterations reaches the number of training samples in the first training sample set.
The first convergence condition means that the first loss values of two adjacent calculations satisfy the Lipschitz (Lipschitz continuity) condition.
The number of iterations is the number of times the initial model has been used to calculate the first sample phoneme predicted value; each calculation of the first sample phoneme predicted value increases the count by 1.
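The stopping rule described above (converge, or exhaust the training set) can be sketched as a generic per-sample loop. This is an assumed outline: `step_fn` stands in for one forward/backward/update pass and `converged` for whatever adjacent-loss convergence test is used; neither is specified beyond the description above:

```python
def run_stage(samples, step_fn, converged):
    """One MLM training stage: call step_fn (forward, loss, update) per
    sample, stopping early when two adjacent loss values satisfy the
    convergence test, or after every sample has been used once."""
    prev_loss, iterations = None, 0
    for sample in samples:
        loss = step_fn(sample)
        iterations += 1
        if prev_loss is not None and converged(prev_loss, loss):
            break
        prev_loss = loss
    return iterations

losses = iter([1.0, 0.5, 0.495, 0.49])
n = run_stage([1, 2, 3, 4], lambda s: next(losses),
              lambda a, b: abs(a - b) < 0.01)
# stops at iteration 3: |0.5 - 0.495| < 0.01
```

The same loop shape applies to the second and third stages (S0246, S0256), with their respective models and loss functions.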
In one embodiment, the step of performing polyphone random pinyin substitution training on the initial model after polyphone mask training by using the second training sample set to obtain the initial model after polyphone random pinyin substitution training includes:
S0241: obtaining a training sample from the second training sample set as a second training sample;
S0242: inputting the text sample data of the second training sample into the initial model after polyphone mask training, which performs single-character pinyin prediction, randomly substitutes another pinyin for the single-character pinyin corresponding to polyphones based on the polyphone labels of the phoneme calibration data of the second training sample, and performs polyphone phoneme prediction, to obtain a second sample phoneme predicted value;
S0243: extracting the phoneme predicted value of the polyphone from the second sample phoneme predicted value based on the polyphone labels of the phoneme calibration data of the second training sample, to obtain a second polyphone phoneme predicted value;
S0244: extracting the phoneme calibration value of the polyphone from the phoneme calibration data of the second training sample based on the polyphone labels of that data, to obtain a second polyphone phoneme calibration value;
S0245: inputting the second polyphone phoneme predicted value and the second polyphone phoneme calibration value into a second loss function to obtain a second loss value of the initial model after polyphone mask training, updating the parameters of that model according to the second loss value, and using the updated model for the next calculation of the second sample phoneme predicted value;
S0246: repeating the step of obtaining a training sample from the second training sample set until the second loss value reaches a second convergence condition or the number of iterations reaches the number of training samples in the second training sample set, and taking the resulting model as the initial model after polyphone random pinyin substitution training;
wherein the second loss function is a cross entropy loss function.
By performing polyphone random pinyin substitution training on the initial model after polyphone mask training, this embodiment improves the model's ability to handle anomalies and its robustness.
For S0241, sequentially obtaining a training sample from the second training sample set, and taking the obtained training sample as a second training sample.
For S0242, the text sample data of the second training sample is input into the initial model after polyphone mask training, which performs phoneme prediction in the order of single-character pinyin prediction, random pinyin substitution of the single-character pinyin corresponding to polyphones, and polyphone phoneme prediction, where the substitution is based on the polyphone labels of the phoneme calibration data of the second training sample. For example, the text sample data of the second training sample is "length, width, height"; single-character pinyin prediction yields "[CLS] zhang3 kuan1 gao1"; the phoneme calibration data of the second training sample is "chang2# kuan1 gao1"; and substituting a random pinyin for the single-character pinyin at the polyphone position yields "[CLS] jian3 kuan1 gao1", that is, the random pinyin "jian3" is substituted at the polyphone position, which is not specifically limited herein.
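The substitution step in this example can be sketched as follows. This is an assumed helper: the candidate pinyin list and the rule of drawing a replacement different from the current token are illustrative choices, not specified by the patent:

```python
import random

def substitute_polyphones(pinyin_tokens, polyphone_positions, candidates, seed=None):
    """Randomly replace the pinyin at each polyphone position with a
    different pinyin drawn from a candidate list (the substitution
    described for S0242)."""
    rng = random.Random(seed)
    out = list(pinyin_tokens)
    for i in polyphone_positions:
        choices = [c for c in candidates if c != out[i]]
        out[i] = rng.choice(choices)
    return out

subst = substitute_polyphones(["chang2", "kuan1", "gao1"], [0],
                              ["jiang3", "zhang3", "chang2"], seed=0)
# the polyphone slot no longer holds 'chang2'; the other tokens are untouched
```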
For S0243, based on the position of the polyphone label in the phoneme calibration data of the second training sample, the phoneme predicted value of the polyphone is extracted from the second sample phoneme predicted value as the second polyphone phoneme predicted value. For example, the second sample phoneme predicted value is "[CLS] chang2 kuan1 gao1" and the phoneme calibration data of the second training sample is "chang2# kuan1 gao1"; the polyphone label sits at the 1st character, so "chang2" is extracted as the second polyphone phoneme predicted value, which is not specifically limited herein.
For S0244, the phoneme calibration value of the polyphone is extracted from the phoneme calibration data of the second training sample based on its polyphone labels, as the second polyphone phoneme calibration value. For example, in the phoneme calibration data "chang2# kuan1 gao1", "#" is the polyphone label, so the labeled phoneme "chang2" serves as the second polyphone phoneme calibration value, which is not specifically limited herein.
For S0245, since the second loss function is a cross entropy loss function, the second polyphone phoneme predicted value and the second polyphone phoneme calibration value may be input into it using any calculation method known in the art, which is not described herein.
For S0246, steps S0241 to S0246 are repeatedly performed until the second loss value reaches a second convergence condition or the number of iterations reaches the number of training samples in the second training sample set.
The second convergence condition means that the second loss values of two adjacent calculations satisfy the Lipschitz (Lipschitz continuity) condition.
The number of iterations is the number of times the initial model after polyphone mask training has been used to calculate the second sample phoneme predicted value; each calculation of the second sample phoneme predicted value increases the count by 1.
In one embodiment, the step of training the initial model after polyphone random pinyin substitution training by using the third training sample set to obtain the phoneme prediction model includes:
S0251: obtaining a training sample from the third training sample set as a third training sample;
S0252: inputting the text sample data of the third training sample into the initial model after polyphone random pinyin substitution training for phoneme prediction, to obtain a third sample phoneme predicted value;
S0253: extracting the phoneme predicted value of the polyphone from the third sample phoneme predicted value based on the polyphone labels of the phoneme calibration data of the third training sample, to obtain a third polyphone phoneme predicted value;
S0254: extracting the phoneme calibration value of the polyphone from the phoneme calibration data of the third training sample based on the polyphone labels of that data, to obtain a third polyphone phoneme calibration value;
S0255: inputting the third polyphone phoneme predicted value and the third polyphone phoneme calibration value into a third loss function to obtain a third loss value of the initial model after polyphone random pinyin substitution training, updating the parameters of that model according to the third loss value, and using the updated model for the next calculation of the third sample phoneme predicted value;
S0256: repeating the step of obtaining a training sample from the third training sample set until the third loss value reaches a third convergence condition or the number of iterations reaches the number of training samples in the third training sample set, and taking the resulting model as the phoneme prediction model;
wherein the third loss function is a cross entropy loss function.
By training the initial model after polyphone random pinyin substitution training without masking or substituting the single-character pinyin corresponding to polyphones during phoneme prediction, this embodiment improves the accuracy of model prediction.
For S0251, sequentially obtaining a training sample from the third training sample set, and taking the obtained training sample as a third training sample.
For S0252, the text sample data of the third training sample is input into the initial model after polyphone random pinyin substitution training for phoneme prediction. For example, the text sample data of the third training sample is "grown taller"; single-character pinyin prediction yields "[CLS] zhang3 gao1 le5", and the phoneme calibration data of the third training sample is "zhang3# gao1 le5", which is not specifically limited herein.
For S0253, based on the position of the polyphone label in the phoneme calibration data of the third training sample, the phoneme predicted value of the polyphone is extracted from the third sample phoneme predicted value as the third polyphone phoneme predicted value. For example, the third sample phoneme predicted value is "[CLS] zhang3 gao1 le5" and the phoneme calibration data of the third training sample is "zhang3# gao1 le5"; the polyphone label sits at the 1st character, so "zhang3" is extracted as the third polyphone phoneme predicted value, which is not specifically limited herein.
For S0254, the phoneme calibration value of the polyphone is extracted from the phoneme calibration data of the third training sample based on its polyphone labels, as the third polyphone phoneme calibration value. For example, in the phoneme calibration data "zhang3# gao1 le5", "#" is the polyphone label, so the labeled phoneme "zhang3" serves as the third polyphone phoneme calibration value, which is not specifically limited herein.
For S0255, since the third loss function is a cross entropy loss function, the third polyphone phoneme predicted value and the third polyphone phoneme calibration value may be input into it using any calculation method known in the art, which is not described herein.
For S0256, steps S0251 to S0256 are repeatedly performed until the third loss value reaches a third convergence condition or the number of iterations reaches the number of training samples in the third training sample set.
The third convergence condition means that the third loss values of two adjacent calculations satisfy the Lipschitz (Lipschitz continuity) condition.
The number of iterations is the number of times the initial model after polyphone random pinyin substitution training has been used to calculate the third sample phoneme predicted value; each calculation of the third sample phoneme predicted value increases the count by 1.
Referring to fig. 2, the present application further proposes a phoneme prediction apparatus suitable for polyphones, the apparatus comprising:
a data acquisition module 100, configured to acquire text data to be predicted;
The preprocessing module 200 is configured to perform sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted, so as to obtain preprocessed text data;
the phoneme prediction module 300 is configured to input the preprocessed text data into a phoneme prediction model for phoneme prediction, where the phoneme prediction model is a model trained based on the Bert model, a plurality of training samples containing polyphones, and the MLM model training method, the MLM model training method including: the polyphone mask training method, the polyphone random pinyin substitution training method, and the non-mask, non-substitution training method;
and the target phoneme prediction result determining module 400 is configured to obtain a target phoneme prediction result output by the phoneme prediction model.
In this embodiment, the text data to be predicted is first subjected to sentence structure analysis, text regularization, word segmentation and part-of-speech prediction to obtain the preprocessed text data, which improves the accuracy of phoneme prediction; the preprocessed text data is then input into a phoneme prediction model for phoneme prediction, the phoneme prediction model being trained based on the Bert model, a plurality of training samples containing polyphones, and the MLM model training method, which comprises the polyphone mask training method, the polyphone random pinyin substitution training method, and the non-mask, non-substitution training method; by relying on the Bert model's strong context semantic recognition capability, no vocabulary or rules need to be maintained manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with a plurality of training samples containing polyphones and the MLM model training method yields a phoneme prediction model with improved ability to predict the phonemes of polyphones in complex contexts, reduced maintenance cost, and improved coverage and accuracy of polyphone phoneme prediction.
Referring to fig. 3, a computer device is further provided in the embodiment of the present application; the computer device may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operating system and the computer programs in the non-volatile storage medium to run. The database of the computer device is used for storing data involved in the phoneme prediction method suitable for polyphones. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a phoneme prediction method suitable for polyphones.
The phoneme prediction method suitable for polyphones comprises the following steps: acquiring text data to be predicted; performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model trained based on the Bert model, a plurality of training samples containing polyphones and the MLM model training method, the MLM model training method including the polyphone mask training method, the polyphone random pinyin substitution training method, and the non-mask, non-substitution training method; and obtaining a target phoneme prediction result output by the phoneme prediction model.
In this embodiment, the text data to be predicted is first subjected to sentence structure analysis, text regularization, word segmentation and part-of-speech prediction to obtain the preprocessed text data, which improves the accuracy of phoneme prediction; the preprocessed text data is then input into a phoneme prediction model trained based on the Bert model, a plurality of training samples containing polyphones, and the MLM model training method, which comprises the polyphone mask training method, the polyphone random pinyin substitution training method, and the non-mask, non-substitution training method; by relying on the Bert model's strong context semantic recognition capability, no vocabulary or rules need to be maintained manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with a plurality of training samples containing polyphones and the MLM model training method improves the ability to predict the phonemes of polyphones in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
An embodiment of the present application further provides a computer readable storage medium on which a computer program is stored which, when executed by a processor, implements a phoneme prediction method suitable for polyphones, comprising the steps of: acquiring text data to be predicted; performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model trained based on the Bert model, a plurality of training samples containing polyphones and the MLM model training method, the MLM model training method including the polyphone mask training method, the polyphone random pinyin substitution training method, and the non-mask, non-substitution training method; and obtaining a target phoneme prediction result output by the phoneme prediction model.
In the phoneme prediction method suitable for polyphones implemented here, the text data to be predicted is first subjected to sentence structure analysis, text regularization, word segmentation and part-of-speech prediction to obtain preprocessed text data, which improves the accuracy of phoneme prediction. The preprocessed text data is then input into a phoneme prediction model for phoneme prediction, the phoneme prediction model being obtained by training, based on a Bert model, on a plurality of training samples containing polyphones using an MLM model training method, the MLM model training method comprising: a polyphone mask training method, a polyphone random pinyin substitution training method, and a no-mask, no-substitution training method. By drawing on the Bert model's strong contextual semantic recognition capability, no word lists or rules need to be added manually, which improves the coverage and intelligence of polyphone phoneme prediction; and training on a plurality of polyphone-containing samples with the MLM model training method yields a phoneme prediction model with better prediction of polyphone phonemes in complex contexts, lower maintenance cost, and improved coverage and accuracy of polyphone phoneme prediction.
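For intuition, the text regularization step that rewrites non-Chinese digits and punctuation as Chinese characters might look like the following simplified sketch; the mapping tables are illustrative assumptions and far from complete (real regularization also expands dates, units, and multi-digit numbers):

```python
# Digit-to-hanzi map, read digit by digit; a real system would verbalize
# whole numbers ("12" -> "十二"), dates, units, and so on.
DIGIT_TO_HANZI = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                  "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
ASCII_TO_CN_PUNCT = {",": "，", ".": "。", "?": "？", "!": "！"}

def regularize_text(text):
    """Map each ASCII digit to its Chinese character and each listed ASCII
    punctuation mark to its Chinese full-width form; everything else
    passes through unchanged."""
    return "".join(
        DIGIT_TO_HANZI.get(ch, ASCII_TO_CN_PUNCT.get(ch, ch)) for ch in text
    )
```
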
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer readable storage medium which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium provided herein and used in embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a … …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that comprises that element.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit its claim scope; all equivalent structures or equivalent processes made using the description and drawings of the present application, and all direct or indirect applications in other related technical fields, are likewise included within the claim scope of the present application.

Claims (9)

1. A method of phoneme prediction for a polyphone, the method comprising:
acquiring text data to be predicted;
performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data, the text regularization comprising converting non-Chinese punctuation or numbers in the text data into Chinese character form;
inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is obtained by training, based on a Bert model, on a plurality of training samples containing polyphones using an MLM model training method, the MLM model training method comprising: a polyphone mask training method, a polyphone random pinyin substitution training method, and a no-mask, no-substitution training method; wherein each training sample of the plurality of training samples comprises text sample data and phoneme calibration data, the text sample data comprising at least one polyphone and the phoneme calibration data carrying polyphone marks;
obtaining a target phoneme prediction result output by the phoneme prediction model;
before the step of inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, the method further comprises the following steps:
obtaining the plurality of training samples, wherein each training sample of the plurality of training samples comprises: text sample data and phoneme calibration data, wherein the text sample data comprises at least one polyphone;
dividing the plurality of training samples by adopting a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set;
performing polyphone mask training on an initial model using the first training sample set to obtain, upon completion of training, the initial model after polyphone mask training, the initial model being a model obtained based on the Bert model;
performing polyphone random pinyin substitution training on the initial model after polyphone mask training using the second training sample set to obtain, upon completion of training, the initial model after polyphone random pinyin substitution training; and
training the initial model after polyphone random pinyin substitution training using the third training sample set to obtain, upon completion of training, the phoneme prediction model.
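The three-stage procedure recited in claim 1 can be sketched as a simple orchestration loop; `run_stage` here is a placeholder for a real fine-tuning routine and the stage names are this sketch's own labels, not the patent's terms:

```python
def train_phoneme_model(initial_model, sets, run_stage):
    """Chain the three training stages: each stage consumes one of the
    three sample sets and the model produced by the previous stage.
    `run_stage(model, samples, strategy)` must return the updated model."""
    model = initial_model
    for strategy, samples in zip(("mask", "random", "none"), sets):
        model = run_stage(model, samples, strategy)
    return model
```

The key design point mirrored here is that the stages are sequential, so the random-substitution stage starts from the mask-trained weights rather than from scratch.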
2. A phoneme prediction method as claimed in claim 1, wherein the step of obtaining the plurality of training samples comprises:
acquiring a plurality of pieces of text data to be processed, wherein each piece of text data to be processed in the plurality of pieces of text data to be processed comprises at least one polyphone;
respectively carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on each text data to be processed in the plurality of text data to be processed to obtain the text sample data corresponding to each text data to be processed;
and performing pinyin calibration on each of the text sample data corresponding to the plurality of pieces of text data to be processed, to obtain the phoneme calibration data corresponding to the text sample data of each piece of text data to be processed, the phoneme calibration data carrying polyphone marks.
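A hedged sketch of assembling one training sample as described in claim 2; the tiny `POLYPHONES` set and the dictionary layout are illustrative stand-ins for a real polyphone dictionary and data schema:

```python
# Tiny illustrative polyphone set; a real system would load a full dictionary.
POLYPHONES = {"长", "重", "乐"}

def build_sample(text, pinyin):
    """Pair each character with its calibrated pinyin and record which
    positions hold polyphones (the "polyphone marks" of claim 2)."""
    if len(text) != len(pinyin):
        raise ValueError("text and pinyin must align character by character")
    marks = [i for i, ch in enumerate(text) if ch in POLYPHONES]
    return {"text": list(text), "phonemes": list(pinyin),
            "polyphone_marks": marks}
```
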
3. The method for predicting phonemes for a polyphone of claim 1, wherein the step of dividing the plurality of training samples using a preset division rule to obtain a first training sample set, a second training sample set, and a third training sample set includes:
dividing the plurality of training samples according to an 8:1:1 dividing rule to obtain the first training sample set, the second training sample set and the third training sample set.
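The 8:1:1 division rule of claim 3 can be sketched as follows; the shuffle, the seed, and the handling of remainder samples are choices made for this example, not specified by the claim:

```python
import random

def split_samples(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle, then cut into three sets in the given proportion
    (8:1:1 per claim 3). Any remainder falls into the third set."""
    pool = list(samples)
    random.Random(seed).shuffle(pool)
    total, n = sum(ratios), len(pool)
    a = n * ratios[0] // total
    b = a + n * ratios[1] // total
    return pool[:a], pool[a:b], pool[b:]
```
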
4. The method for predicting phonemes for a polyphone of claim 1, wherein the step of performing polyphone mask training on the initial model using the first training sample set and obtaining the initial model after polyphone mask training upon completion of training includes:
obtaining a training sample from the first training sample set to obtain a first training sample;
based on the polyphone marks of the phoneme calibration data of the first training sample, inputting the text sample data of the first training sample into the initial model for per-character pinyin prediction, masking the pinyin of the characters corresponding to polyphones, and performing polyphone phoneme prediction to obtain a first sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the first sample phoneme predicted value based on the polyphone mark of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme predicted value;
extracting a phoneme calibration value of a polyphone from the phoneme calibration data of the first training sample based on the polyphone marks of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme calibration value;
inputting the first polyphone phoneme predicted value and the first polyphone phoneme calibration value into a first loss function for calculation to obtain a first loss value of the initial model, updating parameters of the initial model according to the first loss value, and using the updated initial model for calculating the first sample phoneme predicted value next time;
repeating the step of obtaining a training sample from the first training sample set until the first loss value reaches a first convergence condition or the number of iterations reaches the number of training samples in the first training sample set, and determining the initial model for which the first loss value reaches the first convergence condition, or for which the number of iterations reaches the number of training samples in the first training sample set, as the initial model after polyphone mask training;
Wherein the first loss function employs a cross entropy loss function.
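Claims 4 to 6 extract predictions and calibration values at the marked polyphone positions before applying the cross-entropy loss; that restriction can be illustrated as below, where the probability tables and target indexing are this sketch's own simplifications:

```python
import math

def polyphone_cross_entropy(pred_probs, target_ids, marks):
    """Cross-entropy averaged over the marked polyphone positions only;
    predictions at the other (monophone) positions do not enter the loss.
    pred_probs[i] is the model's probability distribution at position i,
    target_ids[i] the index of the calibrated pinyin at position i."""
    losses = [-math.log(pred_probs[i][target_ids[i]]) for i in marks]
    return sum(losses) / len(losses)
```
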
5. The method for predicting phonemes for a polyphone of claim 1, wherein said step of performing a polyphone random pinyin substitution training on said initial model after said polyphone mask training using said second set of training samples, and ending the training to obtain a polyphone random pinyin substitution trained initial model, comprises:
obtaining a training sample from the second training sample set to obtain a second training sample;
based on the polyphone marks of the phoneme calibration data of the second training sample, inputting the text sample data of the second training sample into the initial model after polyphone mask training for per-character pinyin prediction, randomly substituting the pinyin of the characters corresponding to polyphones with other pinyin, and performing polyphone phoneme prediction to obtain a second sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the second sample phoneme predicted value based on the polyphone mark of the phoneme calibration data of the second training sample to obtain a second polyphone phoneme predicted value;
extracting a phoneme calibration value of a polyphone from the phoneme calibration data of the second training sample based on the polyphone marks of the phoneme calibration data of the second training sample to obtain a second polyphone phoneme calibration value;
inputting the second polyphone phoneme predicted value and the second polyphone phoneme calibration value into a second loss function for calculation to obtain a second loss value of the initial model after polyphone mask training, updating the parameters of the initial model after polyphone mask training according to the second loss value, and using the updated model for the next calculation of the second sample phoneme predicted value;
repeating the step of obtaining a training sample from the second training sample set until the second loss value reaches a second convergence condition or the number of iterations reaches the number of training samples in the second training sample set, and determining the resulting model as the initial model after polyphone random pinyin substitution training;
wherein the second loss function employs a cross entropy loss function.
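The stopping rule shared by claims 4 to 6 (stop when the loss meets the convergence condition, or when the iteration count reaches the number of samples in the set) can be sketched as a generic stage loop; `step` and `converged` are placeholder callables, not APIs defined by the patent:

```python
def run_stage(model, samples, step, converged):
    """One training stage: iterate over the stage's samples, updating the
    model via `step` (which returns the new model and its loss value),
    and stop once `converged(loss)` holds or every sample has been used,
    i.e. the iteration count has reached len(samples)."""
    for sample in samples:
        model, loss = step(model, sample)
        if converged(loss):
            break
    return model
```
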
6. The method for predicting phonemes for a polyphone of claim 1, wherein the step of training the initial model after polyphone random pinyin substitution training using the third training sample set and obtaining the phoneme prediction model upon completion of training includes:
obtaining a training sample from the third training sample set to obtain a third training sample;
inputting the text sample data of the third training sample into the initial model after polyphone random pinyin substitution training to perform phoneme prediction, obtaining a third sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the third sample phoneme predicted value based on the polyphone marks of the phoneme calibration data of the third training sample to obtain a third polyphone phoneme predicted value;
extracting a phoneme calibration value of a polyphone from the phoneme calibration data of the third training sample based on the polyphone marks of the phoneme calibration data of the third training sample to obtain a third polyphone phoneme calibration value;
inputting the third polyphone phoneme predicted value and the third polyphone phoneme calibration value into a third loss function for calculation to obtain a third loss value of the initial model after polyphone random pinyin substitution training, updating the parameters of that model according to the third loss value, and using the updated model for the next calculation of the third sample phoneme predicted value;
repeating the step of obtaining a training sample from the third training sample set until the third loss value reaches a third convergence condition or the number of iterations reaches the number of training samples in the third training sample set, and determining the resulting model as the phoneme prediction model;
wherein the third loss function employs a cross entropy loss function.
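At inference time, once the third stage has produced the phoneme prediction model, selecting a polyphone's pronunciation amounts to taking its most probable candidate reading. This helper is an illustrative assumption about the decoding step, not part of the claims:

```python
def pick_reading(probs, candidate_readings):
    """Choose the highest-probability pinyin among a polyphone's known
    candidate readings, given the model's probability table at that
    position (readings absent from the table count as probability 0)."""
    return max(candidate_readings, key=lambda r: probs.get(r, 0.0))
```
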
7. A phoneme prediction apparatus adapted for use with polyphones, the apparatus comprising:
the data acquisition module is used for acquiring text data to be predicted;
the preprocessing module is used for carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
the phoneme prediction module is configured to input the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is obtained by training, based on a Bert model, on a plurality of training samples containing polyphones using an MLM model training method, the MLM model training method comprising: a polyphone mask training method, a polyphone random pinyin substitution training method, and a no-mask, no-substitution training method;
The target phoneme prediction result determining module is used for obtaining a target phoneme prediction result output by the phoneme prediction model;
wherein, before the preprocessed text data is input into the phoneme prediction model for phoneme prediction, the following steps are further performed:
obtaining the plurality of training samples, wherein each training sample of the plurality of training samples comprises: text sample data and phoneme calibration data, wherein the text sample data comprises at least one polyphone;
dividing the plurality of training samples by adopting a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set;
performing polyphone mask training on an initial model using the first training sample set to obtain, upon completion of training, the initial model after polyphone mask training, the initial model being a model obtained based on the Bert model;
performing polyphone random pinyin substitution training on the initial model after polyphone mask training using the second training sample set to obtain, upon completion of training, the initial model after polyphone random pinyin substitution training; and
training the initial model after polyphone random pinyin substitution training using the third training sample set to obtain, upon completion of training, the phoneme prediction model.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202110342957.4A 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones Active CN112800748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110342957.4A CN112800748B (en) 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones


Publications (2)

Publication Number Publication Date
CN112800748A (en) 2021-05-14
CN112800748B (en) 2023-05-12

Family

ID=75816027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110342957.4A Active CN112800748B (en) 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones

Country Status (1)

Country Link
CN (1) CN112800748B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417832B (en) * 2021-12-08 2023-05-05 马上消费金融股份有限公司 Disambiguation method, training method and device of disambiguation model
CN114386391B (en) * 2022-01-11 2023-08-15 平安科技(深圳)有限公司 Sentence vector feature extraction method, device, equipment and medium based on artificial intelligence

Citations (1)

Publication number Priority date Publication date Assignee Title
CN112348073A (en) * 2020-10-30 2021-02-09 北京达佳互联信息技术有限公司 Polyphone recognition method and device, electronic equipment and storage medium

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
CN105336322B (en) * 2015-09-30 2017-05-10 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN111862961A (en) * 2019-04-29 2020-10-30 京东数字科技控股有限公司 Method and device for recognizing voice
CN110807331B (en) * 2019-10-24 2022-07-01 百度在线网络技术(北京)有限公司 Polyphone pronunciation prediction method and device and electronic equipment
CN111144110B (en) * 2019-12-27 2024-06-04 科大讯飞股份有限公司 Pinyin labeling method, device, server and storage medium
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN112084337B (en) * 2020-09-17 2024-02-09 腾讯科技(深圳)有限公司 Training method of text classification model, text classification method and equipment
CN111967260A (en) * 2020-10-20 2020-11-20 北京金山数字娱乐科技有限公司 Polyphone processing method and device and model training method and device
CN112417877B (en) * 2020-11-24 2022-09-27 广州平云信息科技有限公司 Text inclusion relation recognition method based on improved BERT
CN112365876B (en) * 2020-11-27 2022-04-12 北京百度网讯科技有限公司 Method, device and equipment for training speech synthesis model and storage medium


Also Published As

Publication number Publication date
CN112800748A (en) 2021-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40046361)
GR01 Patent grant