CN112800748A - Phoneme prediction method, device and equipment suitable for polyphone and storage medium - Google Patents


Info

Publication number
CN112800748A
Authority
CN
China
Prior art keywords
training
phoneme
polyphone
model
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110342957.4A
Other languages
Chinese (zh)
Other versions
CN112800748B (en)
Inventor
苏雪琦
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110342957.4A priority Critical patent/CN112800748B/en
Publication of CN112800748A publication Critical patent/CN112800748A/en
Application granted granted Critical
Publication of CN112800748B publication Critical patent/CN112800748B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and discloses a phoneme prediction method, apparatus, device and storage medium suitable for polyphones. The method comprises the following steps: performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on text data to be predicted to obtain preprocessed text data, and then inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language model) training method, the MLM training method comprising: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-masking, non-substitution training method; and acquiring a target phoneme prediction result output by the phoneme prediction model. The coverage and intelligence of phoneme prediction are improved, and the maintenance cost is reduced.

Description

Phoneme prediction method, device and equipment suitable for polyphone and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for predicting phonemes suitable for polyphones.
Background
Grapheme-to-Phoneme (G2P) conversion is one of the key links in the text front end of a Chinese TTS (text-to-speech) system; its function is to convert Chinese, a logographic script, into phonemes recognizable by an acoustic model. Because a large number of Chinese characters are polyphones (a single character with multiple pronunciations), rule-based methods such as word-list lookup and regular-expression matching are commonly used in grapheme-to-phoneme conversion to assign the correct phonemes to polyphones. However, rule-based methods have the following problems: (1) coverage is insufficient and the approach lacks intelligence, because gaps can only be fixed by manually adding word lists or rules, so slow service response is difficult to avoid; (2) maintenance costs are high, and it is difficult for rules to fully cover readings that depend on the semantics of a particular context.
Disclosure of Invention
The main purpose of the present application is to provide a phoneme prediction method, apparatus, device and storage medium suitable for polyphones, aiming to solve the technical problems that the rule-based methods used in the prior art to assign correct phonemes to polyphones suffer from insufficient coverage, lack of intelligence and high maintenance cost.
In order to achieve the above object, the present application proposes a phoneme prediction method suitable for polyphones, the method comprising:
acquiring text data to be predicted;
carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language model) training method, and the MLM training method comprises: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-masking, non-substitution training method;
and acquiring a target phoneme prediction result output by the phoneme prediction model.
Further, before the step of inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, the method further includes:
obtaining the plurality of training samples, wherein each training sample of the plurality of training samples comprises: text sample data and phoneme calibration data, wherein the text sample data comprises at least one polyphone;
dividing the training samples by adopting a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set;
performing polyphonic mask training on an initial model by using the first training sample set, and obtaining the initial model after polyphonic mask training after the training is finished, wherein the initial model is a model obtained based on the Bert model;
performing polyphone random pinyin substitution training on the initial model after the polyphone mask training by adopting the second training sample set, and obtaining the initial model after the polyphone random pinyin substitution training after the training is finished;
and training the initial model after the polyphone random pinyin substitution training by adopting the third training sample set, and obtaining the phoneme prediction model after the training is finished.
Further, the step of obtaining the plurality of training samples includes:
acquiring a plurality of text data to be processed, wherein each text data to be processed in the plurality of text data to be processed comprises at least one polyphone;
respectively carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on each text data to be processed in the plurality of text data to be processed to obtain the text sample data corresponding to the plurality of text data to be processed;
and performing pinyin calibration on each text sample data corresponding to the plurality of text data to be processed respectively to obtain the phoneme calibration data corresponding to the text sample data of each text data to be processed, wherein the phoneme calibration data carries polyphone marks.
Further, the step of dividing the training samples by using a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set includes:
and dividing the plurality of training samples by adopting a division rule of 8:1:1 to obtain the first training sample set, the second training sample set and the third training sample set.
Further, the step of performing polyphonic mask training on the initial model by using the first training sample set, and obtaining the initial model after the polyphonic mask training after the training is finished includes:
obtaining a training sample from the first training sample set to obtain a first training sample;
inputting the text sample data of the first training sample into the initial model to perform single character pinyin prediction, and performing mask and polyphone phoneme prediction on single character pinyin corresponding to polyphone based on polyphone marks of the phoneme calibration data of the first training sample to obtain a first sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the first sample phoneme predicted value based on the polyphone label of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme predicted value;
extracting phoneme calibration values of polyphones from the phoneme calibration data of the first training sample based on the polyphone labels of the phoneme calibration data of the first training sample to obtain first polyphone phoneme calibration values;
inputting the first polyphone phoneme predicted value and the first polyphone phoneme calibration value into a first loss function for calculation to obtain a first loss value of the initial model, updating parameters of the initial model according to the first loss value, and using the updated initial model for calculating the first sample phoneme predicted value next time;
repeatedly executing the step of obtaining a training sample from the first training sample set to obtain a first training sample until the first loss value reaches a first convergence condition or the number of iterations reaches the number of the training samples in the first training sample set, and determining the initial model with the first loss value reaching the first convergence condition or the number of iterations reaching the number of the training samples in the first training sample set as the initial model after the polyphonic mask training;
wherein the first loss function is a cross entropy loss function.
Further, the step of performing polyphone random pinyin substitution training on the initial model after the polyphone mask training by adopting the second training sample set, and obtaining the initial model after the polyphone random pinyin substitution training after the training is finished includes:
obtaining a training sample from the second training sample set to obtain a second training sample;
inputting the text sample data of the second training sample into the initial model after the polyphonic mask training for single character pinyin prediction, and randomly replacing single character pinyin corresponding to polyphonic characters by pinyin and predicting polyphonic characters based on polyphonic character marks of the phoneme calibration data of the second training sample to obtain a second sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the phoneme predicted value of the second sample based on the polyphone label of the phoneme calibration data of the second training sample to obtain a phoneme predicted value of the second polyphone;
extracting phoneme calibration values of polyphones from the phoneme calibration data of the second training sample based on the polyphone labels of the phoneme calibration data of the second training sample to obtain second polyphone phoneme calibration values;
inputting the second polyphonic phoneme predicted value and the second polyphonic phoneme calibration value into a second loss function for calculation to obtain a second loss value of the initial model after the polyphonic mask training, updating parameters of the initial model after the polyphonic mask training according to the second loss value, and using the updated initial model after the polyphonic mask training for calculating the second sample phoneme predicted value next time;
repeatedly executing the step of obtaining a training sample from the second training sample set to obtain a second training sample until the second loss value reaches a second convergence condition or the number of iterations reaches the number of the training samples in the second training sample set, and determining the initial model after the polyphone mask training, in which the second loss value reaches the second convergence condition or the number of iterations reaches the number of the training samples in the second training sample set, as the initial model after the polyphone random pinyin substitution training;
wherein the second loss function is a cross-entropy loss function.
Further, the step of training the initial model after the random pinyin substitution training of the polyphones by using the third training sample set and obtaining the phoneme prediction model after the training is finished includes:
obtaining a training sample from the third training sample set to obtain a third training sample;
inputting the text sample data of the third training sample into the initial model after the polyphone random pinyin alternative training for phoneme prediction to obtain a third sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the phoneme predicted value of the third sample based on a polyphone label of the phoneme calibration data of the third training sample to obtain a third polyphone phoneme predicted value;
extracting phoneme calibration values of polyphones from the phoneme calibration data of the third training sample based on the polyphone labels of the phoneme calibration data of the third training sample to obtain third polyphone phoneme calibration values;
inputting the third polyphone phoneme predicted value and the third polyphone phoneme calibration value into a third loss function for calculation to obtain a third loss value of the initial model after the polyphone random pinyin substitution training, updating the parameter of the initial model after the polyphone random pinyin substitution training according to the third loss value, and using the updated initial model after the polyphone random pinyin substitution training for calculating the third sample phoneme predicted value next time;
repeatedly executing the step of obtaining a training sample from the third training sample set to obtain a third training sample until the third loss value reaches a third convergence condition or the number of iterations reaches the number of the training samples in the third training sample set, and determining the initial model after the polyphone random pinyin substitution training for which the third loss value reaches the third convergence condition or the number of iterations reaches the number of the training samples in the third training sample set as the phoneme prediction model;
wherein the third loss function is a cross-entropy loss function.
The present application also proposes a phoneme prediction apparatus suitable for polyphones, the apparatus comprising:
the data acquisition module is used for acquiring text data to be predicted;
the preprocessing module is used for carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
a phoneme prediction module, configured to input the preprocessed text data into a phoneme prediction model for phoneme prediction, where the phoneme prediction model is a model obtained by training based on a Bert model, multiple training samples containing polyphones, and an MLM (masked language model) training method, and the MLM training method includes: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-masking, non-substitution training method;
and the target phoneme prediction result determining module is used for acquiring a target phoneme prediction result output by the phoneme prediction model.
The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
According to the phoneme prediction method, apparatus, device and storage medium suitable for polyphones, the text data to be predicted is subjected to sentence structure analysis, text regularization, word segmentation and part-of-speech prediction to obtain preprocessed text data, which improves the accuracy of phoneme prediction; the preprocessed text data is input into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language model) training method, and the MLM training method comprises: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-masking, non-substitution training method. By relying on the excellent contextual semantic recognition capability of the Bert model, word lists or rules do not need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with a plurality of training samples containing polyphones and the MLM training method yields a phoneme prediction model that improves the ability to predict the phonemes of polyphones in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
Drawings
FIG. 1 is a flowchart illustrating a phoneme prediction method for polyphones according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating a structure of a phoneme prediction apparatus for polyphones according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The objectives, features, and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In order to solve the technical problems that the prior-art rule-based methods for assigning correct phonemes to polyphones suffer from insufficient coverage, lack of intelligence and high maintenance cost, the present application provides a phoneme prediction method applicable to polyphones, applied in the technical field of artificial intelligence. In this method, the text data to be predicted is subjected in turn to sentence structure analysis, text regularization, word segmentation and part-of-speech prediction, and is then input into a phoneme prediction model for phoneme prediction, where the phoneme prediction model is obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language model) training method. By relying on the excellent contextual semantic recognition capability of the Bert model, word lists or rules no longer need to be added manually; the coverage and intelligence of polyphone phoneme prediction are improved, the ability to predict the phonemes of polyphones in complex contexts is improved, and the maintenance cost is reduced.
Referring to fig. 1, an embodiment of the present application provides a phoneme prediction method applicable to polyphones, where the method includes:
s1: acquiring text data to be predicted;
s2: carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
s3: inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language model) training method, and the MLM training method comprises: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-masking, non-substitution training method;
s4: and acquiring a target phoneme prediction result output by the phoneme prediction model.
In this embodiment, the preprocessed text data is obtained by performing sentence structure analysis, text regularization, word segmentation and part-of-speech prediction on the text data to be predicted, which improves the accuracy of phoneme prediction; the preprocessed text data is input into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language model) training method, and the MLM training method comprises: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-masking, non-substitution training method. By relying on the excellent contextual semantic recognition capability of the Bert model, word lists or rules do not need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with a plurality of training samples containing polyphones and the MLM training method yields a phoneme prediction model that improves the ability to predict the phonemes of polyphones in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
For S1, the text data to be predicted input by the user may be acquired, and the text data to be predicted sent by the third-party application system may also be acquired.
The text data to be predicted is a text which needs phoneme prediction. The text data to be predicted includes Chinese characters.
For S2, sentence structure analysis is carried out on the text data to be predicted to obtain text data to be regularized; performing text regularization processing on the text data to be regularized to obtain text data to be segmented; performing word segmentation processing on the text data to be segmented to obtain segmented text data; and performing part-of-speech prediction on the text data after word segmentation to obtain the preprocessed text data.
And sentence structure analysis for dividing the text data to be predicted into sentences. Alternatively, sentence structure analysis can be implemented by using a model based on neural network training.
And the text regularization processing is used for converting non-Chinese punctuation or numbers in the text data to be regularized into Chinese expressions in the Chinese context. For example, performing text regularization on the text data "6.5" yields the text data "six point five"; this example is not specifically limiting. Optionally, the text regularization process may be implemented by a model based on neural network training.
And the word segmentation processing is used for segmenting sentences in the text data to be segmented according to semantics and segmenting Chinese characters of one word together during segmentation. Optionally, the word segmentation process may be implemented by using a model based on neural network training.
And the part-of-speech prediction is used for predicting the part of speech of each word in the text data after word segmentation. The parts of speech include: nouns, verbs, adjectives, quantifiers, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeic words. Optionally, the part-of-speech prediction may be implemented by using a model based on neural network training.
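To make the preprocessing chain in S2 concrete, the following is a minimal Python sketch; the four helper functions are trivial placeholders for the neural-network-based components mentioned above (sentence structure analysis, text regularization, word segmentation and part-of-speech prediction) and are illustrative assumptions, not the implementation of this application.

```python
import re

def split_sentences(text):      # placeholder for sentence structure analysis
    return [s for s in re.split(r"[。！？!?]", text) if s]

def normalize_text(sentence):   # placeholder for text regularization, e.g. "6.5" -> "six point five"
    return sentence.replace("6.5", "six point five")

def segment_words(sentence):    # placeholder for word segmentation (character-level here)
    return list(sentence)

def predict_pos(words):         # placeholder for part-of-speech prediction
    return ["noun"] * len(words)

def preprocess(text_to_predict):
    preprocessed = []
    for sentence in split_sentences(text_to_predict):
        words = segment_words(normalize_text(sentence))
        preprocessed.append({"words": words, "pos": predict_pos(words)})
    return preprocessed

print(preprocess("6.5"))
```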
For S3, a phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, with one action constituting one phoneme.
Inputting the preprocessed text data into the phoneme prediction model for phoneme prediction predicts the phoneme of each Chinese character in the preprocessed text data. It can be understood that, for Chinese, phoneme prediction is the prediction of pinyin. For example, the Chinese text "length, width and height" is converted into the phonemes "chang2 kuan1 gao1"; this example is not specifically limiting.
Optionally, the Bert (Bidirectional Encoder Representations from Transformers) model adopted in the present application is the open-source Bert model released by Google, an end-to-end model that outputs a predicted sequence after receiving text and that has good contextual semantic understanding capability.
MLM, i.e., Masked Language Model.
The initial model is obtained based on the Bert model; a plurality of training samples containing polyphones are then obtained, and the initial model is trained on these training samples using the polyphone mask training method, the polyphone random pinyin substitution training method and the non-masking, non-substitution training method, respectively, with the phoneme prediction model obtained when training is finished. The phoneme prediction model therefore inherits the excellent contextual semantic recognition capability of the Bert model, so that word lists or rules do not need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction; and training with a plurality of training samples containing polyphones and the MLM training method yields a phoneme prediction model that improves the ability to predict the phonemes of polyphones in complex contexts, reduces maintenance cost, and improves the coverage and accuracy of polyphone phoneme prediction.
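As a concrete illustration of how such an initial model might be set up, here is a hedged sketch; the patent only states that the initial model is obtained based on Google's open-source Bert model, so the use of the Hugging Face transformers library, the "bert-base-chinese" checkpoint and the size of the pinyin label set below are assumptions for illustration, not part of the disclosure.

```python
from transformers import BertTokenizerFast, BertForTokenClassification

NUM_PINYIN_LABELS = 1500  # assumed size of the pinyin/phoneme label inventory

# The initial model: a pretrained Bert encoder with a per-character
# classification head that predicts one pinyin label per character position.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
initial_model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_PINYIN_LABELS
)
```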
And S4, acquiring a phoneme prediction result output by the phoneme prediction model, and taking the obtained phoneme prediction result as a target phoneme prediction result corresponding to the text data to be predicted.
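The following hedged sketch illustrates what obtaining the target phoneme prediction result in S3-S4 could look like with a token-classification model of the kind sketched above; the id_to_pinyin mapping and the tokenization details are illustrative assumptions.

```python
import torch

def predict_phonemes(model, tokenizer, id_to_pinyin, text):
    # Tokenize the preprocessed text (roughly one token per Chinese character).
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
    label_ids = logits.argmax(dim=-1)[0].tolist()
    # Map predicted label ids back to pinyin strings, e.g. "chang2", "kuan1", "gao1".
    return [id_to_pinyin[i] for i in label_ids]
```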
After the step of obtaining the target phoneme prediction result output by the phoneme prediction model, the method further includes:
s5: carrying out prosody prediction on the target phoneme prediction result to obtain a prosody prediction result;
s6: and inputting the prosody prediction result into an acoustic model of the TTS system.
It is understood that steps S1 through S6 may be implemented as a text front-end system of a TTS system.
The method for performing prosody prediction on the target phoneme prediction result may be selected from the prior art, and is not described herein again.
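As a minimal sketch, steps S1 to S6 could be chained into a text front end as below; every component is passed in as a callable, since prosody prediction and the acoustic model are outside the scope of this application and the names used here are assumptions.

```python
def text_frontend(text, preprocess, predict_phonemes, predict_prosody, acoustic_model):
    preprocessed = preprocess(text)            # S2: sentence analysis, regularization, segmentation, POS
    phonemes = predict_phonemes(preprocessed)  # S3-S4: phoneme prediction model
    prosody = predict_prosody(phonemes)        # S5: prosody prediction
    return acoustic_model(prosody)             # S6: hand the result to the TTS acoustic model
```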
In an embodiment, before the step of inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, the method further includes:
s021: obtaining the plurality of training samples, wherein each training sample of the plurality of training samples comprises: text sample data and phoneme calibration data, wherein the text sample data comprises at least one polyphone;
s022: dividing the training samples by adopting a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set;
s023: performing polyphonic mask training on an initial model by using the first training sample set, and obtaining the initial model after polyphonic mask training after the training is finished, wherein the initial model is a model obtained based on the Bert model;
s024: performing polyphone random pinyin substitution training on the initial model after the polyphone mask training by adopting the second training sample set, and obtaining the initial model after the polyphone random pinyin substitution training after the training is finished;
s025: and training the initial model after the polyphone random pinyin substitution training by adopting the third training sample set, and obtaining the phoneme prediction model after the training is finished.
The embodiment realizes the phoneme prediction model obtained by training based on the Bert model, a plurality of training samples containing polyphones and the MLM model training method, thereby improving the prediction capability of phonemes of polyphones in a complex context, reducing the maintenance cost and improving the coverage and accuracy of polyphone phoneme prediction.
For S021, a plurality of training samples may be obtained from the database, a plurality of training samples input by the user may also be obtained, and a plurality of training samples sent by the third-party application system may also be obtained.
Each training sample includes a text sample data and a phoneme label data.
In the same training sample, the phoneme calibration data is the calibration result of the phoneme of each Chinese character in the text sample data. For example, the text sample data is "length, width and height", and the phoneme calibration data is "chang2# kuan1 gao1", which is not specifically limited by this example.
It will be appreciated that the phoneme calibration data carries polyphone marks. For example, the phoneme calibration data is "chang2# kuan1 gao1", where "#" is a polyphone mark, which is not specifically limited by this example.
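A minimal sketch of how phoneme calibration data carrying "#" polyphone marks, as in the example above, could be parsed into per-character pinyin labels and polyphone positions; the space-separated layout is an assumption for illustration.

```python
def parse_calibration(calibration):
    labels, polyphone_positions = [], []
    for i, token in enumerate(calibration.split()):
        if token.endswith("#"):              # "#" marks a polyphone, e.g. "chang2#"
            token = token[:-1]
            polyphone_positions.append(i)
        labels.append(token)
    return labels, polyphone_positions

labels, positions = parse_calibration("chang2# kuan1 gao1")
print(labels, positions)  # ['chang2', 'kuan1', 'gao1'] [0]
```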
For S022, a preset partition rule is adopted to partition each training sample in the plurality of training samples to obtain a first training sample set, a second training sample set, and a third training sample set, that is, each training sample in the plurality of training samples can only be partitioned into any one of the first training sample set, the second training sample set, and the third training sample set.
The preset partitioning rule includes but is not limited to: and (4) a preset division ratio.
The preset division rule can be acquired from a database and stored in the database, the preset division rule can also be acquired from a third-party application system, and the preset division rule can also be written into a program file for realizing the application.
And S023, sequentially adopting the training samples in the first training sample set to perform polyphone mask training on the initial model, and taking the trained model as the initial model after the polyphone mask training. Polyphone mask training is a training mode in which the single-character pinyin corresponding to the polyphones is masked during phoneme prediction.
And for S024, sequentially adopting the training samples in the second training sample set to perform polyphone random pinyin substitution training on the initial model after the polyphone mask training, and taking the trained model as the initial model after the polyphone random pinyin substitution training. Polyphone random pinyin substitution training is a training mode in which the single-character pinyin corresponding to the polyphones is randomly replaced with other pinyin during phoneme prediction.
And for S025, sequentially adopting the training samples in the third training sample set to train the initial model after the polyphone random pinyin substitution training, and taking the trained model as the phoneme prediction model; in this stage, neither masking nor substitution is applied to the single-character pinyin corresponding to the polyphones during phoneme prediction.
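The three stages differ only in how the single-character pinyin at a polyphone position is handled before polyphone phoneme prediction; a hedged sketch of that difference follows (the function name and the small pinyin inventory are illustrative assumptions).

```python
import random

PINYIN_INVENTORY = ["chang2", "zhang3", "jian3", "kuan1", "gao1"]  # assumed (truncated) label set

def corrupt_polyphone_pinyin(pinyin_seq, polyphone_positions, stage):
    out = list(pinyin_seq)
    for pos in polyphone_positions:
        if stage == "mask":              # S023: polyphone mask training
            out[pos] = "[mask]"
        elif stage == "random_replace":  # S024: polyphone random pinyin substitution training
            out[pos] = random.choice(PINYIN_INVENTORY)
        # stage == "plain": S025 leaves the single-character pinyin untouched
    return out

print(corrupt_polyphone_pinyin(["chang2", "kuan1", "gao1"], [0], "mask"))
```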
In an embodiment, the step of obtaining the training samples includes:
s0211: acquiring a plurality of text data to be processed, wherein each text data to be processed in the plurality of text data to be processed comprises at least one polyphone;
s0212: respectively carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on each text data to be processed in the plurality of text data to be processed to obtain the text sample data corresponding to the plurality of text data to be processed;
s0213: and performing pinyin calibration on each text sample data corresponding to the plurality of text data to be processed respectively to obtain the phoneme calibration data corresponding to the text sample data of each text data to be processed, wherein the phoneme calibration data carries polyphone marks.
According to the method and the device, sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction are carried out on the text data to be processed to obtain text sample data, so that the accuracy of the phoneme prediction model obtained through training is improved.
For S0211, a plurality of text data to be processed may be acquired from the database, a plurality of text data to be processed input by the user may also be acquired, and a plurality of text data to be processed transmitted by the third-party application system may also be acquired.
For S0212, extracting one text data to be processed from the plurality of text data to be processed to obtain target text data to be processed; carrying out sentence structure analysis on the target text data to be processed to obtain sample data to be regularized; performing text regularization processing on the sample data to be regularized to obtain sample data to be segmented; performing word segmentation processing on the sample data of the word to be segmented to obtain the sample data after word segmentation; performing part-of-speech prediction on the sample data after word segmentation to obtain the text sample data corresponding to the target text data to be processed; and repeatedly executing the step of extracting one text data to be processed from the plurality of text data to be processed to obtain the target text data to be processed until the text sample data corresponding to each text data to be processed in the plurality of text data to be processed is determined.
For S0213, extracting a text sample data from the text sample data corresponding to each of the plurality of text data to be processed to obtain text sample data to be calibrated; performing pinyin calibration on the text sample data to be calibrated to obtain the phoneme calibration data corresponding to the text sample data to be calibrated; repeatedly executing the step of extracting a text sample data from the text sample data corresponding to the text data to be processed to obtain a text sample data to be calibrated until the phoneme calibration data corresponding to the text sample data corresponding to the text data to be processed is determined.
In an embodiment, the step of dividing the training samples by using a preset dividing rule to obtain a first training sample set, a second training sample set, and a third training sample set includes:
and dividing the plurality of training samples by adopting a division rule of 8:1:1 to obtain the first training sample set, the second training sample set and the third training sample set.
In this embodiment, the training samples are divided according to the 8:1:1 division rule, which makes it possible to perform polyphone mask training with a large number of training samples, polyphone random pinyin substitution training with a smaller number of training samples, and non-masking, non-substitution training with a smaller number of training samples, which helps improve the robustness of the trained model.
Wherein 80% of the training samples in the plurality of training samples are divided into a first set of training samples, 10% of the training samples in the plurality of training samples are divided into a second set of training samples, and 10% of the training samples in the plurality of training samples are divided into a third set of training samples.
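A minimal sketch of the 8:1:1 division, assuming the training samples are held in a Python list; shuffling before dividing is an added assumption, not something stated in this application.

```python
import random

def split_samples(samples, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)          # assumed: shuffle before dividing
    n = len(samples)
    first = samples[: int(0.8 * n)]               # 80% -> first training sample set
    second = samples[int(0.8 * n): int(0.9 * n)]  # 10% -> second training sample set
    third = samples[int(0.9 * n):]                # 10% -> third training sample set
    return first, second, third
```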
In an embodiment, the step of performing polyphonic mask training on the initial model by using the first training sample set, and obtaining the initial model after the polyphonic mask training after the training is finished includes:
s0231: obtaining a training sample from the first training sample set to obtain a first training sample;
s0232: inputting the text sample data of the first training sample into the initial model to perform single character pinyin prediction, and performing mask and polyphone phoneme prediction on single character pinyin corresponding to polyphone based on polyphone marks of the phoneme calibration data of the first training sample to obtain a first sample phoneme prediction value;
s0233: extracting a phoneme predicted value of a polyphone from the first sample phoneme predicted value based on the polyphone label of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme predicted value;
s0234: extracting phoneme calibration values of polyphones from the phoneme calibration data of the first training sample based on the polyphone labels of the phoneme calibration data of the first training sample to obtain first polyphone phoneme calibration values;
s0235: inputting the first polyphone phoneme predicted value and the first polyphone phoneme calibration value into a first loss function for calculation to obtain a first loss value of the initial model, updating parameters of the initial model according to the first loss value, and using the updated initial model for calculating the first sample phoneme predicted value next time;
s0236: repeatedly executing the step of obtaining a training sample from the first training sample set to obtain a first training sample until the first loss value reaches a first convergence condition or the number of iterations reaches the number of the training samples in the first training sample set, and determining the initial model with the first loss value reaching the first convergence condition or the number of iterations reaching the number of the training samples in the first training sample set as the initial model after the polyphonic mask training;
wherein the first loss function is a cross entropy loss function.
In this application, polyphone mask training of the initial model improves the contextual semantic recognition capability of the model, so that word lists or rules do not need to be added manually, improving the coverage and intelligence of polyphone phoneme prediction.
For S0231, one training sample is sequentially obtained from the first training sample set, and the obtained training sample is used as the first training sample.
For S0232, the text sample data of the first training sample is input into the initial model for phoneme prediction in the following order: single-character pinyin prediction, masking of the single-character pinyin corresponding to the polyphones, and polyphone phoneme prediction, where the masking of the single-character pinyin corresponding to the polyphones is based on the polyphone marks of the phoneme calibration data of the first training sample. For example, the text sample data of the first training sample is "I have grown hair again"; single-character pinyin prediction on it yields "[CLS] wo3 you4 chang2 tou2 fa4 le5", the phoneme calibration data of the first training sample is "wo3 you4 zhang3# tou2 fa4 le5", and masking the single-character pinyin corresponding to the polyphone based on the polyphone mark of the phoneme calibration data of the first training sample yields "[CLS] wo3 you4 [mask] tou2 fa4 le5"; this example is not specifically limiting.
For S0233, based on the position data of the polyphone marks of the phoneme calibration data of the first training sample, the phoneme predicted value of the polyphone is extracted from the first sample phoneme predicted value, and the extracted data is taken as the first polyphone phoneme predicted value. For example, the first sample phoneme predicted value is "[CLS] wo3 you4 zhang3 tou2 fa4 le5", the phoneme calibration data of the first training sample is "wo3 you4 zhang3# tou2 fa4 le5", and the position data of the polyphone mark of the phoneme calibration data of the first training sample is the 3rd character; in this case, "zhang3" in the first sample phoneme predicted value "[CLS] wo3 you4 zhang3 tou2 fa4 le5" may be extracted as the first polyphone phoneme predicted value, which is not specifically limited by this example.
For S0234, based on the polyphone marks of the phoneme calibration data of the first training sample, the phoneme calibration value of the polyphone is extracted from the phoneme calibration data of the first training sample, and the extracted data is taken as the first polyphone phoneme calibration value. For example, the phoneme calibration data of the first training sample is "wo3 you4 zhang3# tou2 fa4 le5", where "#" is a polyphone mark, and the phoneme "zhang3" marked by the polyphone mark is taken as the first polyphone phoneme calibration value, which is not specifically limited by this example.
For S0235, when the first loss function adopts a cross entropy loss function, the method of inputting the first polyphonic phoneme prediction value and the first polyphonic phoneme calibration value into the first loss function for calculation may be selected from the prior art, and is not described herein again.
For S0236, repeating steps S0231 to S0236 until the first loss value reaches a first convergence condition or the number of iterations reaches the number of training samples in the first set of training samples.
The first convergence condition means that the first loss values calculated in two consecutive iterations satisfy the Lipschitz condition (Lipschitz continuity condition).
The iteration number refers to the number of times the initial model is used for calculating the first sample phoneme predicted value, that is, the iteration number is increased by 1 when the first sample phoneme predicted value is calculated once.
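The following is a hedged PyTorch-style sketch of a single polyphone mask training step (S0232-S0235): the input ids are assumed to already carry the mask at the polyphone position, the model predicts a pinyin label for every character, and the cross-entropy loss is computed only at the polyphone positions before the parameters are updated. Model construction, the tensor layout and the label-to-id mapping are assumptions for illustration.

```python
import torch.nn.functional as F

def mask_training_step(model, optimizer, input_ids, label_ids, polyphone_positions):
    logits = model(input_ids=input_ids).logits   # (1, seq_len, num_labels)
    pred = logits[0, polyphone_positions]         # first polyphone phoneme predicted values
    target = label_ids[0, polyphone_positions]    # first polyphone phoneme calibration values
    loss = F.cross_entropy(pred, target)          # first loss function: cross entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # update parameters with the first loss value
    return loss.item()
```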
In an embodiment, the step of performing polyphone random pinyin substitution training on the initial model after the polyphone mask training by using the second training sample set, and obtaining the initial model after the polyphone random pinyin substitution training after the training is finished includes:
s0241: obtaining a training sample from the second training sample set to obtain a second training sample;
s0242: inputting the text sample data of the second training sample into the initial model after the polyphonic mask training for single character pinyin prediction, and randomly replacing single character pinyin corresponding to polyphonic characters by pinyin and predicting polyphonic characters based on polyphonic character marks of the phoneme calibration data of the second training sample to obtain a second sample phoneme prediction value;
s0243: extracting a phoneme predicted value of a polyphone from the phoneme predicted value of the second sample based on the polyphone label of the phoneme calibration data of the second training sample to obtain a phoneme predicted value of the second polyphone;
s0244: extracting phoneme calibration values of polyphones from the phoneme calibration data of the second training sample based on the polyphone labels of the phoneme calibration data of the second training sample to obtain second polyphone phoneme calibration values;
s0245: inputting the second polyphonic phoneme predicted value and the second polyphonic phoneme calibration value into a second loss function for calculation to obtain a second loss value of the initial model after the polyphonic mask training, updating parameters of the initial model after the polyphonic mask training according to the second loss value, and using the updated initial model after the polyphonic mask training for calculating the second sample phoneme predicted value next time;
s0246: repeatedly executing the step of obtaining a training sample from the second training sample set to obtain a second training sample until the second loss value reaches a second convergence condition or the number of iterations reaches the number of the training samples in the second training sample set, and determining the initial model after the polyphone mask training, in which the second loss value reaches the second convergence condition or the number of iterations reaches the number of the training samples in the second training sample set, as the initial model after the polyphone random pinyin substitution training;
wherein the second loss function is a cross-entropy loss function.
In this application, polyphone random pinyin substitution training is performed on the initial model after the polyphone mask training, which improves the model's ability to handle abnormal input and thus improves the robustness of the model.
For S0241, one training sample is sequentially obtained from the second training sample set, and the obtained training sample is used as the second training sample.
For S0242, the text sample data of the second training sample is input into the initial model after the polyphone mask training for phoneme prediction in the following order: single-character pinyin prediction, random pinyin replacement of the single-character pinyin corresponding to the polyphones, and polyphone phoneme prediction, where the random replacement of the single-character pinyin corresponding to the polyphones is based on the polyphone marks of the phoneme calibration data of the second training sample. For example, the text sample data of the second training sample is "length, width and height", single-character pinyin prediction on it yields "[CLS] zhang3 kuan1 gao1", the phoneme calibration data of the second training sample is "chang2# kuan1 gao1", and the single-character pinyin corresponding to the polyphone is randomly replaced based on the polyphone mark of the phoneme calibration data of the second training sample to obtain "[CLS] jian3 kuan1 gao1", that is, the pinyin at the polyphone position is replaced with the random pinyin "jian3"; this example is not specifically limiting.
For S0243, based on the position data of the polyphone marks of the phoneme calibration data of the second training sample, the phoneme predicted value of the polyphone is extracted from the second sample phoneme predicted value, and the extracted data is taken as the second polyphone phoneme predicted value. For example, the second sample phoneme predicted value is "[CLS] chang2 kuan1 gao1", the phoneme calibration data of the second training sample is "chang2# kuan1 gao1", and the position data of the polyphone mark of the phoneme calibration data of the second training sample is the 1st character; in this case, "chang2" in the second sample phoneme predicted value "[CLS] chang2 kuan1 gao1" may be extracted as the second polyphone phoneme predicted value, which is not limited by this example.
For S0244, based on the polyphone marks of the phoneme calibration data of the second training sample, the phoneme calibration value of the polyphone is extracted from the phoneme calibration data of the second training sample, and the extracted data is taken as the second polyphone phoneme calibration value. For example, the phoneme calibration data of the second training sample is "chang2# kuan1 gao1", where "#" is a polyphone mark, and the phoneme "chang2" marked by the polyphone mark is taken as the second polyphone phoneme calibration value, which is not specifically limited by this example.
For S0245, when the second loss function adopts a cross entropy loss function, the method of inputting the predicted value of the second polyphonic phoneme and the calibrated value of the second polyphonic phoneme into the second loss function for calculation may be selected from the prior art, and is not described herein again.
For S0246, repeating steps S0241 to S0246 until the second loss value reaches a second convergence condition or the number of iterations reaches the number of training samples in the second set of training samples.
The second convergence condition means that the second loss values calculated in two consecutive iterations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations refers to the number of times the initial model after the polyphone mask training is used to calculate the second sample phoneme predicted value; that is, the number of iterations is increased by 1 each time the second sample phoneme predicted value is calculated.
In an embodiment, the step of training the initial model after the random pinyin substitution training for the polyphone by using the third training sample set and obtaining the phoneme prediction model after the training is finished includes:
s0251: obtaining a training sample from the third training sample set to obtain a third training sample;
s0252: inputting the text sample data of the third training sample into the initial model after the polyphone random pinyin alternative training for phoneme prediction to obtain a third sample phoneme prediction value;
s0253: extracting a phoneme predicted value of a polyphone from the phoneme predicted value of the third sample based on a polyphone label of the phoneme calibration data of the third training sample to obtain a third polyphone phoneme predicted value;
s0254: extracting phoneme calibration values of polyphones from the phoneme calibration data of the third training sample based on the polyphone labels of the phoneme calibration data of the third training sample to obtain third polyphone phoneme calibration values;
s0255: inputting the third polyphone phoneme predicted value and the third polyphone phoneme calibration value into a third loss function for calculation to obtain a third loss value of the initial model after the polyphone random pinyin substitution training, updating the parameter of the initial model after the polyphone random pinyin substitution training according to the third loss value, and using the updated initial model after the polyphone random pinyin substitution training for calculating the third sample phoneme predicted value next time;
s0256: repeatedly executing the step of obtaining a training sample from the third training sample set to obtain a third training sample until the third loss value reaches a third convergence condition or the number of iterations reaches the number of the training samples in the third training sample set, and determining the initial model after the polyphone random pinyin substitution training for which the third loss value reaches the third convergence condition or the number of iterations reaches the number of the training samples in the third training sample set as the phoneme prediction model;
wherein the third loss function is a cross-entropy loss function.
In this embodiment, the initial model after the polyphone random pinyin substitution training is further trained without applying masking or substitution to the single-character pinyin corresponding to the polyphones during phoneme prediction, thereby improving the accuracy of model prediction.
And for S0251, sequentially obtaining a training sample from the third training sample set, and taking the obtained training sample as a third training sample.
And for S0252, the text sample data of the third training sample is input into the initial model after the polyphone random pinyin substitution training for phoneme prediction. For example, the text sample data of the third training sample is "has grown taller"; single-character pinyin prediction on it yields "[CLS] zhang3 gao1 le5", and the phoneme calibration data of the third training sample is "zhang3# gao1 le5"; this example is not specifically limiting.
For S0253, based on the position data of the polyphone marks of the phoneme calibration data of the third training sample, the phoneme predicted value of the polyphone is extracted from the third sample phoneme predicted value, and the extracted data is taken as the third polyphone phoneme predicted value. For example, the third sample phoneme predicted value is "[CLS] zhang3 gao1 le5", the phoneme calibration data of the third training sample is "zhang3# gao1 le5", and the position data of the polyphone mark of the phoneme calibration data of the third training sample is the 1st character; in this case, "zhang3" in the third sample phoneme predicted value "[CLS] zhang3 gao1 le5" may be extracted as the third polyphone phoneme predicted value, which is not limited by this example.
For S0254, based on the polyphone marks of the phoneme calibration data of the third training sample, the phoneme calibration value of the polyphone is extracted from the phoneme calibration data of the third training sample, and the extracted data is taken as the third polyphone phoneme calibration value. For example, the phoneme calibration data of the third training sample is "zhang3# gao1 le5", where "#" is a polyphone mark, and the phoneme "zhang3" marked by the polyphone mark is taken as the third polyphone phoneme calibration value, which is not specifically limited by this example.
For S0255, when the third loss function adopts a cross entropy loss function, the method of inputting the third polyphonic phoneme prediction value and the third polyphonic phoneme calibration value into the third loss function for calculation may be selected from the prior art, and is not described herein again.
For S0256, repeating steps S0251-S0256 until the third loss value reaches a third convergence condition or the number of iterations reaches the number of training samples in the third set of training samples.
The third convergence condition means that the magnitudes of the third loss values calculated in two adjacent iterations satisfy the Lipschitz condition (Lipschitz continuity condition).
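One simple reading of this stopping rule, offered only as an assumption-laden sketch, is to compare two adjacent third loss values against a small tolerance; the tolerance value is illustrative and not specified by this application.

from typing import Optional

def third_convergence_reached(prev_loss: Optional[float], curr_loss: float, tol: float = 1e-4) -> bool:
    # Stop once two adjacent third loss values are sufficiently close to each other.
    return prev_loss is not None and abs(curr_loss - prev_loss) < tol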
The number of iterations reaching the number of the training samples in the third training sample set refers to the number of times the initial model after the polyphone random pinyin substitution training has been used to calculate the third sample phoneme predicted value; that is, each time the third sample phoneme predicted value is calculated, the number of iterations is increased by 1.
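Putting steps S0251 to S0256 together, a minimal training-loop sketch is shown below; it reuses the hedged helpers sketched above, and model, third_samples, optimizer, pinyin_vocab and select_polyphone_logits are assumed placeholders rather than names used in this application.

def train_third_stage(model, third_samples, optimizer, pinyin_vocab):
    """One pass over the third training sample set, stopping early once the
    third convergence condition is met (a sketch under the stated assumptions)."""
    prev_loss = None
    for text_sample, calibration in third_samples:  # S0251: obtain the next third training sample
        logits = model(text_sample)  # S0252: third sample phoneme prediction
        polyphone_logits, calibrated = select_polyphone_logits(logits, calibration)  # S0253/S0254, hypothetical helper
        loss = polyphone_cross_entropy(polyphone_logits, calibrated, pinyin_vocab)  # S0255: third loss value

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # the updated model is used for the next iteration

        if third_convergence_reached(prev_loss, loss.item()):
            break  # third convergence condition reached
        prev_loss = loss.item()
        # otherwise the loop ends when the number of iterations equals the number
        # of training samples in the third training sample set
    return model  # taken as the phoneme prediction model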
With reference to fig. 2, the present application also proposes a phoneme prediction apparatus for polyphones, the apparatus comprising:
a data obtaining module 100, configured to obtain text data to be predicted;
the preprocessing module 200 is configured to perform sentence structure analysis, text regularization processing, word segmentation processing, and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
a phoneme prediction module 300, configured to input the preprocessed text data into a phoneme prediction model for phoneme prediction, where the phoneme prediction model is a model obtained by training based on a Bert model, multiple training samples containing polyphones, and an MLM model training method, and the MLM model training method includes: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method;
and a target phoneme prediction result determining module 400, configured to obtain a target phoneme prediction result output by the phoneme prediction model.
In the embodiment, the preprocessed text data is obtained by performing sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted, so that the accuracy of phoneme prediction is improved; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language modeling) model training method, and the MLM model training method comprises the following steps: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method; therefore, by means of the excellent context semantic recognition capability of the Bert model, a word list or a rule does not need to be manually added, and the coverage and intelligence of polyphone phoneme prediction are improved; and a plurality of training samples containing polyphones and an MLM model training method are adopted for training to obtain a phoneme prediction model, so that the prediction capability of phonemes of polyphones in a complex context is improved, the maintenance cost is reduced, and the coverage and accuracy of polyphone phoneme prediction are improved.
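To make the data flow between the four modules concrete, a small wiring sketch follows; every function name is a hypothetical stand-in for the corresponding module described above, not an interface defined by this application.

def predict_target_phonemes(raw_text: str):
    text_to_predict = acquire_text_data(raw_text)  # data obtaining module 100 (hypothetical name)
    preprocessed = preprocess_text(text_to_predict)  # preprocessing module 200: sentence structure analysis,
    # text regularization, word segmentation and part-of-speech prediction
    prediction = phoneme_prediction_model(preprocessed)  # phoneme prediction module 300 (Bert-based model)
    return get_target_result(prediction)  # target phoneme prediction result determining module 400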
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data related to the phoneme prediction method suitable for polyphones. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a phoneme prediction method suitable for polyphones. The phoneme prediction method suitable for polyphones comprises the following steps: acquiring text data to be predicted; carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language modeling) model training method, and the MLM model training method comprises the following steps: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method; and acquiring a target phoneme prediction result output by the phoneme prediction model.
In the embodiment, the preprocessed text data is obtained by performing sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted, so that the accuracy of phoneme prediction is improved; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language modeling) model training method, and the MLM model training method comprises the following steps: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method; therefore, by means of the excellent context semantic recognition capability of the Bert model, a word list or a rule does not need to be manually added, and the coverage and intelligence of polyphone phoneme prediction are improved; and a plurality of training samples containing polyphones and an MLM model training method are adopted for training to obtain a phoneme prediction model, so that the prediction capability of phonemes of polyphones in a complex context is improved, the maintenance cost is reduced, and the coverage and accuracy of polyphone phoneme prediction are improved.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a phoneme prediction method for polyphones, comprising the steps of: acquiring text data to be predicted; carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language modeling) model training method, and the MLM model training method comprises the following steps: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method; and acquiring a target phoneme prediction result output by the phoneme prediction model.
In the phoneme prediction method suitable for polyphones executed as above, the preprocessed text data is obtained by performing sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted, so that the accuracy of phoneme prediction is improved; inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language modeling) model training method, and the MLM model training method comprises the following steps: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method; therefore, by means of the excellent context semantic recognition capability of the Bert model, a word list or a rule does not need to be manually added, and the coverage and intelligence of polyphone phoneme prediction are improved; and a plurality of training samples containing polyphones and an MLM model training method are adopted for training to obtain a phoneme prediction model, so that the prediction capability of phonemes of polyphones in a complex context is improved, the maintenance cost is reduced, and the coverage and accuracy of polyphone phoneme prediction are improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing related hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A phoneme prediction method for polyphones, the method comprising:
acquiring text data to be predicted;
carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, wherein the phoneme prediction model is a model obtained by training based on a Bert model, a plurality of training samples containing polyphones and an MLM (masked language modeling) model training method, and the MLM model training method comprises the following steps: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method;
and acquiring a target phoneme prediction result output by the phoneme prediction model.
2. The method for predicting phonemes for polyphones according to claim 1, wherein before the step of inputting the preprocessed text data into a phoneme prediction model for phoneme prediction, the method further comprises:
obtaining the plurality of training samples, wherein each training sample of the plurality of training samples comprises: text sample data and phoneme calibration data, wherein the text sample data comprises at least one polyphone;
dividing the training samples by adopting a preset dividing rule to obtain a first training sample set, a second training sample set and a third training sample set;
performing polyphonic mask training on an initial model by using the first training sample set, and obtaining the initial model after polyphonic mask training after the training is finished, wherein the initial model is a model obtained based on the Bert model;
performing polyphone random pinyin substitution training on the initial model after the polyphone mask training by adopting the second training sample set, and obtaining the initial model after the polyphone random pinyin substitution training after the training is finished;
and training the initial model after the polyphone random pinyin substitution training by adopting the third training sample set, and obtaining the phoneme prediction model after the training is finished.
3. The method of claim 2, wherein the step of obtaining the training samples comprises:
acquiring a plurality of text data to be processed, wherein each text data to be processed in the plurality of text data to be processed comprises at least one polyphone;
respectively carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on each text data to be processed in the plurality of text data to be processed to obtain the text sample data corresponding to the plurality of text data to be processed;
and performing pinyin calibration on each text sample data corresponding to the plurality of text data to be processed respectively to obtain the phoneme calibration data corresponding to the text sample data corresponding to each text data to be processed, wherein the phoneme calibration data carries a polyphone mark.
4. The method for predicting phonemes suitable for polyphones according to claim 2, wherein the step of dividing the plurality of training samples by using a preset dividing rule to obtain a first training sample set, a second training sample set, and a third training sample set includes:
and dividing the plurality of training samples by adopting a division rule of 8:1:1 to obtain the first training sample set, the second training sample set and the third training sample set.
5. The method for predicting phonemes for polyphones according to claim 2, wherein the step of performing polyphone mask training on the initial model by using the first training sample set and obtaining the initial model after the polyphone mask training after the training is completed comprises:
obtaining a training sample from the first training sample set to obtain a first training sample;
inputting the text sample data of the first training sample into the initial model for single character pinyin prediction, masking the single character pinyin corresponding to the polyphone based on the polyphone marks of the phoneme calibration data of the first training sample, and performing polyphone phoneme prediction to obtain a first sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the first sample phoneme predicted value based on the polyphone label of the phoneme calibration data of the first training sample to obtain a first polyphone phoneme predicted value;
extracting phoneme calibration values of polyphones from the phoneme calibration data of the first training sample based on the polyphone labels of the phoneme calibration data of the first training sample to obtain first polyphone phoneme calibration values;
inputting the first polyphone phoneme predicted value and the first polyphone phoneme calibration value into a first loss function for calculation to obtain a first loss value of the initial model, updating parameters of the initial model according to the first loss value, and using the updated initial model for calculating the first sample phoneme predicted value next time;
repeatedly executing the step of obtaining a training sample from the first training sample set to obtain a first training sample until the first loss value reaches a first convergence condition or the number of iterations reaches the number of the training samples in the first training sample set, and determining the initial model with the first loss value reaching the first convergence condition or the number of iterations reaching the number of the training samples in the first training sample set as the initial model after the polyphonic mask training;
wherein the first loss function is a cross entropy loss function.
6. The method for predicting phonemes suitable for polyphones according to claim 2, wherein the step of performing polyphone random pinyin replacement training on the initial model after the polyphone mask training by using the second training sample set and obtaining the initial model after the polyphone random pinyin replacement training after the training is finished includes:
obtaining a training sample from the second training sample set to obtain a second training sample;
inputting the text sample data of the second training sample into the initial model after the polyphone mask training for single character pinyin prediction, randomly replacing the single character pinyin corresponding to the polyphone with a pinyin based on the polyphone marks of the phoneme calibration data of the second training sample, and performing polyphone phoneme prediction to obtain a second sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the phoneme predicted value of the second sample based on the polyphone label of the phoneme calibration data of the second training sample to obtain a phoneme predicted value of the second polyphone;
extracting phoneme calibration values of polyphones from the phoneme calibration data of the second training sample based on the polyphone labels of the phoneme calibration data of the second training sample to obtain second polyphone phoneme calibration values;
inputting the second polyphonic phoneme predicted value and the second polyphonic phoneme calibration value into a second loss function for calculation to obtain a second loss value of the initial model after the polyphonic mask training, updating parameters of the initial model after the polyphonic mask training according to the second loss value, and using the updated initial model after the polyphonic mask training for calculating the second sample phoneme predicted value next time;
repeatedly executing the step of obtaining a training sample from the second training sample set to obtain a second training sample until the second loss value reaches a second convergence condition or the number of iterations reaches the number of the training samples in the second training sample set, and determining the initial model after the polyphone mask training, in which the second loss value reaches the second convergence condition or the number of iterations reaches the number of the training samples in the second training sample set, as the initial model after the polyphone random pinyin substitution training;
wherein the second loss function is a cross-entropy loss function.
7. The method as claimed in claim 2, wherein the step of training the initial model after the random pinyin substitution training for the polyphone by using the third training sample set and obtaining the phoneme prediction model after the training is finished comprises:
obtaining a training sample from the third training sample set to obtain a third training sample;
inputting the text sample data of the third training sample into the initial model after the polyphone random pinyin alternative training for phoneme prediction to obtain a third sample phoneme prediction value;
extracting a phoneme predicted value of a polyphone from the phoneme predicted value of the third sample based on a polyphone label of the phoneme calibration data of the third training sample to obtain a third polyphone phoneme predicted value;
extracting phoneme calibration values of polyphones from the phoneme calibration data of the third training sample based on the polyphone labels of the phoneme calibration data of the third training sample to obtain third polyphone phoneme calibration values;
inputting the third polyphone phoneme predicted value and the third polyphone phoneme calibration value into a third loss function for calculation to obtain a third loss value of the initial model after the polyphone random pinyin substitution training, updating the parameter of the initial model after the polyphone random pinyin substitution training according to the third loss value, and using the updated initial model after the polyphone random pinyin substitution training for calculating the third sample phoneme predicted value next time;
repeatedly executing the step of obtaining a training sample from the third training sample set to obtain a third training sample until the third loss value reaches a third convergence condition or the number of iterations reaches the number of the training samples in the third training sample set, and determining the initial model after the polyphone random pinyin substitution training, in which the third loss value reaches the third convergence condition or the number of iterations reaches the number of the training samples in the third training sample set, as the phoneme prediction model;
wherein the third loss function is a cross-entropy loss function.
8. A phoneme prediction apparatus adapted for polyphones, the apparatus comprising:
the data acquisition module is used for acquiring text data to be predicted;
the preprocessing module is used for carrying out sentence structure analysis, text regularization processing, word segmentation processing and part-of-speech prediction on the text data to be predicted to obtain preprocessed text data;
a phoneme prediction module, configured to input the preprocessed text data into a phoneme prediction model for phoneme prediction, where the phoneme prediction model is a model obtained by training based on a Bert model, multiple training samples containing polyphones, and an MLM model training method, and the MLM model training method includes: a polyphone mask training method, a polyphone random pinyin substitution training method and a non-mask and non-shielding training method;
and the target phoneme prediction result determining module is used for acquiring a target phoneme prediction result output by the phoneme prediction model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110342957.4A 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones Active CN112800748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110342957.4A CN112800748B (en) 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110342957.4A CN112800748B (en) 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones

Publications (2)

Publication Number Publication Date
CN112800748A true CN112800748A (en) 2021-05-14
CN112800748B CN112800748B (en) 2023-05-12

Family

ID=75816027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110342957.4A Active CN112800748B (en) 2021-03-30 2021-03-30 Phoneme prediction method, device, equipment and storage medium suitable for polyphones

Country Status (1)

Country Link
CN (1) CN112800748B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
US20220238098A1 (en) * 2019-04-29 2022-07-28 Jingdong Digits Technology Holding Co., Ltd. Voice recognition method and device
CN110807331A (en) * 2019-10-24 2020-02-18 百度在线网络技术(北京)有限公司 Polyphone pronunciation prediction method and device and electronic equipment
CN111144110A (en) * 2019-12-27 2020-05-12 科大讯飞股份有限公司 Pinyin marking method, device, server and storage medium
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN112084337A (en) * 2020-09-17 2020-12-15 腾讯科技(深圳)有限公司 Training method of text classification model, and text classification method and equipment
CN111967260A (en) * 2020-10-20 2020-11-20 北京金山数字娱乐科技有限公司 Polyphone processing method and device and model training method and device
CN112348073A (en) * 2020-10-30 2021-02-09 北京达佳互联信息技术有限公司 Polyphone recognition method and device, electronic equipment and storage medium
CN112417877A (en) * 2020-11-24 2021-02-26 广州平云信息科技有限公司 Text inclusion relation recognition method based on improved BERT
CN112365876A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Method, device and equipment for training speech synthesis model and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417832A (en) * 2021-12-08 2022-04-29 马上消费金融股份有限公司 Disambiguation method, and training method and device of disambiguation model
CN114386391A (en) * 2022-01-11 2022-04-22 平安科技(深圳)有限公司 Sentence vector feature extraction method, device, equipment and medium based on artificial intelligence
CN114386391B (en) * 2022-01-11 2023-08-15 平安科技(深圳)有限公司 Sentence vector feature extraction method, device, equipment and medium based on artificial intelligence

Also Published As

Publication number Publication date
CN112800748B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
Klejch et al. Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features
KR100277694B1 (en) Automatic Pronunciation Dictionary Generation in Speech Recognition System
EP1593049B1 (en) System for predicting speech recognition accuracy and development for a dialog system
US6823493B2 (en) Word recognition consistency check and error correction system and method
US20040039570A1 (en) Method and system for multilingual voice recognition
EP0387602A2 (en) Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system
CN112800748B (en) Phoneme prediction method, device, equipment and storage medium suitable for polyphones
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN113178188B (en) Speech synthesis method, device, equipment and storage medium
CN112016319A (en) Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN113270103A (en) Intelligent voice dialogue method, device, equipment and medium based on semantic enhancement
CN111223476A (en) Method and device for extracting voice feature vector, computer equipment and storage medium
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
Kłosowski Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling
KR102531114B1 (en) Context sensitive spelling error correction system or method using masked language model
CN114416984A (en) Text classification method, device and equipment based on artificial intelligence and storage medium
CN113221552A (en) Multi-model word segmentation method and device based on deep learning and electronic equipment
KR20220021836A (en) Context sensitive spelling error correction system or method using Autoregressive language model
US11893344B2 (en) Morpheme analysis learning device, morpheme analysis device, method, and program
Wang et al. Joint alignment learning-attention based model for grapheme-to-phoneme conversion
Sundermeyer Improvements in language and translation modeling
CN112668324A (en) Corpus data processing method and device, electronic equipment and storage medium
CN112084769B (en) Dependency syntax model optimization method, apparatus, device and readable storage medium
Saychum et al. A great reduction of wer by syllable toneme prediction for thai grapheme to phoneme conversion
Zheng et al. Grapheme-to-phoneme conversion based on a fast TBL algorithm in mandarin TTS systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40046361

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant