CN114742044A - Information processing method and device and electronic equipment - Google Patents

Information processing method and device and electronic equipment Download PDF

Info

Publication number
CN114742044A
Authority
CN
China
Prior art keywords
text
pronunciation
target
training
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210268350.0A
Other languages
Chinese (zh)
Inventor
马思凡
郭莉莉
赵泽清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202210268350.0A priority Critical patent/CN114742044A/en
Publication of CN114742044A publication Critical patent/CN114742044A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides an information processing method and apparatus, and an electronic device. A text to be predicted includes one or more polyphonic characters, each having at least two pronunciations. With each polyphonic character taken as a target word in the text to be predicted, at least two groups of pronunciation text sets corresponding to the at least two pronunciations are generated based on the target word and its at least two pronunciations, and the pronunciation of the target word in the text to be predicted is then determined based on the generated pronunciation texts and the text to be predicted. In this scheme, the pronunciation of a polyphonic character in the text to be predicted is selected by comparing that text against pronunciation texts generated from the polyphonic character, which avoids the influence of imbalanced training corpora and gives higher accuracy in predicting polyphone pronunciations.

Description

Information processing method and device and electronic equipment
Technical Field
The present application relates to the field of information technologies, and in particular, to an information processing method and apparatus, and an electronic device.
Background
The pronunciation prediction of Chinese polyphonic characters aims to correctly predict the pronunciation of a polyphonic character in a text. With this information, Text-To-Speech (TTS) synthesis can produce speech that is more natural and closer to a human voice, so polyphone pronunciation prediction is an important component of speech synthesis technology.
In the prior art, context-dependent polyphone prediction models such as BiLSTM (Bi-directional Long Short-Term Memory) networks are trained. Although such models use context information, the available polyphone data are often highly imbalanced: corpora for common pronunciations are plentiful while corpora for rare pronunciations are scarce. Because BiLSTM networks are sensitive to data imbalance, these prediction models have low accuracy when predicting rare pronunciations.
Disclosure of Invention
In view of this, the present application provides the following technical solutions.
An information processing method, comprising:
acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the target word and the at least two pronunciations, wherein each group of pronunciation text set corresponds to one pronunciation, and each group of pronunciation text comprises at least one pronunciation text;
and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Optionally, in the foregoing method, at least two sets of pronunciation text sets corresponding to at least two pronunciations are generated based on at least the target word and the at least two pronunciations, and the method includes:
and generating at least two groups of pronunciation text sets corresponding to the at least two pronunciations based on the text to be predicted, the target word and the at least two pronunciations, wherein the contents of the pronunciation texts in the at least two groups of pronunciation text sets are related to the contents of the text to be predicted.
Optionally, in the foregoing method, generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the text to be predicted, the target word, and the at least two pronunciations includes:
generating a first vector based on the text to be predicted, wherein the first vector represents the content contained in the text to be predicted;
and generating at least two groups of pronunciation text sets corresponding to the at least two pronunciations based on the first vector, the target word and the at least two pronunciations, wherein each group of pronunciation text set corresponds to one pronunciation, each group of pronunciation text comprises at least one pronunciation text, and the content of each pronunciation text is related to the content of the text to be predicted.
Optionally, the method of any one of the above embodiments, determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two sets of pronunciation text sets and the text to be predicted, includes:
correspondingly generating a vector for each pronunciation text in each group of pronunciation text set, wherein the vector generated by each pronunciation text represents the content contained in the pronunciation text;
processing vectors generated by pronunciation texts belonging to the same group of pronunciation text sets to obtain mean vectors corresponding to the pronunciation text sets;
determining the similarity between the pronunciation texts corresponding to the at least two pronunciation text sets and the text to be predicted, based on the mean vectors corresponding to the at least two pronunciation text sets and the first vector generated from the text to be predicted;
and determining a first pronunciation as the pronunciation of a target word in the text to be predicted based on the similarity, wherein the pronunciation text in the pronunciation text set corresponding to the first pronunciation meets the similar condition with the text to be predicted.
Optionally, in the method, the generating at least two sets of pronunciation text sets corresponding to at least two pronunciations based on the first vector, the target word, and the at least two pronunciations includes:
and generating the at least two groups of pronunciation text sets based on the target generator processing the first vector, the target word and the at least two pronunciations.
Optionally, in the foregoing method, before the obtaining of the target word in the text to be predicted, the method further includes:
training an original learning model based on a training text set, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model, wherein the target learning model comprises a target generator and a target discriminator, the training text set comprises the target polyphone and a marked pronunciation, and the marked pronunciation is the pronunciation annotated for the target polyphone in the training text.
Optionally, in the method, the training an original learning model based on a training text, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model includes:
acquiring a training text set, wherein the training text set comprises at least two groups of training texts, and each group of training texts corresponds to one marked pronunciation;
generating a training vector based on the training text, wherein the training vector represents content contained in the training text;
inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to obtain a target generator so that the target generator generates a first text;
and taking the first text and any training text as input contents of an original discriminator, so that the original discriminator judges whether the first text and the training text are real, until a first agreed condition for stopping training is met and a target learning model is obtained, wherein the target learning model comprises a target generator and a target discriminator.
Optionally, in the method, the inputting the training vector, the target polyphone and the labeled pronunciation as input conditions into an original generator to obtain a target generator includes:
inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to generate a second text;
adding a label judged to be false to the second text, taking the labeled second text as a training text, and returning to the step of generating a training vector based on the training text until a second agreed condition for stopping training is met, thereby obtaining a target generator;
and inputting the training vector, the target polyphone and the marked pronunciation as input conditions into a target generator so that the target generator generates a first text.
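The adversarial training described above resembles a conditional generator-discriminator setup. The sketch below shows the loop structure only: the model internals, update rules and stopping conditions are placeholders and not the disclosed implementation, and `encode`, `generator` and `discriminator` are assumed callables supplied by the caller.

```python
import random

def train(generator, discriminator, training_texts, polyphone, pronunciations,
          encode, stop_condition):
    """Structural sketch of the training loop described in the claims:
    encode(text) -> training vector; generator(vec, char, pron) -> generated
    (first) text; discriminator(text) -> probability that the text is real."""
    step = 0
    while not stop_condition(step):
        text, marked_pron = random.choice(training_texts)
        vec = encode(text)                              # training vector
        fake = generator(vec, polyphone, marked_pron)   # first text
        real_score = discriminator(text)
        fake_score = discriminator(fake)
        # A real implementation would here update the discriminator to raise
        # real_score and lower fake_score, and update the generator to fool
        # the discriminator, until the agreed stopping condition is met.
        step += 1
    return generator, discriminator
```

The trained generator is then reused in prediction to produce the pronunciation text sets.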
An information processing apparatus comprising:
an acquisition module, configured to acquire a target word in a text to be predicted, wherein the target word has at least two pronunciations;
the generating module is used for generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the target word and the at least two pronunciations, wherein each group of pronunciation text set corresponds to one pronunciation, and each group of pronunciation text comprises at least one pronunciation text;
and the determining module is used for determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
An electronic device, comprising: a memory, a processor;
wherein, the memory stores a processing program;
the processor is configured to load and execute the processing program stored in the memory to implement the steps of the information processing method according to any one of the above.
The foregoing technical solutions provide an information processing method. A text to be predicted includes one or more polyphonic characters, each having at least two pronunciations. With each polyphonic character taken as a target word in the text to be predicted, at least two groups of pronunciation text sets corresponding to the at least two pronunciations are generated based on the target word and its at least two pronunciations, and the pronunciation of the target word in the text to be predicted is then determined based on the generated pronunciation texts and the text to be predicted. In this scheme, the pronunciation of a polyphonic character in the text to be predicted is selected by comparing that text against pronunciation texts generated from the polyphonic character, which avoids the influence of imbalanced training corpora and gives higher accuracy in predicting polyphone pronunciations.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of Embodiment 1 of an information processing method provided by the present application;
Fig. 2 is a flowchart of Embodiment 2 of an information processing method provided by the present application;
Fig. 3 is a flowchart of Embodiment 3 of an information processing method provided by the present application;
Fig. 4 is a flowchart of Embodiment 4 of an information processing method provided by the present application;
Fig. 5 is a flowchart of Embodiment 5 of an information processing method provided by the present application;
Fig. 6 is a schematic diagram of a learning model in Embodiment 5 of an information processing method provided by the present application;
Fig. 7 is a flowchart of Embodiment 6 of an information processing method provided by the present application;
Fig. 8 is a flowchart of Embodiment 7 of an information processing method provided by the present application;
Fig. 9 is a schematic structural diagram of an embodiment of an information processing apparatus provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in Fig. 1, which is a flowchart of Embodiment 1 of an information processing method provided by the present application, the method is applied to an electronic device and includes the following steps:
step S101: acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
the text to be predicted comprises one or more words, wherein the words comprise at least one target word, and the target word has two or more pronunciations.
For example, the text to be predicted contains the character "还", which has two pronunciations, "huán" and "hái".
The possible pronunciations of the target word in the text to be predicted are known in advance; the scheme determines which of these pronunciations the target word takes in the text to be predicted.
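Step S101 can be illustrated as a dictionary lookup. The sketch below is a minimal illustration; the small polyphone dictionary is an assumption made for the example, not part of the disclosure.

```python
# Tiny illustrative polyphone dictionary: characters with multiple readings.
POLYPHONE_DICT = {
    "还": ["huán", "hái"],
    "行": ["xíng", "háng"],
}

def find_target_words(text):
    """Return (position, character, candidate pronunciations) for every
    polyphonic character found in the text to be predicted."""
    return [(i, ch, POLYPHONE_DICT[ch])
            for i, ch in enumerate(text)
            if ch in POLYPHONE_DICT]

print(find_target_words("他还没走"))  # [(1, '还', ['huán', 'hái'])]
```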
Step S102: generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the target word and the at least two pronunciations;
each group of pronunciation text set corresponds to a pronunciation, and each group of pronunciation text includes at least one pronunciation text.
And generating a plurality of groups of pronunciation text sets based on the target character and the pronunciations of the target character, wherein each group of pronunciation text set corresponds to one pronunciation.
For example, the target word "还" has two pronunciations, and two text sets corresponding to the two pronunciations are generated: the set corresponding to the pronunciation "huán" contains, say, three pronunciation texts, each containing "还" read as "huán", and the set corresponding to the pronunciation "hái" contains, say, three pronunciation texts, each containing "还" read as "hái".
Each pronunciation text contains the target word. The number of characters in a pronunciation text is arbitrary and may be the same as or different from that of the text to be predicted; the position of the target word in a pronunciation text is likewise arbitrary and may be the same as or different from its position in the text to be predicted.
Step S103: and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
And respectively determining the similarity between each group of pronunciation text set and the text to be predicted, and selecting the pronunciation corresponding to the text set with the highest similarity as the pronunciation of the target character in the text to be predicted.
Specifically, each group of pronunciation text set is taken as a whole, the similarity between the whole and the text to be predicted is calculated, the pronunciation text set corresponding to the maximum value in the calculated similarities is selected, and the pronunciation corresponding to the pronunciation text set is determined to be the pronunciation of the target word in the text to be predicted.
It should be noted that the higher the similarity between a pronunciation text set and the text to be predicted, the more similar the pronunciation texts in that set are to the text to be predicted, and the more likely it is that the target character has the same pronunciation in both. The present application therefore determines the pronunciation of the target character from the similarity between pronunciation text sets and the text to be predicted; this approach is not affected by training-corpus imbalance, so the accuracy of polyphone pronunciation prediction is higher.
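Steps S102 and S103 taken together can be sketched as follows, assuming sentence-level vectors are already available (e.g. from a word2vec-style encoder) and using cosine similarity as one plausible similarity measure; the vectors and names are illustrative only.

```python
import numpy as np

def select_pronunciation(first_vec, pron_sets):
    """pron_sets maps each pronunciation to the sentence vectors of its
    pronunciation texts; return the pronunciation whose set's mean vector
    is most similar (by cosine) to the vector of the text to be predicted."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = {pron: cosine(first_vec, np.mean(vecs, axis=0))
            for pron, vecs in pron_sets.items()}
    return max(sims, key=sims.get)

# Toy 2-d vectors standing in for sentence embeddings.
first = np.array([1.0, 0.0])
sets = {
    "huán": [np.array([0.9, 0.1]), np.array([1.0, 0.2])],
    "hái":  [np.array([0.0, 1.0]), np.array([0.1, 0.9])],
}
print(select_pronunciation(first, sets))  # huán
```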
In summary, in the information processing method provided in this embodiment, the text to be predicted includes one or more polyphonic characters, each having at least two pronunciations. With each polyphonic character taken as a target word, at least two groups of pronunciation text sets corresponding to the at least two pronunciations are generated based on the target word and its pronunciations, and the pronunciation of the target word in the text to be predicted is then determined from the generated pronunciation texts and the text to be predicted. Because the pronunciation is selected by comparing the text to be predicted with pronunciation texts generated from the polyphonic character, the scheme avoids the influence of imbalanced training corpora and achieves higher accuracy in predicting polyphone pronunciations.
As shown in fig. 2, a flowchart of embodiment 2 of an information processing method provided by the present application includes the following steps:
step S201: acquiring target words in a text to be predicted, wherein the target words have at least two pronunciations;
step S201 is the same as step S101 in embodiment 1, and details are not described in this embodiment.
Step S202: generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the text to be predicted, the target word and the at least two pronunciations;
wherein the contents of the pronunciation texts in the at least two groups of pronunciation text sets are related to the contents of the text to be predicted.
The method comprises the steps of generating a plurality of groups of pronunciation text sets based on a text to be predicted, a target word and a plurality of pronunciations of the target word, wherein the generated pronunciation text sets are respectively related to the text to be predicted, the target word and the pronunciations of the target word.
Therefore, in this scheme, the inputs used to generate the pronunciation text sets include the text to be predicted, so the pronunciation texts in the generated sets are related to the text to be predicted. Specifically, their contents are related, that is, the pronunciation texts and the text to be predicted belong to the same or similar domains.
For example, if the text to be predicted belongs to the medical domain, the generated pronunciation texts also belong to the medical domain, which raises the overall similarity between each pronunciation text set and the text to be predicted and improves the accuracy of polyphone pronunciation prediction.
Step S203: and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Step S203 is the same as step S103 in embodiment 1, and details are not described in this embodiment.
In summary, in the information processing method provided in this embodiment, generating the at least two groups of pronunciation text sets comprises generating them based on the text to be predicted, the target word and the at least two pronunciations, with the contents of the pronunciation texts related to the content of the text to be predicted. Because the text to be predicted is one of the inputs used to generate the pronunciation text sets, the generated pronunciation texts are related to it, which raises the overall similarity between each pronunciation text set and the text to be predicted and improves the accuracy of polyphone pronunciation prediction.
As shown in fig. 3, a flowchart of embodiment 3 of an information processing method provided by the present application includes the following steps:
step S301: acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
step S301 is the same as step S201 in embodiment 2, and details are not described in this embodiment.
Step S302: generating a first vector based on the text to be predicted, wherein the first vector represents the content contained in the text to be predicted;
the text to be predicted is specifically a sentence, and the sentence can be composed of one word or a plurality of words.
A pre-configured word2vec network processes the text to be predicted to generate the first vector. The first vector is the embedding representation of the target word within the text to be predicted and encodes the content of that text.
Specifically, the word2vec network has two inputs, one is a text to be predicted, the other is the relative position of the current word with respect to the target word, and the word2vec network output is the embedding representation of the target word contained in the text to be predicted.
The first vector is obtained by processing the whole sentence of the text to be predicted and contains all contents contained in the text to be predicted.
The first vector is a vector aiming at the text to be predicted containing the target word, and the vector implies information based on the position of the target word in the text to be predicted and the like.
Specifically, the word2vec network is trained in an unsupervised manner in advance, so that the trained word2vec network can process an input sentence to generate a sentence-level vector.
It should be noted that such sentence-level vectors are obtained by processing whole sentences; the similarity between the sentences corresponding to two vectors can therefore be determined from the vectors themselves, and the closer the two vectors, the more similar the corresponding sentences.
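As a stand-in for the sentence-level encoder described above (the actual network is word2vec-based, trained unsupervised, and also consumes each word's position relative to the target word), the sketch below averages fixed random character embeddings to obtain a sentence-level vector; everything here is an illustrative assumption, not the disclosed network.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED = {}  # lazily built character-embedding table

def char_vec(ch, dim=8):
    """Return a fixed random embedding for a character (stand-in for a
    trained word2vec embedding)."""
    if ch not in EMBED:
        EMBED[ch] = rng.standard_normal(dim)
    return EMBED[ch]

def sentence_vec(text):
    """Sentence-level vector: the mean of the character vectors, so that
    closer vectors correspond to more similar sentences."""
    return np.mean([char_vec(ch) for ch in text], axis=0)

print(sentence_vec("他还没走").shape)  # (8,)
```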
Step S303: generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the first vector, the target word and the at least two pronunciations;
each group of pronunciation text set corresponds to one pronunciation, each group of pronunciation text comprises at least one pronunciation text, and the content of each pronunciation text is related to the content of the text to be predicted.
Based on the first vector, the target word and the at least two pronunciations of the target word, a plurality of groups of pronunciation text sets corresponding to the plurality of pronunciations are generated.
The first vector is used as a factor for generating the pronunciation text, and the pronunciation text in the generated pronunciation text set is related to the text to be predicted, specifically the contents of the pronunciation text and the text to be predicted are related, namely the pronunciation text and the text to be predicted belong to the same or similar fields.
Correspondingly, the number of generated pronunciation text sets equals the number of pronunciations of the target word. Each set includes one or more pronunciation texts; to improve prediction accuracy, multiple pronunciation texts are generally generated for each set.
In this scheme, prediction combines the target word and its pronunciations with a single first vector generated from the text to be predicted. Compared with combining the target word and its pronunciations with the raw sentence text, the complexity of data processing during prediction is lower.
Step S304: and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Step S304 is the same as step S203 in embodiment 2, and details are not described in this embodiment.
In summary, in the information processing method provided in this embodiment, generating the at least two groups of pronunciation text sets based on the text to be predicted, the target word and the at least two pronunciations includes: generating a first vector based on the text to be predicted, the first vector representing the content of the text to be predicted; and generating the at least two groups of pronunciation text sets based on the first vector, the target word and the at least two pronunciations, wherein each group corresponds to one pronunciation, includes at least one pronunciation text, and the content of each pronunciation text is related to the content of the text to be predicted. Because prediction combines the target word and its pronunciations with a single vector rather than with the raw sentence text, the complexity of data processing during prediction is lower.
As shown in fig. 4, a flowchart of embodiment 4 of an information processing method provided by the present application includes the following steps:
step S401: acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
step S402: generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the target word and the at least two pronunciations;
Steps S401 to S402 are the same as steps S101 to S102 in Embodiment 1 and are not described again in this embodiment.
Step S403: correspondingly generating a vector for each pronunciation text in each group of pronunciation text set;
wherein, the generated vector of each pronunciation text represents the content contained in the pronunciation text.
In the scheme, the similarity between the generated pronunciation text and the text to be predicted is determined according to the difference between the vectors, and the pronunciation of the target character in the text to be predicted is further determined.
And generating a sentence-level vector for each pronunciation text in each group of pronunciation text set, wherein each vector represents the content contained in the corresponding pronunciation text.
Each pronunciation text contains the target word, and the sentence-level vector generated from a pronunciation text incorporates the context information of the target word and represents the entire content of the pronunciation text.
Step S404: processing vectors generated by pronunciation texts belonging to the same group of pronunciation text set to obtain a mean vector corresponding to the pronunciation text set;
and processing vectors generated by all the pronunciation texts in the same pronunciation text set to obtain a mean value vector of the pronunciation text set.
The mean vector represents the overall vector condition of the pronunciation text set, and the mean vectors of different pronunciation text sets are different.
Step S405: determining the similarity between the pronunciation texts corresponding to the at least two pronunciation text sets and the text to be predicted, based on the mean vectors corresponding to the at least two pronunciation text sets and the first vector generated from the text to be predicted;
The mean vector corresponding to each pronunciation text set represents the content of the pronunciation texts in that set.
Specifically, the cosine distance between the first vector generated from the text to be predicted and each mean vector is calculated. This distance represents the difference between the pronunciation text set as a whole and the text to be predicted; the smaller the distance, the higher the similarity between the pronunciation text set and the text to be predicted.
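The cosine-distance comparison can be sketched as follows; the helper names and toy vectors are assumptions for illustration, not the patent's actual data:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pick_pronunciation(first_vector, mean_vectors):
    """mean_vectors: dict mapping pronunciation -> mean vector of its text set.
    Returns the pronunciation whose text set is closest to the text to be predicted."""
    return min(mean_vectors, key=lambda p: cosine_distance(first_vector, mean_vectors[p]))

# Toy example: the first vector lies much closer to the "huan" set's mean vector.
first = [1.0, 0.1, 0.9]
means = {"huan": [0.9, 0.1, 0.95], "hai": [0.05, 0.95, 0.1]}
print(pick_pronunciation(first, means))  # huan
```

Note that selecting by smallest cosine distance directly, as in `pick_pronunciation`, matches the shortcut described below: no separate similarity score needs to be computed first.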
For example, the target word "还" ("also/still") has two pronunciations, "huán" and "hái". The distance between the mean vector of the "huán" pronunciation text set and the first vector is a first distance, and the distance between the mean vector of the "hái" pronunciation text set and the first vector is a second distance. If the first distance is smaller than the second distance, the similarity between the "huán" pronunciation text set and the text to be predicted is higher.
Step S406: and determining the first pronunciation as the pronunciation of the target word in the text to be predicted based on the similarity.
The pronunciation texts in the pronunciation text set corresponding to the first pronunciation satisfy the similarity condition with respect to the text to be predicted.
After the similarity between each pronunciation's text set and the text to be predicted is determined, the similarities are ranked, and the pronunciation corresponding to the pronunciation text set with the highest similarity is selected as the pronunciation of the target word.
The higher the similarity between a pronunciation text set and the text to be predicted, the more similar the pronunciation texts in that set are to the text to be predicted, and the more likely it is that the target word has the same pronunciation in those texts as in the text to be predicted. Therefore, since the pronunciation texts of the set corresponding to the first pronunciation have the highest similarity to the text to be predicted, the first pronunciation is determined as the pronunciation of the target word.
In a specific implementation, after the cosine distances between the mean vectors of the pronunciation text sets and the first vector are determined, the pronunciation corresponding to the set with the smallest cosine distance can be selected directly as the pronunciation of the target word. There is no need to first convert the distances into similarities and then select the pronunciation based on similarity, which reduces the amount of data processing.
In summary, in the information processing method provided in this embodiment, determining the pronunciation of the target word in the text to be predicted based on the similarities between the at least two pronunciation text sets and the text to be predicted includes: generating a vector for each pronunciation text in each pronunciation text set, wherein the vector generated for each pronunciation text represents the content of that text; processing the vectors generated from the pronunciation texts belonging to the same pronunciation text set to obtain the mean vector corresponding to that set; determining the similarity between the pronunciation texts corresponding to the at least two pronunciation text sets and the text to be predicted based on the mean vectors corresponding to the at least two sets and the first vector generated from the text to be predicted; and determining a first pronunciation as the pronunciation of the target word in the text to be predicted based on the similarity, wherein the pronunciation texts in the set corresponding to the first pronunciation satisfy the similarity condition with the text to be predicted. In this scheme, the mean vector of each pronunciation text set is determined from the vectors generated for all pronunciation texts in the set, the cosine distance is calculated between each mean vector and the first vector of the text to be predicted, the similarity between each set and the text to be predicted is determined from the cosine distance, and the pronunciation of the target word is then determined from these similarities. The method is not affected by imbalance in the training corpus, so the accuracy of pronunciation prediction for polyphones is higher.
As shown in fig. 5, a flowchart of embodiment 5 of an information processing method provided by the present application includes the following steps:
step S501: acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
step S502: generating a first vector based on the text to be predicted, wherein the first vector represents the content contained in the text to be predicted;
steps S501 to 502 are the same as steps S301 to 302 in embodiment 3, and are not described in detail in this embodiment.
Step S503: processing the first vector, the target word and the at least two pronunciations based on a target generator to generate at least two groups of pronunciation text sets;
in this scheme, the pronunciation text sets are generated by the target generator of a learning model, where the learning model is a trained conditional GAN (generative adversarial network) model.
The conditional GAN model includes a generator, which generates output content based on input conditions, and a discriminator, which scores the output content; for example, the score is 1 if the output content is close to real and meets the conditions, and 0 if the output content is of low quality or does not meet the conditions.
Fig. 6 is a schematic diagram of the learning model, which includes a generator 601 and a discriminator 602. The generator generates output content based on the input conditions, and the discriminator judges that output. The goal of the generator is to generate content as realistic as possible based on the input conditions, so as to deceive the discriminator; the goal of the discriminator is to separate the generator's output from real content as well as possible. The generator and the discriminator thus form a dynamic "game process".
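The generator/discriminator interplay described here can be illustrated structurally with stub classes; this is a toy sketch only — in a real conditional GAN both parts are trained neural networks, and the class and method names below are assumptions, not from the patent:

```python
import random

class Generator:
    """Stub generator: maps (condition, random noise) to output content.
    In the real model this is a trained network."""
    def generate(self, condition, noise):
        # Echo the condition so the sample trivially "meets" it.
        return f"{condition}-sample-{noise:.2f}"

class Discriminator:
    """Stub discriminator: scores content 1 if it looks real and
    matches the condition, else 0."""
    def score(self, content, condition):
        return 1 if content.startswith(condition) else 0

gen, disc = Generator(), Discriminator()
noise = random.random()                # the random signal required alongside the conditions
sample = gen.generate("huan", noise)   # generator output for condition "huan"
print(disc.score(sample, "huan"))      # 1: output meets the condition
print(disc.score("junk text", "huan")) # 0: output fails the condition
```

During training, the generator is updated to push its scores toward 1 while the discriminator is updated to assign 1 only to real, condition-matching content — the "game" the text describes.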
Specifically, with the target generator of the trained conditional GAN model, the first vector generated from the text to be predicted, the target word, and the at least two pronunciations of the target word are input into the target generator as input conditions, so that the target generator generates at least two sets of pronunciation texts based on those conditions.
In particular, due to the working mechanism of the generator, a random signal needs to be input into the generator in addition to the input conditions, so that the target generator can generate content based on the input conditions.
Because the target generator is the generator of a trained conditional GAN model, it can generate output content that meets the input conditions; that is, the at least two sets of pronunciation texts generated by the target generator are consistent with the text to be predicted used as an input condition. Specifically, this consistency may mean belonging to the same domain.
Step S504: and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Step S504 is the same as step S304 in embodiment 3, and details are not described in this embodiment.
In summary, in the information processing method provided in this embodiment, generating the at least two sets of pronunciation texts corresponding to the at least two pronunciations based on the first vector, the target word, and the at least two pronunciations includes: generating the at least two sets of pronunciation texts by having a target generator process the first vector, the target word, and the at least two pronunciations. In this scheme, the first vector generated from the text to be predicted, the target word, and the at least two pronunciations of the target word are used as input conditions and input into the target generator of a conditional GAN model, so that the target generator generates at least two sets of pronunciation texts that meet the input conditions.
As shown in fig. 7, a flowchart of embodiment 6 of an information processing method provided by the present application includes the following steps:
step S701: training an original learning model based on a training text set, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model;
the target learning model comprises a target generator and a target discriminator; the training text set comprises the target polyphone and a marked pronunciation, where the marked pronunciation is the pronunciation marked for the target polyphone in the training text.
A training text set is preset. The training text set contains a large number of training texts, each of which includes at least one polyphone, and each polyphone is marked with its pronunciation.
And training the original learning model based on the training text set to obtain a target learning model.
The training text set contains corpora for both common and uncommon pronunciations.
In the scheme, the original learning model is trained, and the training process comprises training of the generator and the discriminator to obtain the target learning model.
The target generator in the target learning model can generate output content that meets the input conditions. This ensures that, in subsequent steps, pronunciation texts containing the target word can be generated based on the text to be predicted, the target word contained in it, and the multiple pronunciations of the target word, and that the generated pronunciation texts match the target word and pronunciations in the input text to be predicted. Specifically, each pronunciation text contains the target word with one of its pronunciations, and the pronunciation texts belong to the same domain as the text to be predicted.
It should be noted that the training text set typically contains many corpora for common pronunciations and few for uncommon ones. In this scheme, however, pronunciation texts containing the target word are generated from the text to be predicted, the target word, and its multiple pronunciations, and the pronunciation of the target word in the text to be predicted is determined from the similarity between those pronunciation texts and the text to be predicted. The method is therefore not limited by the number of corpora per pronunciation in the training text set, the predicted pronunciation is not affected by corpus imbalance, and the prediction accuracy is higher.
Step S702: acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
step S703: generating a first vector based on the text to be predicted, wherein the first vector represents the content contained in the text to be predicted;
step S704: processing the first vector, the target word and the at least two pronunciations based on a target generator to generate at least two groups of pronunciation text sets;
step S705: and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Steps S702 to 705 are the same as steps S501 to 504 in embodiment 5, and are not described in detail in this embodiment.
In summary, the information processing method provided in this embodiment further includes: training an original learning model based on a training text set, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model, wherein the target learning model comprises a target generator and a target discriminator, the training text set comprises the target polyphone and a marked pronunciation, and the marked pronunciation is a pronunciation marked by the target polyphone in the training text. In the scheme, pronunciation is labeled on polyphonic characters contained in a set training text set, an original learning model is trained on the training text set labeled with the pronunciation to obtain a target learning model, a target generator in the target learning model can generate a pronunciation text containing the target characters based on a text to be predicted, the target characters contained in the text to be predicted and a plurality of pronunciations of the target characters, the generated pronunciation text is in accordance with the target characters and the pronunciations in the input text to be predicted, the pronunciation text contains the target characters and the corresponding pronunciations, and the pronunciation text and the text to be predicted belong to the same field, so that a basis is provided for determining the pronunciation of the target characters in the text to be predicted based on the pronunciations text set subsequently.
As shown in fig. 8, a flowchart of embodiment 7 of an information processing method provided by the present application includes the following steps:
step S801: acquiring a training text set, wherein the training text set comprises at least two groups of training texts, and each group of training texts corresponds to one marked pronunciation;
the training text set comprises a plurality of groups of training texts, each group of training texts comprises polyphones with the same pronunciation, and the polyphones in each training text are labeled with the pronunciation.
In a specific implementation, in the training text set, each pronunciation of each polyphone corresponds to a group of training texts, and each group of training texts may include a plurality of training texts.
Step S802: generating a training vector based on the training text;
wherein the training vector represents content contained in the training text.
The training text is processed using a preset word2vec network to generate a training vector; the training vector is obtained by processing the entire sentence of the training text and covers all of its content.
Specifically, all training texts in the training text set are sequentially processed to generate a corresponding number of training vectors.
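One common scheme for turning a whole sentence into a single vector with word2vec-style embeddings is to average the word vectors of its tokens; the patent does not specify the exact configuration, and the tiny vector table below is purely illustrative:

```python
import numpy as np

# Toy word-vector table standing in for a trained word2vec model (assumed, for illustration).
word_vectors = {
    "return": [1.0, 0.0],
    "book": [0.8, 0.2],
    "still": [0.1, 0.9],
}

def sentence_vector(tokens, table):
    """Sentence-level vector: average of the word vectors of the known tokens."""
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(np.asarray(vecs, dtype=float), axis=0)

v = sentence_vector(["return", "book"], word_vectors)
print(v)  # [0.9 0.1]
```

Applied to every training text in the set, this yields one training vector per text, as the step above requires.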
Step S803: inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to obtain a target generator so that the target generator generates a first text;
The training vector, the target polyphone, and the marked pronunciation of the target polyphone are used as input conditions and input into the original generator, and the original generator is trained to obtain the target generator. The target generator can then perform further processing based on the training text set to generate a first text.
Wherein the generated first text is in accordance with the input condition input to the target generator.
Specifically, since the learning model includes two parts, the generator and the discriminator, the two parts are trained separately in the training process: the generator is trained first to obtain the target generator, and then the discriminator is trained based on the target generator.
Wherein, the inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to obtain a target generator comprises:
step S01: inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to generate a second text;
the training vector generated from the training text, the target polyphone in the training text, and its marked pronunciation are input into the original generator as input conditions, and the original generator generates a second text as output content based on these conditions.
Because the original generator has not been trained, the output content it generates from the input conditions is random. To make the generator's output meet the input conditions, the original generator needs to be adjusted based on its output content.
Step S02: adding a label judged to be false to the second text, taking the labeled second text as a training text, and returning to the step of generating a training vector based on the training text until a second preset stop-training condition is met, to obtain the target generator;
in the process of training the original generator, if a second text generated by the original generator does not meet the input conditions, the trainer adds a label judged to be false to the second text and adds it to the training text set as a training text. Training vectors continue to be generated from the training text set and input into the generator, so that the generator continues to generate texts.
A second preset stop-training condition is set for training the generator; for example, the condition may be a number of training cycles.
For example, if the number of training cycles is set to 100, training of the generator stops once the original generator has been trained 100 times on the training text set, yielding the target generator.
Step S03: and inputting the training vector, the target polyphone and the marked pronunciation as input conditions into a target generator so that the target generator generates a first text.
The training vector generated from an original training text, the target polyphone it contains, and the marked pronunciation of that polyphone are used as input conditions and input into the target generator, which generates a first text; the first text is subsequently input into the original discriminator to train it.
In the process of training the generator, a random signal, the training vector, the target polyphone, and the marked pronunciation are input into the original generator, and the original generator is trained to obtain the target generator.
Due to the working mechanism of the generator, a random signal needs to be input into the original generator in addition to the input conditions, so that it can generate content based on the input conditions.
Step S804: taking the first text and any training text as input content of the original discriminator, so that the original discriminator judges the authenticity of the first text and the training text, until a first preset stop-training condition is met, to obtain the target learning model;
the target learning model comprises a target generator and a target discriminator.
The first text generated by the target generator based on a training text, together with any training text, is input into the original discriminator as input content, so that the original discriminator judges the authenticity of the first text and the training text.
Since the original discriminator has not been trained, its authenticity judgments may be correct or incorrect; for example, it may judge a real training text to be false.
The first preset stop-training condition may be set as needed, for example, a preset number of training iterations, or the loss function (LOSS) value of the output result no longer decreasing.
Specifically, each time the original discriminator is trained, the target generator generates a first text, and the first text and a training text are input into the original discriminator as input content for training.
When the first preset stop-training condition is a number of training iterations, the number of times the original discriminator has been trained with training texts and first texts is recorded, and training ends once that number reaches the set value.
When the first preset stop-training condition is that the loss function no longer decreases, the judgment result output at each iteration is recorded and its loss is calculated with the loss function; training ends once the loss of the output judgment results no longer decreases.
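Both stop-training conditions described above (a fixed iteration budget, and the loss no longer decreasing) can be sketched as follows; the function name, patience window, and thresholds are illustrative assumptions, not from the patent:

```python
def should_stop(iteration, losses, max_iters=100, patience=3, tol=1e-4):
    """Stop when the iteration budget is reached, or when the loss has not
    decreased by more than `tol` over the last `patience` recorded values."""
    if iteration >= max_iters:
        return True  # condition 1: preset number of training iterations reached
    if len(losses) > patience:
        recent_best = min(losses[-patience:])
        earlier_best = min(losses[:-patience])
        if earlier_best - recent_best <= tol:
            return True  # condition 2: loss no longer decreasing
    return False

print(should_stop(100, [0.9, 0.5, 0.2]))           # True: iteration budget reached
print(should_stop(10, [0.5, 0.4, 0.4, 0.4, 0.4]))  # True: loss plateaued
print(should_stop(10, [0.9, 0.7, 0.5, 0.3, 0.1]))  # False: loss still decreasing
```

A check like this would be called once per training iteration of the discriminator, with the recorded loss appended before each call.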
When training of the original discriminator is finished, training of the learning model is complete, and the target generator and the target discriminator are obtained.
Step S805: acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
step S806: generating a first vector based on the text to be predicted, wherein the first vector represents the content contained in the text to be predicted;
step S807: processing the first vector, the target word and the at least two pronunciations based on a target generator to generate at least two groups of pronunciation text sets;
step S808: and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Steps S805 to 808 are the same as steps S702 to 705 in embodiment 6, and are not described in detail in this embodiment.
In summary, in the information processing method provided in this embodiment, training the original learning model based on the training text set, the target polyphone, and the at least two pronunciations of the target polyphone to obtain the target learning model includes: acquiring a training text set, wherein the training text set includes at least two groups of training texts and each group of training texts corresponds to one marked pronunciation; generating a training vector based on the training text, wherein the training vector represents the content contained in the training text; inputting the training vector, the target polyphone, and the marked pronunciation as input conditions into the original generator to obtain the target generator, so that the target generator generates a first text; and taking the first text and any training text as input content of the original discriminator, so that the original discriminator judges the authenticity of the first text and the training text, until the first preset stop-training condition is met, to obtain the target learning model, wherein the target learning model includes the target generator and the target discriminator. This embodiment describes the training process of the target learning model; the original learning model is trained through this process to obtain the trained target learning model, and the target generator in the target learning model is used to generate multiple pronunciation texts in the subsequent pronunciation prediction for the target word in the text to be predicted.
Corresponding to the embodiment of the information processing method provided by the application, the application also provides an embodiment of a device applying the information processing method.
Fig. 9 is a schematic structural diagram of an embodiment of an information processing apparatus provided in the present application, where the apparatus includes the following structures: an acquisition module 901, a generation module 902 and a determination module 903;
the acquiring module 901 is configured to acquire a target word in a text to be predicted, where the target word has at least two pronunciations;
the generating module 902 is configured to generate at least two sets of pronunciation text sets corresponding to at least two pronunciations based on the target word and the at least two pronunciations, where each set of pronunciation text corresponds to one pronunciation, and each set of pronunciation text includes at least one pronunciation text;
the determining module 903 is configured to determine the pronunciation of a target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
Optionally, the generating module is configured to:
and generating at least two groups of pronunciation text sets corresponding to the at least two pronunciations based on the text to be predicted, the target word and the at least two pronunciations, wherein the contents of the pronunciation texts in the at least two groups of pronunciation text sets are related to the contents of the text to be predicted.
Optionally, the generating module includes:
a vector unit, configured to generate a first vector based on the text to be predicted, where the first vector represents content included in the text to be predicted;
and the generating unit is used for generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the first vector, the target word and the at least two pronunciations, each group of pronunciation text set corresponds to one pronunciation, each group of pronunciation text contains at least one pronunciation text, and the content of each pronunciation text is related to the content of the text to be predicted.
Optionally, the determining module includes:
correspondingly generating a vector for each pronunciation text in each group of pronunciation text set, wherein the generated vector for each pronunciation text represents the content contained in the pronunciation text;
processing vectors generated by pronunciation texts belonging to the same group of pronunciation text sets to obtain mean vectors corresponding to the pronunciation text sets;
determining the similarity between the pronunciation texts corresponding to the at least two pronunciation text sets and the text to be predicted based on the mean vectors corresponding to the at least two pronunciation text sets and a first vector generated from the text to be predicted;
and determining a first pronunciation as the pronunciation of a target word in the text to be predicted based on the similarity, wherein the pronunciation text in the pronunciation text set corresponding to the first pronunciation meets the similar condition with the text to be predicted.
Optionally, the generating unit is configured to:
and generating the at least two groups of pronunciation text sets based on the target generator processing the first vector, the target word and the at least two pronunciations.
Optionally, the method further includes:
the training module is used for training an original learning model based on a training text set, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model, wherein the target learning model comprises a target generator and a target discriminator, the training text set comprises the target polyphone and a marked pronunciation, and the marked pronunciation is the pronunciation marked by the target polyphone in the training text.
Optionally, the training module is specifically configured to:
acquiring a training text set, wherein the training text set comprises at least two groups of training texts, and each group of training texts corresponds to one marked pronunciation;
generating a training vector based on the training text, wherein the training vector represents content contained in the training text;
inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to obtain a target generator so that the target generator generates a first text;
and taking the first text and any training text as input content of the original discriminator, so that the original discriminator judges the authenticity of the first text and the training text, until a first preset stop-training condition is met, to obtain a target learning model, wherein the target learning model comprises a target generator and a target discriminator.
Optionally, the training module is specifically configured to:
inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to generate a second text;
adding a label judged to be false to the second text, taking the labeled second text as a training text, and returning to the step of generating a training vector based on the training text until a second preset stop-training condition is met, to obtain a target generator;
and inputting the training vector, the target polyphone and the marked pronunciation as input conditions into a target generator so that the target generator generates a first text.
In the present application, the functions of the information processing apparatus are explained with reference to the method embodiments, which are not described in detail in the present embodiment.
In summary, the present application provides an information processing apparatus that generates multiple pronunciation texts based on a polyphone in the text to be predicted, and then selects the pronunciation of the polyphone in the text to be predicted based on the multiple pronunciation texts and the text to be predicted. The apparatus is not affected by training corpus imbalance, so the accuracy of the predicted pronunciation of the polyphone is higher.
Corresponding to the embodiment of the information processing method provided by the application, the application also provides the electronic equipment and the readable storage medium corresponding to the information processing method.
Wherein, this electronic equipment includes: a memory, a processor;
wherein, the memory stores a processing program;
the processor is configured to load and execute the processing program stored in the memory to implement the steps of the information processing method according to any one of the above.
Specifically, the information processing method implemented by the electronic device may refer to the embodiment of the information processing method.
Wherein the readable storage medium has stored thereon a computer program, which is called and executed by a processor, implementing the steps of the information processing method according to any one of the preceding claims.
Specifically, the computer program stored in the readable storage medium executes the information processing method, and the information processing method embodiments described above may be referred to.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device provided by the embodiment, the description is relatively simple because the device corresponds to the method provided by the embodiment, and the relevant points can be referred to the method part for description.
The previous description of the provided embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features provided herein.

Claims (10)

1. An information processing method comprising:
acquiring a target word in a text to be predicted, wherein the target word has at least two pronunciations;
generating at least two groups of pronunciation text sets corresponding to the at least two pronunciations based on the target word and the at least two pronunciations, wherein each group of pronunciation text set corresponds to one pronunciation and comprises at least one pronunciation text;
and determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
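The overall flow of claim 1 can be sketched as follows. This is an illustrative toy, not the patented implementation: the example sentences, the word-count vectorization, and the set-level averaging are all assumptions standing in for the learned generator and vectors described in the later claims.

```python
# Toy sketch of claim 1: choose the pronunciation whose example-text set is
# most similar to the text to be predicted. Word-count cosine similarity is
# an assumed stand-in for the learned vectors of the claims.
from collections import Counter
import math

def vectorize(text):
    # Represent a text as a word-count vector.
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_pronunciation(text_to_predict, pronunciation_texts):
    # pronunciation_texts: {pronunciation: [pronunciation text, ...]}
    target = vectorize(text_to_predict)
    best, best_sim = None, -1.0
    for pron, texts in pronunciation_texts.items():
        # Similarity of a set = mean similarity of its member texts.
        sim = sum(cosine(vectorize(t), target) for t in texts) / len(texts)
        if sim > best_sim:
            best, best_sim = pron, sim
    return best

# Hypothetical pronunciation text sets for a polyphone with two readings.
sets = {
    "zhong4": ["the box is heavy", "a heavy weight to lift"],
    "chong2": ["repeat the step again", "do it again and repeat"],
}
print(predict_pronunciation("lift the heavy box", sets))  # → zhong4
```

The text to be predicted shares content words with the "zhong4" set, so that pronunciation is selected; a learned embedding would replace the word counts in practice.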
2. The method of claim 1, generating at least two sets of pronunciation text sets corresponding to at least two pronunciations based on at least the target word and the at least two pronunciations, comprising:
and generating at least two groups of pronunciation text sets corresponding to the at least two pronunciations based on the text to be predicted, the target word and the at least two pronunciations, wherein the contents of the pronunciation texts in the at least two groups of pronunciation text sets are related to the contents of the text to be predicted.
3. The method of claim 2, generating at least two sets of pronunciation text sets corresponding to at least two pronunciations based on the text to be predicted, the target word, and the at least two pronunciations, comprising:
generating a first vector based on the text to be predicted, wherein the first vector represents the content contained in the text to be predicted;
and generating at least two groups of pronunciation text sets corresponding to the at least two pronunciations based on the first vector, the target word and the at least two pronunciations, wherein each group of pronunciation text set corresponds to one pronunciation and comprises at least one pronunciation text, and the content of each pronunciation text is related to the content of the text to be predicted.
4. The method according to any one of claims 1-3, wherein determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two sets of pronunciation text sets and the text to be predicted comprises:
correspondingly generating a vector for each pronunciation text in each group of pronunciation text set, wherein the vector generated by each pronunciation text represents the content contained in the pronunciation text;
processing the vectors generated by the pronunciation texts belonging to the same pronunciation text set to obtain a mean vector corresponding to that pronunciation text set;
determining the similarity between the pronunciation texts corresponding to the at least two pronunciation text sets and the text to be predicted based on the mean vectors corresponding to the at least two pronunciation text sets and the first vector generated from the text to be predicted;
and determining a first pronunciation as the pronunciation of the target word in the text to be predicted based on the similarity, wherein the pronunciation texts in the pronunciation text set corresponding to the first pronunciation satisfy a similarity condition with the text to be predicted.
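The mean-vector comparison of claim 4 can be sketched numerically. The three-dimensional embeddings below are assumed toy values, not outputs of the patented model; the claim itself says nothing about dimensionality or how the vectors are produced.

```python
# Sketch of the claim-4 steps with assumed toy embeddings: average the
# vectors of each pronunciation text set into a mean vector, then compare
# each mean vector to the first vector of the text to be predicted.
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical per-text vectors for two pronunciation text sets.
set_a = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]   # texts for pronunciation A
set_b = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]   # texts for pronunciation B
first_vector = [0.95, 0.05, 0.05]            # text to be predicted

sims = {
    "A": cosine(mean_vector(set_a), first_vector),
    "B": cosine(mean_vector(set_b), first_vector),
}
# The pronunciation whose set satisfies the similarity condition here is
# simply the one with the highest cosine similarity.
best = max(sims, key=sims.get)
print(best)  # → A
```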
5. The method of claim 3, wherein generating at least two sets of pronunciation text sets corresponding to at least two pronunciations based on the first vector, the target word, and the at least two pronunciations comprises:
processing the first vector, the target word and the at least two pronunciations with a target generator to generate the at least two groups of pronunciation text sets.
6. The method of claim 5, before obtaining the target word in the text to be predicted, further comprising:
training an original learning model based on a training text set, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model, wherein the target learning model comprises the target generator and a target discriminator, the training text set comprises the target polyphone and a marked pronunciation, and the marked pronunciation is the pronunciation marked for the target polyphone in the training text set.
7. The method of claim 6, wherein the training an original learning model based on a set of training texts, a target polyphone and at least two pronunciations of the target polyphone to obtain a target learning model comprises:
acquiring a training text set, wherein the training text set comprises at least two groups of training texts, and each group of training texts corresponds to one marked pronunciation;
generating a training vector based on the training text, wherein the training vector represents content contained in the training text;
inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to obtain a target generator so that the target generator generates a first text;
and taking the first text and any training text as input to an original discriminator so that the original discriminator judges whether the first text and the training text are real, until a first agreed training-stop condition is met, to obtain a target learning model, wherein the target learning model comprises the target generator and a target discriminator.
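The adversarial training loop of claim 7 can be sketched structurally. The generator and discriminator below are trivial rule-based stand-ins (assumed, not the patented neural models), and the "training step" counter stands in for gradient updates; only the alternation-until-stop-condition structure mirrors the claim.

```python
# Structural sketch of the claim-7 loop: generate a first text conditioned
# on (vector, polyphone, pronunciation), let the discriminator judge it
# against real training texts, and repeat until a first agreed
# training-stop condition is met.

def generator(vector, polyphone, pronunciation, step):
    # Toy conditional generator: output quality improves with the number
    # of training steps (noise characters shrink to zero).
    noise = max(0, 5 - step)
    return f"{polyphone}/{pronunciation}" + "?" * noise

def discriminator(text, training_texts):
    # Toy discriminator: a text is judged real only if it matches the
    # surface form of a marked training text.
    return text in training_texts

training_texts = {"重/zhong4", "重/chong2"}
step = 0
while True:
    first_text = generator(None, "重", "zhong4", step)
    # First agreed training-stop condition (assumed here): the
    # discriminator can no longer tell the generated text from a real one.
    if discriminator(first_text, training_texts):
        break
    step += 1

print(step, first_text)  # → 5 重/zhong4
```

A real implementation would alternate gradient updates of both networks; the stop condition in the claim is left abstract, so equilibrium-style convergence is only one possible reading.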
8. The method of claim 7, wherein inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to obtain a target generator comprises:
Inputting the training vector, the target polyphone and the marked pronunciation as input conditions into an original generator to generate a second text;
adding a label of being judged false to the second text, taking the labeled second text as a training text, and returning to the step of generating a training vector based on the training text until a second agreed training-stop condition is met, to obtain the target generator;
and inputting the training vector, the target polyphone and the marked pronunciation as input conditions into a target generator so that the target generator generates a first text.
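The feedback step of claim 8 can be sketched as follows: each generated second text is labeled false and returned to the training set, enlarging it for the next round. The fixed round count used as the second stop condition, and all data below, are assumptions for illustration only.

```python
# Sketch of the claim-8 pre-training feedback: label each generated second
# text as false, add it to the training texts, and regenerate training
# vectors over the enlarged set until a second agreed training-stop
# condition (here, an assumed fixed round count) is met.
labelled_training_texts = [("重/zhong4 real sample", True)]
SECOND_STOP_ROUNDS = 3

def generate_second_text(round_index):
    # Hypothetical generator output for this round.
    return f"generated text round {round_index}"

for round_index in range(SECOND_STOP_ROUNDS):
    second_text = generate_second_text(round_index)
    # Attach the judged-false label and return the text to the training
    # set, so the next round trains against it as a negative example.
    labelled_training_texts.append((second_text, False))

print(len(labelled_training_texts))  # → 4 (1 real + 3 false-labelled)
```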
9. An information processing apparatus comprising:
the system comprises an acquisition module, a prediction module and a prediction module, wherein the acquisition module is used for acquiring a target word in a text to be predicted, and the target word is provided with at least two pronunciations;
the generating module is used for generating at least two groups of pronunciation text sets corresponding to at least two pronunciations based on the target word and the at least two pronunciations, wherein each group of pronunciation text set corresponds to one pronunciation, and each group of pronunciation text comprises at least one pronunciation text;
and the determining module is used for determining the pronunciation of the target word in the text to be predicted based on the similarity between the at least two groups of pronunciation text sets and the text to be predicted.
10. An electronic device, comprising: a memory, a processor;
wherein the memory stores a processing program;
and the processor is configured to load and execute the processing program stored in the memory to implement the steps of the information processing method according to any one of claims 1 to 8.
CN202210268350.0A 2022-03-18 2022-03-18 Information processing method and device and electronic equipment Pending CN114742044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210268350.0A CN114742044A (en) 2022-03-18 2022-03-18 Information processing method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN114742044A true CN114742044A (en) 2022-07-12

Family

ID=82276404



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310619A (en) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Polyphone prediction technique, device, equipment and computer readable storage medium
CN111599340A (en) * 2020-07-27 2020-08-28 南京硅基智能科技有限公司 Polyphone pronunciation prediction method and device and computer readable storage medium
CN112580335A (en) * 2020-12-28 2021-03-30 建信金融科技有限责任公司 Method and device for disambiguating polyphone
CN113268974A (en) * 2021-05-18 2021-08-17 平安科技(深圳)有限公司 Method, device and equipment for marking pronunciations of polyphones and storage medium
CN113486672A (en) * 2021-07-27 2021-10-08 腾讯音乐娱乐科技(深圳)有限公司 Method for disambiguating polyphone, electronic device and computer readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination