CN110277085B - Method and device for determining polyphone pronunciation - Google Patents

Method and device for determining polyphone pronunciation

Info

Publication number
CN110277085B
Authority
CN
China
Prior art keywords
character
polyphone
sequence
sample
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910555855.3A
Other languages
Chinese (zh)
Other versions
CN110277085A (en
Inventor
吴志勇
代东洋
康世胤
苏丹
俞栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910555855.3A priority Critical patent/CN110277085B/en
Publication of CN110277085A publication Critical patent/CN110277085A/en
Application granted granted Critical
Publication of CN110277085B publication Critical patent/CN110277085B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 — Clustering; Classification
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/30 — Semantic analysis
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 — Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method and a device for determining the pronunciation of a polyphone. The method comprises: firstly obtaining a character sequence to be detected corresponding to a text to be detected containing a target polyphone; inputting the character sequence to be detected into a first input end of a pre-constructed polyphone disambiguation model; inputting, into a second input end of the pre-constructed polyphone disambiguation model, the position of the sequence corresponding to the target polyphone in the character sequence to be detected; and obtaining the predicted pronunciation corresponding to the target polyphone through the polyphone disambiguation model. Because the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone, the pronunciation corresponding to the target polyphone can be obtained accurately.

Description

Method and device for determining polyphone pronunciation
Technical Field
The present application relates to the field of voice communication technologies, and in particular, to a method and an apparatus for determining pronunciations of polyphones.
Background
Word-to-sound conversion is to convert characters into their corresponding pronunciations. It can be applied in many scenarios, for example speech synthesis (Text To Speech, TTS), where the accuracy of word-to-sound conversion directly influences the intelligibility of the synthesized speech.
If a character has only one pronunciation, i.e. the character is not a polyphone, the pronunciation can be determined by looking up a dictionary. If a character is a polyphone, how to determine the pronunciation corresponding to the character is a difficult problem for those skilled in the art.
Disclosure of Invention
In view of the above, the present application provides a method and apparatus for determining the pronunciation of polyphones.
In order to achieve the above purpose, the present application provides the following technical solutions:
a method of determining a polyphonic pronunciation, comprising:
acquiring a character sequence to be detected corresponding to a text to be detected containing a target polyphone; the text to be tested comprises a plurality of characters, wherein the character sequence to be tested comprises sequences corresponding to the characters respectively;
inputting the character sequence to be tested into a first input end of a pre-constructed polyphone disambiguation model; inputting a sequence corresponding to the target polyphone at the position of the character sequence to be detected into a second input end of the pre-constructed polyphone disambiguation model;
obtaining a predicted pronunciation corresponding to the target polyphone through the polyphone disambiguation model;
wherein the polyphonic disambiguation model has the capability of trending the predicted pronunciation corresponding to the target polyphonic towards the actual pronunciation of the target polyphonic.
An apparatus for determining the pronunciation of a polyphone, comprising:
a first acquisition module, configured to acquire a character sequence to be tested corresponding to a text to be tested containing a target polyphone; the text to be tested comprises a plurality of characters, and the character sequence to be tested comprises the sequences respectively corresponding to the plurality of characters;
a second acquisition module, configured to input the character sequence to be tested into a first input end of a pre-constructed polyphone disambiguation model, and to input, into a second input end of the pre-constructed polyphone disambiguation model, the position of the sequence corresponding to the target polyphone in the character sequence to be tested;
a third acquisition module, configured to obtain the predicted pronunciation corresponding to the target polyphone through the polyphone disambiguation model;
wherein the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone.
According to the technical scheme, in the method for determining the pronunciation of polyphones, a character sequence to be detected corresponding to a text to be detected containing the target polyphone is first obtained; the character sequence to be detected is input into a first input end of a pre-constructed polyphone disambiguation model, and the position of the sequence corresponding to the target polyphone in the character sequence to be detected is input into a second input end of the pre-constructed polyphone disambiguation model; the predicted pronunciation corresponding to the target polyphone is then obtained through the polyphone disambiguation model. Because the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone, the pronunciation corresponding to the target polyphone can be obtained accurately.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of one implementation of a polyphonic disambiguation model provided in an embodiment of the present application;
FIG. 2 is a flow chart of one implementation of a method for determining a polyphonic pronunciation provided by an embodiment of the present application;
fig. 3 is a structural diagram of an implementation of a manner of obtaining a character sequence to be detected according to an embodiment of the present application;
FIG. 4 is a block diagram of another implementation of a polyphonic disambiguation model provided in an embodiment of the present application;
FIG. 5 is a diagram illustrating one implementation of obtaining a predicted pronunciation of a target polyphone based on a polyphone disambiguation model according to an embodiment of the present application;
fig. 6a to 6b are structural diagrams of two implementations of the first classifier provided in the present application;
FIG. 7 is a schematic diagram of a process for training a first neural network submodel to obtain a semantic feature extractor according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of one implementation of an apparatus for determining pronunciations of polyphones provided in an embodiment of the present application;
fig. 9 is a block diagram of an implementation manner of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides a method and a device for determining pronunciations of polyphones.
The polyphonic pronunciation determining device may include a polyphonic pronunciation determining device operating in the terminal and a polyphonic pronunciation determining device operating in the background server/platform.
The terminal may be an electronic device such as a desktop computer, a mobile terminal (e.g., a smartphone), a tablet such as an iPad, etc. The polyphonic pronunciation determining device running in the background server/platform may be one hardware component of the server/platform or may be a functional module or component.
The background server or the platform may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center.
The following describes a method and an apparatus for determining the pronunciation of polyphones according to the present application.
Fig. 1 is a block diagram of an implementation manner of a polyphonic disambiguation model provided in an embodiment of the present application.
As shown in fig. 1, a character sequence to be tested corresponding to a text to be tested containing a target polyphone may be input at a first input terminal of the polyphone disambiguation model 11.
In an optional embodiment, the text to be tested may be a text containing polyphones, such as a Chinese text, a Japanese text, a Korean text, or an English text. A text to be tested includes a plurality of characters, and a character may represent a Chinese character, a Japanese character, a Korean character, an English word, etc.
Polyphones are characters having two or more pronunciations, where different pronunciations have different meanings, different usages, and different parts of speech. For example, the English word "desert" may be pronounced /ˈdezət/, in which case it may express the noun "desert" or the adjective meaning "desert-like"; the pronunciation of "desert" may also be /dɪˈzɜːt/, in which case it expresses meanings such as the verb "to abandon". For another example, the Chinese character "为" may be pronounced "wei2", in which case it may mean "to act as" or "to be"; the pronunciation of "为" may also be "wei4", in which case it may mean "for the object of an action", "to give", etc.
It is to be understood that the text to be tested containing the target polyphone is data that cannot be recognized by the polyphone disambiguation model, and in an alternative embodiment, the text to be tested containing the target polyphone may be converted into a sequence of characters to be tested that can be recognized by the polyphone disambiguation model.
The target polyphone mentioned in the embodiment of the present application is to be distinguished from polyphones included in the sample text, and is not specific to a certain polyphone, and the target polyphone may be any polyphone.
In an optional embodiment, the sequence of characters to be tested corresponding to the text to be tested includes sequences corresponding to a plurality of characters included in the text to be tested, respectively.
In an alternative embodiment, the characters may be converted to their corresponding sequences using one-hot encoding. One-hot encoding, also known as one-bit-efficient encoding, uses an N-bit status register to encode N states, each having a separate register bit and only one of which is active at any one time.
For example, assuming that the dictionary includes 20,000 characters, and that "我" ("I") in "我在古都西安" ("I am in the ancient capital Xi'an") is located at, say, the 19,999th position of the dictionary, then, optionally, the sequence corresponding to "我" may be the 20,000-dimensional vector [0,0,...,0,1,0], i.e., only the 19,999th digit of the sequence corresponding to "我" has the value "1", and all the others are 0.
In an optional embodiment, the sequence corresponding to the character may also be obtained by processing the one-hot code again after obtaining the one-hot code corresponding to the character, which is not limited in this embodiment of the present application.
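As an illustration, the one-hot encoding described above can be sketched as follows; the toy dictionary, its size, and the function names are assumptions made purely for this example, not taken from the patent:

```python
# A minimal sketch of one-hot encoding characters against a fixed dictionary.
import numpy as np

def build_dictionary(corpus_chars):
    """Assign every distinct character a unique index."""
    return {ch: idx for idx, ch in enumerate(sorted(set(corpus_chars)))}

def one_hot(char, dictionary):
    """Return an N-dimensional vector whose only 1 sits at the character's index."""
    vec = np.zeros(len(dictionary), dtype=np.float32)
    vec[dictionary[char]] = 1.0
    return vec

dictionary = build_dictionary("我在古都西安你他")       # toy dictionary; a real one may hold ~20,000 characters
sequence = [one_hot(ch, dictionary) for ch in "我在古都西安"]  # one row vector per character
```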
In an optional embodiment, the text to be tested can also be directly input into the polyphonic disambiguation model, and the polyphonic disambiguation model can convert the text to be tested into the character sequence to be tested which can be recognized by the polyphonic disambiguation model.
Suppose that the character sequence to be detected corresponding to "我在古都西安" consists of six one-hot row vectors: the sequence corresponding to "我", the sequence corresponding to "在", the sequence corresponding to "古", the sequence corresponding to "都", the sequence corresponding to "西", and the sequence corresponding to "安"; that is, each row vector is the sequence corresponding to one character. The target polyphone "都" is located at the 4th position of the character sequence to be detected.
The polyphone disambiguation model shown in FIG. 1 may output the pronunciation of the target polyphone, e.g., "du1" for "都", based on the character sequence to be detected and the position where the sequence corresponding to the target polyphone is located. Here, "du" in "du1" is the letter component of the pronunciation of the target polyphone, and "1" is the tone of the target polyphone; for example, "1" indicates the first tone (yinping), "2" the second tone (yangping), "3" the third tone (shangsheng), and "4" the fourth tone (qusheng).
The polyphonic disambiguation model 11 shown in fig. 1 does not include a function of converting the text to be tested into the character sequence to be tested, and alternatively, the polyphonic disambiguation model 11 may include a function of converting the text to be tested into the character sequence to be tested.
The polyphone disambiguation model provided by the embodiment of the application has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone, so that the pronunciation of the target polyphone can be accurately predicted with high probability.
The method for determining the pronunciation of polyphones provided by the present application is described in detail below with reference to fig. 1.
As shown in fig. 2, a flowchart of an implementation manner of a method for determining a polyphonic pronunciation provided by the embodiment of the present application is provided, where the method includes:
step S201: and acquiring a character sequence to be detected corresponding to the text to be detected containing the target polyphones.
The text to be tested comprises a plurality of characters, wherein the character sequence to be tested comprises sequences corresponding to the characters respectively.
In an optional embodiment, there are various ways to obtain a character sequence to be tested corresponding to a text to be tested, and the embodiments of the present application provide, but are not limited to, the following.
The first way to obtain the character sequence to be tested: encoding the characters contained in the text to be detected to obtain the character sequence to be detected corresponding to the text to be detected.
In an alternative embodiment, the way the characters are encoded includes, but is not limited to, one-hot encoding.
In an alternative embodiment, after the characters are encoded, the dimension of the sequence corresponding to each character may be large. For example, for the one-hot encoding mentioned in fig. 1, if the dictionary includes 20,000 characters, the sequence corresponding to a character is a 1 × 20,000 vector, whose dimension is large. In order to speed up the output of the predicted pronunciation corresponding to the target polyphone by the polyphone disambiguation model, optionally, the dimension of the sequence corresponding to the character may be reduced by using a character encoding model.
In an alternative embodiment, the character encoding model may be obtained by training a neural network. Assuming that the character encoding model includes a 20,000 × 256 matrix, after the 1 × 20,000 vector corresponding to a character passes through the character encoding model, a 1 × 256 vector may be obtained, and this 1 × 256 vector may be used as the sequence corresponding to the character, so that the dimension of the sequence corresponding to the character is greatly reduced.
In summary, encoding the characters contained in the text to be detected to obtain the character sequence to be detected corresponding to the text to be detected includes either of the following methods: encoding any character contained in the text to be detected to obtain the sequence corresponding to that character, so as to obtain the character sequence to be detected; or, encoding any character contained in the text to be detected to obtain an element vector (for example, a 1 × 20,000 vector) corresponding to the character, inputting the element vector corresponding to the character into a character encoding model, and obtaining the sequence corresponding to the character (for example, a 1 × 256 vector) through the character encoding model, so as to obtain the character sequence to be detected.
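The dimensionality reduction in the second method can be sketched as follows, assuming (as in the example above) a 20,000-character dictionary and a 256-dimensional output; the matrix here is randomly initialized purely for illustration, whereas in the application it would be learned as part of a neural network:

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 20_000, 256

# Character encoding model: a 20,000 x 256 matrix (random here; learned in practice).
embedding_matrix = np.random.randn(VOCAB_SIZE, EMBED_DIM).astype(np.float32)

def encode_character(one_hot_vec):
    """Map a 1 x 20,000 element vector to a 1 x 256 sequence for the character."""
    return one_hot_vec @ embedding_matrix   # equivalent to looking up one row of the matrix
```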
In an alternative embodiment, the character encoding model may be a polyphonic disambiguation model or may be independent of the polyphonic disambiguation model.
In an optional embodiment, the character sequence to be detected includes the sequences respectively corresponding to the plurality of characters contained in the text to be detected; that is, the sequence corresponding to a character can at least represent the character itself. For example, if the text to be detected is "我在古都西安", then the sequence corresponding to "我" can at least represent the character "我".
In an alternative embodiment, the sequence corresponding to a character can also characterize the association relationship between the character and the characters associated with it. For example, the characters associated with "我" ("I") may include "你" ("you"), "他" ("he"), etc.; then the sequence corresponding to "我" can also characterize its association relationship with "你" and/or "他".
In an alternative embodiment, a sequence of character correspondences that can characterize associations between characters associated with themselves may be derived based on a character encoding model.
It can be understood that if the sequence corresponding to the character characterizes the association relationship with the associated character, the polyphonic disambiguation model can more accurately obtain the meaning of the character to be characterized.
In summary, a sequence corresponding to a character can characterize itself, and/or can characterize the association relationship between the characters associated with itself.
In an alternative embodiment, if the polyphonic disambiguation model has a function of determining the positions of the sequences corresponding to the characters in the character sequence to be tested, respectively, based on the order of inputting the character sequence to be tested, the first way of obtaining the character sequence to be tested may be adopted.
Still taking fig. 1 as an example, and taking the text to be detected as "我在古都西安", the sequences input to the polyphone disambiguation model may be, in turn, the sequence corresponding to "我", the sequence corresponding to "在", the sequence corresponding to "古", the sequence corresponding to "都", the sequence corresponding to "西", and the sequence corresponding to "安". The polyphone disambiguation model may determine the position of each sequence in the character sequence to be detected based on this input order, namely: the sequence corresponding to "我" is at the first position, the sequence corresponding to "在" is at the second position, the sequence corresponding to "古" is at the third position, and so on; details are not repeated here.
In an alternative embodiment, the polyphone disambiguation model may not have the function of determining the position of the sequence corresponding to each character in the character sequence to be tested based on the order of inputting the character sequence to be tested; or it may have that function, but the sequences corresponding to the characters are input to the first input end of the polyphone disambiguation model simultaneously. In either case, the second way of obtaining the character sequence to be tested may be adopted.
The second way to obtain the character sequence to be tested: coding characters contained in the text to be tested to obtain a character vector to be tested; based on the positions of characters in a text to be detected in the text to be detected respectively, obtaining position vectors of the characters in the text to be detected respectively; and obtaining the character sequence to be detected based on the character vector to be detected and the position vectors of the characters in the text to be detected respectively.
The process of encoding the characters contained in the text to be tested to obtain the character vector to be tested may be the same as the process of encoding the characters in the first way of obtaining the character sequence to be tested, i.e., there are at least two manners. For example, encoding the characters contained in the text to be tested to obtain the character vector to be tested corresponding to the text to be tested includes either of the following methods: encoding any character contained in the text to be tested to obtain the character vector corresponding to that character, so as to obtain the character vector to be tested; or, encoding any character contained in the text to be tested to obtain an element vector (for example, a 1 × 20,000 vector) corresponding to the character, inputting the element vector corresponding to the character into a character encoding model, and obtaining the character vector (for example, a 1 × 256 vector) corresponding to the character through the character encoding model, so as to obtain the character vector to be tested. That is, the character vector to be tested is the character sequence to be tested without the positions of the characters in the text to be tested being marked. The same points can be referred to each other and are not described in detail here.
The character vector to be detected corresponding to the text to be detected comprises character vectors corresponding to all characters contained in the text to be detected. Optionally, the character vector corresponding to one character can characterize itself, and/or can characterize the association relationship between the characters associated with itself.
In an alternative embodiment, the position vector corresponding to a character can be expressed as follows.
Still taking the text to be tested as "我在古都西安": the position vector of "我" may be [1,0,0,0,0,0,...,0], the position vector of "在" may be [0,1,0,0,0,0,...,0], the position vector of "古" may be [0,0,1,0,0,0,...,0], the position vector of "都" may be [0,0,0,1,0,0,...,0], the position vector of "西" may be [0,0,0,0,1,0,...,0], and the position vector of "安" may be [0,0,0,0,0,1,...,0].
In an alternative embodiment, other encoding methods may also be used to obtain the position vector corresponding to the character, which is not limited in this application.
In an alternative embodiment, the character vector corresponding to a character and the position vector corresponding to that character may be added to obtain the sequence corresponding to the character. For example, the sequence of "我" = [0,0,0,...,0,0,0,0,0,1,0] + [1,0,0,...,0] = [1,0,0,...,0,0,0,0,0,1,0].
It is understood that the character vector corresponding to a character and the position vector corresponding to the character can be added only when their dimensions are the same.
In an alternative embodiment, a character may be encoded to obtain an initial vector corresponding to the character (the dimension of the initial vector may be different from the dimension of the position vector); and inputting the initial vector corresponding to the character into a position coding model, and outputting the position vector corresponding to the character by the position coding model.
In an alternative embodiment, the position-coding model may be obtained by training a neural network.
In an alternative embodiment, a position vector having the same dimension as the character vector of the character may be obtained by a position coding model. For example, the position coding model includes an M × 256 matrix, where M is the number of columns of the initial vector; the position vector corresponding to the character obtained by the position coding model is then of dimension 1 × 256. In an alternative embodiment, if the character vector of the character is output through the character encoding model, the character vector of the character may be 1 × 256, so that the two have the same dimension and can be added.
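A minimal sketch of combining character vectors and position vectors as described above; both embedding tables below are hypothetical stand-ins for the trained character coding model and position coding model, and the sizes are illustrative:

```python
import numpy as np

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20_000, 512, 256

char_embedding = np.random.randn(VOCAB_SIZE, EMBED_DIM).astype(np.float32)  # character coding model
pos_embedding = np.random.randn(MAX_LEN, EMBED_DIM).astype(np.float32)      # position coding model

def to_sequence(char_indices):
    """Sequence for each character = character vector + position vector (same dimension, so they can be added)."""
    char_vecs = char_embedding[char_indices]                  # (L, 256)
    pos_vecs = pos_embedding[np.arange(len(char_indices))]    # (L, 256)
    return char_vecs + pos_vecs
```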
Fig. 3 is a structural diagram of an implementation of the second manner of obtaining the character sequence to be detected according to the embodiment of the present application.
As shown in fig. 3, in an alternative embodiment, the text to be tested may be directly input into the character coding model 31 and the position coding model 32; the character coding model 31 may output the character vector corresponding to each character, and the position coding model 32 may output the position vector corresponding to each character. The character vector corresponding to each character and its position vector are added to obtain the sequence corresponding to that character, so as to obtain the character sequence to be detected.
In an optional embodiment, element vectors corresponding to each character included in the text to be tested may be input to the character coding model 31, initial vectors corresponding to each character included in the text to be tested are input to the position coding model 32, the character coding model 31 may obtain a character vector to be tested based on the element vectors corresponding to each character included in the text to be tested, and the position coding model 32 may obtain a position vector in which each character is located in the text to be tested based on the initial vectors corresponding to each character included in the text to be tested.
Step S202: inputting the character sequence to be tested into a first input end of a pre-constructed polyphone disambiguation model; and inputting, into a second input end of the pre-constructed polyphone disambiguation model, the position of the sequence corresponding to the target polyphone in the character sequence to be detected.
In an optional embodiment, based on the position of the sequence corresponding to the target polyphone in the character sequence to be detected, the feature corresponding to the target polyphone can be obtained from the features extracted by the polyphone disambiguation model; this enables the polyphone disambiguation model to further process the feature corresponding to the target polyphone, thereby improving the accuracy with which the polyphone disambiguation model obtains the predicted pronunciation corresponding to the target polyphone.
Step S203: obtaining the predicted pronunciation corresponding to the target polyphone through the polyphone disambiguation model.
Wherein the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone.
In the method for determining the pronunciation of polyphones provided by the embodiment of the application, a character sequence to be detected corresponding to a text to be detected containing a target polyphone is first obtained; the character sequence to be detected is input into a first input end of a pre-constructed polyphone disambiguation model, and the position of the sequence corresponding to the target polyphone in the character sequence to be detected is input into a second input end of the pre-constructed polyphone disambiguation model; the predicted pronunciation corresponding to the target polyphone is then obtained through the polyphone disambiguation model. Because the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone, the pronunciation corresponding to the target polyphone can be obtained accurately.
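Putting steps S201 to S203 together, the inference flow can be sketched roughly as below; `encode_text` and the model callable are hypothetical placeholders for the encoding procedure and the pre-constructed polyphone disambiguation model described in this application, not names defined by the patent:

```python
def predict_polyphone_pronunciation(text, polyphone_position, encode_text, model):
    """S201-S203: encode the text, feed the sequence and the position to the model, read the pronunciation."""
    char_sequence = encode_text(text)            # S201: character sequence to be detected
    pronunciation = model(char_sequence,         # S202: first input end
                          polyphone_position)    # S202: second input end (position of the target polyphone)
    return pronunciation                         # S203: predicted pronunciation, e.g. "du1"

# Usage sketch: the polyphone "都" is the 4th character (index 3) of "我在古都西安".
# pronunciation = predict_polyphone_pronunciation("我在古都西安", 3, encode_text, polyphone_disambiguation_model)
```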
The implementation of the specific structure of the polyphonic disambiguation model provided in the present application is various, and the present application provides, but is not limited to, the following.
The structure of the first polyphonic disambiguation model is shown in fig. 4, and the polyphonic disambiguation model 11 includes: semantic feature extractor 41, selector 42, first classifier 43.
Wherein, the first input end of the polyphonic disambiguation model 11 is the input end of the semantic feature extractor 41, and the second input end of the polyphonic disambiguation model is the input end of the selector 42. The output of the first classifier 43 is the output of the polyphonic disambiguation model 11.
The functions of the semantic feature extractor 41, the selector 42 and the first classifier 43 are described below, respectively.
In an alternative embodiment, the semantic feature extractor 41, the selector 42, and the first classifier 43 are each a neural network having a certain layer structure (e.g., convolutional layers and/or pooling layers and/or fully-connected layers, etc.).
A semantic feature extractor 41, configured to obtain a predicted semantic feature sequence corresponding to a character sequence to be detected based on the character sequence to be detected; the predicted semantic feature sequence comprises predicted semantic features respectively corresponding to a plurality of characters contained in the text to be tested, and the predicted semantic feature corresponding to one character is used for representing the meaning of the character in the text to be tested.
The semantic feature extractor has the capability of making the predicted semantic feature sequence corresponding to the character sequence to be detected, obtained through prediction, approach the accurate semantic feature sequence corresponding to the character sequence to be detected.
In an alternative embodiment, "the predicted semantic feature corresponding to a character is used to characterize the meaning of the character in the text to be tested," in "the meaning of the character in the text to be tested" includes but is not limited to the following: the context information of the character, and/or the definition of the character in the text to be tested, and/or the initial consonant, the final sound and the tone of the character.
The context information of a character includes context information in two directions, from front to back and from back to front, and includes the sentence component of the text to be tested to which the character belongs and the part of speech of the word segment containing the character in the context text.
The sentence component of the text to be tested to which a character belongs refers to whether the character belongs to the subject, predicate, object, attributive, adverbial, etc. The part of speech of the word segment containing the character indicates, for example, whether that segment is a verb or a noun.
The definition of a character in the text to be tested is illustrated below, still taking the text to be tested as "我在古都西安" as an example: the definition of "都" in this text is "a large city; also a city known for abounding in something".
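As one possible realization (the patent does not prescribe a concrete network), the semantic feature extractor could be a bidirectional encoder that outputs one feature vector per character; a minimal PyTorch-style sketch, with all layer sizes chosen arbitrarily for illustration:

```python
import torch.nn as nn

class SemanticFeatureExtractor(nn.Module):
    """Maps a character sequence (batch, L, 256) to a predicted semantic feature sequence (batch, L, 256)."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, char_sequence):
        # One predicted semantic feature per character, each conditioned on the whole context.
        return self.encoder(char_sequence)
```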
A selector 42, configured to receive the predicted semantic features, output by the semantic feature extractor 41, corresponding to the plurality of characters contained in the text to be tested; and to acquire, based on the position of the sequence corresponding to the target polyphone in the character sequence to be detected, the predicted semantic feature corresponding to the target polyphone from the predicted semantic feature sequence output by the semantic feature extractor.
In an alternative embodiment, the position of the target polyphone in the text to be tested is the position of the sequence corresponding to the target polyphone in the character sequence to be tested.
Fig. 5 is a schematic diagram of an implementation manner of obtaining a predicted pronunciation of a target polyphone based on a polyphone disambiguation model according to an embodiment of the present application.
The selector 42 receives the predicted semantic feature sequence output by the semantic feature extractor. Taking the text to be detected "我在古都西安" as an example, the selector 42 may receive the predicted semantic features respectively corresponding to "我", "在", "古", "都", "西" and "安", and select from them the predicted semantic feature corresponding to "都".
In an alternative embodiment, the predicted semantic feature corresponding to the character is a high-dimensional vector.
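In implementation terms, the selector only needs to index this high-dimensional feature at the polyphone's position; a minimal sketch, where the tensor shapes are assumptions:

```python
import torch

def select_polyphone_feature(semantic_features, polyphone_position):
    """semantic_features: (batch, L, dim); polyphone_position: (batch,) index of the target polyphone.
    Returns the predicted semantic feature of the target polyphone, shape (batch, dim)."""
    batch_indices = torch.arange(semantic_features.size(0))
    return semantic_features[batch_indices, polyphone_position]

# E.g., for "我在古都西安" the target polyphone "都" is at index 3, so polyphone_position = torch.tensor([3]).
```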
A first classifier 43, configured to receive the predicted semantic features corresponding to the target polyphones output by the selector 42; and obtaining the predicted pronunciation of the target polyphone based on the predicted semantic features corresponding to the target polyphone.
Wherein the first classifier has the capability of making the predicted pronunciation of the target polyphone approach the actual pronunciation of the target polyphone.
In an alternative embodiment, the first classifier 43 includes one or more network layers (e.g., fully-connected layers or CNN layers, etc.); as shown in fig. 5, the last network layer of the first classifier may be a shared output layer. In the embodiment of the present application, a shared output layer means that different polyphones correspond to the same output layer.
The first classifier 43 may comprise a plurality of nodes (one circle representing one node in fig. 5), each node being used to predict the probability that the polyphone has the corresponding pronunciation.
It will be appreciated that, since a shared output layer contains nodes for the pronunciations of many different polyphones, the predicted pronunciation of the target polyphone may turn out to be a pronunciation of another polyphone; for example, "都", whose pronunciation should be "du1", might be predicted as "di4", which is a pronunciation of the polyphone "地" ("ground").
To avoid the above problem, in an alternative embodiment, the first classifier 43 at least includes a non-shared output layer, as shown in fig. 6a to 6b, which are block diagrams of two implementations of the first classifier provided in this application.
In the first classifier structure shown in fig. 6a, the first classifier includes only the unshared output layers. In the structure shown in fig. 6b, the first classifier may include a first-layer network that is a shared network layer (the shared network layer may include at least one layer, such as a fully-connected layer or a CNN layer, etc.) and a second-layer network that is the unshared output layers, as described below.
In the embodiment of the application, the number of unshared output layers equals the number of polyphones. The unshared output layer corresponding to one polyphone comprises a plurality of nodes, one node corresponding to one pronunciation of that polyphone; a node is used to predict the probability that the pronunciation of the polyphone is the pronunciation corresponding to that node.
For example, since the polyphone "都" has only two pronunciations, the unshared output layer corresponding to "都" includes two nodes, one for predicting the probability that the pronunciation of the polyphone is "du1" and one for predicting the probability that the pronunciation is "dou1".
It can be understood that the unshared output layer corresponding to a polyphone outputs the pronunciation with the maximum probability. For example, still taking the text to be tested as "我在古都西安", if one node in the unshared output layer predicts a probability of 98% that "都" is pronounced "du1" and the other node predicts a probability of 2% that "都" is pronounced "dou1", then "du1" is output for "都".
In summary, in an alternative embodiment, the last layer of the first classifier is composed of unshared output layers. If the first classifier only includes the unshared output layers (as shown in fig. 6a), the unshared output layer corresponding to a polyphone A predicts, based only on the predicted semantic feature of the target polyphone, the probabilities that the pronunciation of the target polyphone is each of the pronunciations corresponding to polyphone A. If polyphone A happens to be the target polyphone, the unshared output layer corresponding to the target polyphone predicts, based only on the predicted semantic feature of the target polyphone, the probabilities that the pronunciation of the target polyphone is each of the pronunciations corresponding to the target polyphone.
In an alternative embodiment, if the first classifier further includes at least one network layer (e.g., the first-layer network shown in fig. 6b) in addition to the last unshared output layers, the unshared output layer corresponding to a polyphone A predicts, based on the output of the previous layer, the probabilities that the pronunciation of the target polyphone is each of the pronunciations corresponding to polyphone A. If polyphone A happens to be the target polyphone, the unshared output layer corresponding to the target polyphone predicts, based on the output of the previous layer, the probabilities that the pronunciation of the target polyphone is each of the pronunciations corresponding to the target polyphone.
In summary, if the first classifier 43 can know which unshared output layer the target polyphone corresponds to, only the pronunciation predicted by the unshared output layer corresponding to the target polyphone can be output. Thus, it is possible to avoid the situation where the predicted target polyphone is a pronunciation of another polyphone.
In an optional embodiment, the selector is further configured to determine, from the unshared output layers respectively corresponding to the polyphones, a target unshared output layer corresponding to the target polyphone; the unshared output layer corresponding to the polyphone comprises a plurality of nodes, and one node corresponds to one pronunciation corresponding to the polyphone; a node is used to predict the probability that the pronunciation of the polyphonic character will be the pronunciation corresponding to the node.
In an alternative embodiment, after the first classifier 43 knows which unshared output layer the target polyphone corresponds to, optionally, only the unshared output layer corresponding to the target polyphone can predict the pronunciation of the target polyphone based on the predicted semantic features of the target polyphone. The unshared output layers corresponding to other polyphones cannot predict the pronunciation of the target polyphone based on the predicted semantic features of the target polyphone. Thus, only the unshared output layer corresponding to the target polyphone can output the predicted pronunciation.
In an alternative embodiment, each unshared output layer of the first classifier 43 may predict the pronunciation of the target polyphone based on the predicted semantic features of the target polyphone, but the first classifier 43 outputs only the predicted pronunciation of the unshared output layer corresponding to the target polyphone when outputting the result.
In summary, in the embodiment of the present application, optionally, the first classifier 43 employs a non-shared output layer, so as to avoid the situation that the target polyphone is predicted to be the pronunciation of another polyphone, and in addition, the polyphone disambiguation model can conveniently process the newly added polyphone, for example, only the non-shared output layer corresponding to the polyphone needs to be added to the first classifier.
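A minimal sketch of a first classifier with one unshared output layer (head) per polyphone; the pronunciation table below is a toy assumption, and only the head of the target polyphone is used to produce the output. Supporting a newly added polyphone then amounts to adding one more head:

```python
import torch
import torch.nn as nn

# Toy pronunciation table: each polyphone has its own small set of candidate pronunciations.
PRONUNCIATIONS = {"都": ["du1", "dou1"], "为": ["wei2", "wei4"]}

class FirstClassifier(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.shared = nn.Linear(dim, dim)  # optional shared layer (fig. 6b); omit for the fig. 6a structure
        # One unshared output layer per polyphone, with one node per pronunciation of that polyphone.
        self.heads = nn.ModuleDict({p: nn.Linear(dim, len(prons))
                                    for p, prons in PRONUNCIATIONS.items()})

    def forward(self, polyphone_feature, target_polyphone):
        hidden = torch.relu(self.shared(polyphone_feature))
        logits = self.heads[target_polyphone](hidden)      # only the target polyphone's head is consulted
        best = logits.argmax(dim=-1)                        # node with the maximum probability
        return [PRONUNCIATIONS[target_polyphone][i] for i in best.tolist()]
```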
The structure of the polyphonic disambiguation model of the second kind comprises:
a converter for converting the text to be tested into a character sequence to be tested, a semantic feature extractor 41, a selector 42 and a first classifier 43 shown in fig. 4.
The input end of the converter is used for inputting a text to be tested, and the converter inputs the character sequence to be tested output by the converter to the semantic feature extractor 41.
The converter is also used to obtain the position of the target polyphone in the character sequence to be tested and input the position of the target polyphone in the character sequence to be tested to the input of the selector 42.
In an alternative embodiment, the converter comprises: a character encoding model; in another alternative embodiment, the converter includes a character encoding model as well as a position encoding model (as shown in FIG. 3).
The structure of the third polyphone disambiguation model differs from that of the first in the following respect: the first polyphone disambiguation model comprises at least two neural network models, such as the semantic feature extractor and the first classifier, and in the process of training the first polyphone disambiguation model, at least the neural network model corresponding to the semantic feature extractor is trained to obtain the semantic feature extractor, and at least the neural network model corresponding to the first classifier is trained to obtain the first classifier; the third polyphone disambiguation model, by contrast, is obtained by training one overall neural network model.
The structure of the fourth polyphonic disambiguation model includes:
the converter mentioned in the second polyphone disambiguation model structure, and the trained overall neural network model mentioned in the third structure.
The process of training the neural network to obtain the polyphonic disambiguation model is described below.
The first way of training a neural network to obtain the polyphone disambiguation model is as follows.
If the polyphonic disambiguation model includes at least: semantic feature extractor, selector, and first classifier (e.g., structure of the first or second type of polyphonic disambiguation model), then the process of training the neural network model to obtain the semantic feature extractor includes at least:
the method comprises the following steps: training a first neural network submodel to obtain a semantic feature extractor.
The first step comprises the following steps:
step A1: a plurality of first sample character sequences respectively corresponding to the first samples are obtained.
Wherein, each first sample text at least has one character, and the character sequence of the first sample corresponding to one first sample text comprises the sequence corresponding to a plurality of characters contained in the first sample text respectively.
Assume that the complete first sample text is: "只要手机还有电和流量<sep>孤独就追不上我" ("As long as the phone still has battery and data, loneliness cannot catch up with me"). Here "<sep>" is an identifier that characterizes the interval between two sentences. In an alternative embodiment, a first sample text may comprise at least one sentence; if a first sample text includes only one sentence, it does not include the identifier characterizing the interval between two sentences.
The first sample text with at least one character missing may then be: "只要手_还有_和流_<sep>孤_就追_上我". The missing characters in this first sample text are: "机", "电", "量", "独", "不".
In an alternative embodiment, at least one character in the first sample text may be randomly masked out to obtain the first sample text with at least one character missing.
For the first sample character sequence corresponding to a first sample text, reference may be made to the description of the character sequence to be detected corresponding to the text to be detected; since they are obtained in the same way, details are not repeated here.
It will be appreciated that, since the first sample text is missing at least one character, the first sample character sequence corresponding to the first sample text does not include the sequences corresponding to the missing characters. For example, if the first sample text is "只要手_还有_和流_<sep>孤_就追_上我", the first sample character sequence corresponding to it does not include: the sequence corresponding to "机", the sequence corresponding to "电", the sequence corresponding to "量", the sequence corresponding to "独", or the sequence corresponding to "不".
It will be appreciated that although the first sample text does not include the missing character, the first sample text includes an identifier (e.g., "_") that characterizes the missing character and/or an identifier (e.g., "< sep >") that characterizes the space between the two sentences, so that the first sample character sequence includes a sequence corresponding to the identifier that characterizes the missing character and/or a sequence corresponding to the identifier that characterizes the space between the two sentences.
Still taking the first sample text "只要手_还有_和流_<sep>孤_就追_上我" as an example, the first sample character sequence corresponding to it includes: the sequence corresponding to "只", the sequence corresponding to "要", the sequence corresponding to "手", the sequence corresponding to "_", the sequence corresponding to "还", the sequence corresponding to "有", the sequence corresponding to "_", the sequence corresponding to "和", the sequence corresponding to "流", the sequence corresponding to "_", the sequence corresponding to "<sep>", the sequence corresponding to "孤", the sequence corresponding to "_", the sequence corresponding to "就", the sequence corresponding to "追", the sequence corresponding to "_", the sequence corresponding to "上", and the sequence corresponding to "我".
Step A2: respectively inputting the plurality of first sample character sequences into a first neural network set, and obtaining the predicted incomplete characters corresponding to the plurality of first sample character sequences through the first neural network set.
Wherein the first set of neural networks comprises: the first neural network submodel is used for predicting semantic features corresponding to all characters contained in the first sample character sequence based on the first sample character sequence; and the second neural network submodel is used for obtaining the predicted incomplete characters in the first sample character sequence based on the predicted semantic features which are respectively corresponding to the characters contained in the first sample character sequence and output by the first neural network submodel.
Fig. 7 is a schematic diagram of a process for training a first neural network sub-model to obtain a semantic feature extractor according to an embodiment of the present disclosure.
In fig. 7, it is assumed that the first sample text is "只要手_还有_和流_<sep>孤_就追_上我", so the first sample character sequence includes: the sequence corresponding to "只", the sequence corresponding to "要", the sequence corresponding to "手", the sequence corresponding to "_", the sequence corresponding to "还", the sequence corresponding to "有", the sequence corresponding to "_", the sequence corresponding to "和", the sequence corresponding to "流", the sequence corresponding to "_", the sequence corresponding to "<sep>", the sequence corresponding to "孤", the sequence corresponding to "_", the sequence corresponding to "就", the sequence corresponding to "追", the sequence corresponding to "_", the sequence corresponding to "上", and the sequence corresponding to "我".
In an alternative embodiment, the first sample may comprise only one statement; in an alternative embodiment, the first sample may comprise at least two statements.
In an alternative embodiment, the first neural network set may include a converter for converting a first sample text into a first sample character sequence; in this case the first sample text can be directly input into the first neural network set, and the first sample character sequence is obtained automatically. In another alternative embodiment, the first sample text is converted into the first sample character sequence before being input into the first neural network set.
The first neural network submodel 71 may obtain the predicted semantic feature corresponding to each character contained in the first sample text. As shown in fig. 7, a predicted semantic feature (represented by a cube) is obtained for every position of the first sample character sequence, including the positions of the identifiers "_" that characterize the missing characters (represented by darkened cubes) and the position of the separator "<sep>".
The second neural network submodel 72 predicts the incomplete characters based on the predicted semantic features corresponding to the characters of the first sample text. As shown in fig. 7, the second neural network submodel 72 outputs "机", "电", "量", "独", "不".
Step A3: for each first sample character sequence, training the first neural network set at least based on the predicted incomplete characters and the actual incomplete characters corresponding to the first sample character sequence, so as to obtain the semantic feature extractor corresponding to the first neural network submodel.
Optionally, the second neural network submodel corresponds to a second classifier.
It can be understood that, since the second neural network submodel 72 only predicts the incomplete characters from the predicted semantic features corresponding to the characters contained in the first sample text, and does not itself have the function of obtaining those predicted semantic features, after continuous training it can be ensured that the output of the last hidden layer of the first neural network submodel contains the predicted semantic features respectively corresponding to the characters contained in the first sample text.
If the first neural network set is only required to output the predicted incomplete characters, then a first sample text may include one or more sentences; if, in addition to outputting the predicted incomplete characters, the first neural network set is required to output predicted association information indicating whether at least two sentences contained in the first sample text are associated contexts, then the first sample text comprises at least two sentences.
In an alternative embodiment, in order to further ensure that the first neural network submodel can output the predicted semantic features corresponding to the characters included in the first sample, each of the first samples input to the first neural network set includes at least two statements.
Step A2 further includes: for each first sample text, obtaining, through the first neural network set, predicted association information that characterizes whether the at least two sentences contained in the first sample text are associated contexts.
In an alternative embodiment, the prediction association information corresponding to a first sample text indicates whether the at least two sentences contained in that first sample text are associated contexts. For example, "yes" represents that the at least two sentences contained in the first sample text are associated contexts, and "no" represents that they are not.
In an alternative embodiment, for each first sample text, the first neural network submodel may have the function of obtaining the prediction association information indicating whether the at least two sentences contained in the first sample text are associated contexts; in another alternative embodiment, the second neural network submodel may have this function, as shown in fig. 7.
In this case, when step A3 trains the first neural network set based on at least the predicted incomplete character and the actual incomplete character corresponding to the first sample character sequence, the training specifically comprises: training the first neural network set based on the prediction association information and the actual association information corresponding to the first sample character sequence, as well as the predicted incomplete character and the actual incomplete character corresponding to the first sample character sequence.
To summarize, in the process of training to obtain the semantic feature extractor, the first neural network set is trained based on the comparison result between the predicted incomplete characters and the real incomplete characters of the first sample text; alternatively, it is trained based both on that comparison and on the comparison between the prediction association information and the actual association information of the at least two sentences contained in the first sample text.
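A hedged sketch of how these two comparison results could be combined into a single training objective is given below; the function and variable names are illustrative assumptions, not terms from this application.

```python
# Hypothetical joint pretraining loss: masked-character prediction plus
# sentence-association prediction, as summarized above.
import torch.nn as nn

char_loss_fn = nn.CrossEntropyLoss()
assoc_loss_fn = nn.CrossEntropyLoss()

def pretraining_loss(char_logits, true_char_ids, masked_positions,
                     assoc_logits, true_assoc_labels):
    # Only the masked positions contribute to the incomplete-character loss.
    char_loss = char_loss_fn(char_logits[masked_positions],
                             true_char_ids[masked_positions])
    # Association loss compares the predicted and actual "associated contexts" labels.
    assoc_loss = assoc_loss_fn(assoc_logits, true_assoc_labels)
    return char_loss + assoc_loss
```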
In an optional embodiment, before the first neural network set is trained, the actual incomplete characters corresponding to each first sample text, and the actual association information of the at least two sentences contained in each first sample text, could be labeled manually.
In an alternative embodiment, before the first neural network set is trained, a computer may randomly mask out some characters of each first sample text that is to be input to the first neural network set. It can be understood that, since the computer knows which characters it has masked out, the actual incomplete characters of the first sample text can be obtained by the computer, without manually labeling them.
In an alternative embodiment, the actual association information of the at least two sentences contained in the first sample text can also be labeled by a computer. For example, if the first sample text is composed of two consecutive sentences found in an article, the computer records the actual association information of the two sentences as "yes"; if the first sample text is composed of two sentences taken from different paragraphs of an article, or from different articles, the computer records the actual association information as "no".
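The following sketch illustrates how a computer could generate such labels automatically; the mask token, the masking rate and the pairing strategy are assumptions for illustration, not values specified in this application.

```python
# Sketch of unsupervised label generation for the first sample texts:
# random character masking plus sentence-pair association labels.
import random

MASK_TOKEN = "[MASK]"

def mask_characters(chars, mask_rate=0.15):
    """Randomly mask characters; return the masked text and the true characters."""
    masked, targets = [], {}
    for i, ch in enumerate(chars):
        if random.random() < mask_rate:
            targets[i] = ch          # ground-truth "incomplete character"
            masked.append(MASK_TOKEN)
        else:
            masked.append(ch)
    return masked, targets

def make_sentence_pair(sent_a, sent_b_same_context, sent_b_random):
    """Build a first sample text together with its actual association label."""
    if random.random() < 0.5:
        return sent_a + ["<sep>"] + sent_b_same_context, "yes"   # associated contexts
    return sent_a + ["<sep>"] + sent_b_random, "no"              # unrelated sentences
```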
In conclusion, the computer can replace manual work in labeling both the actual incomplete characters corresponding to each first sample text and the actual association information of the at least two sentences contained in each first sample text, so that no manual annotation is needed and the first neural network set is trained on unsupervised data.
It can be understood that the pronunciation of a polyphone is determined by its semantics in context. At present, the context information of polyphones has to be labeled manually, and a neural network is then trained on the labeled sample texts to obtain a polyphone disambiguation model. Such ways of labeling the context information of polyphones (e.g., part-of-speech labeling) are often cumbersome and expensive. Compared with that labeling work, labeling by computer is simpler and saves time: the computer only performs simple labeling, for example marking the real incomplete characters and the actual association information, so the computer labeling is fast and accurate.
Because manually labeling the context information of polyphones is cumbersome and expensive, the sample texts obtained in this way are limited, and a polyphone disambiguation model trained on such limited sample texts can hardly learn semantic features that are effective for polyphone disambiguation. Because the labeling here is done by computer, a large number of sample texts can be obtained; training the first neural network submodel on this large number of sample texts to obtain the semantic feature extractor makes it easy to learn semantic features that are effective for polyphone disambiguation.
In the process of training the first neural network submodel to obtain the semantic feature extractor, training is not based on the context information of a particular polyphone, but on sample texts that may or may not contain polyphones. Since the trained first neural network set can accurately predict the incomplete characters of the first sample texts, the semantic feature extractor can accurately extract the semantic features of any character, and the pronunciations of all polyphones can therefore be predicted with one unified polyphone disambiguation model containing the semantic feature extractor; there is no need to train a separate model for each polyphone.
In summary, the first sample text may or may not contain polyphones.
Likewise, the training process of the first neural network submodel differs from the current practice of training on sample texts containing polyphones: there is no need to label the context information of the polyphones or the like.
After the first neural network submodel has been trained to obtain the semantic feature extractor, a third neural network submodel may be trained to obtain the first classifier.
After training is finished and the semantic feature extractor is obtained from the first neural network submodel, a second neural network set is constructed that comprises the semantic feature extractor, the selector and a third neural network submodel. The first input end of the second neural network set is the input end of the semantic feature extractor; the second input end of the second neural network set is the input end of the selector.
Step two: a second set of neural networks is trained to derive a first classifier.
In an alternative embodiment, step two includes:
Step B1: acquiring second sample character sequences respectively corresponding to a plurality of second sample texts.
Each second sample text comprises a plurality of characters; the second sample character sequence corresponding to a second sample text comprises sequences respectively corresponding to the plurality of characters contained in the second sample text.
For the second sample character sequence corresponding to a second sample text, reference may be made to the description of the character sequence to be detected corresponding to the text to be detected; the two are formed in the same way and are not described again here.
In an alternative embodiment, the second neural network set may include a converter for converting the second sample text into the second sample character sequence, so that the second sample text can be input directly into the second neural network set and the second sample character sequence is obtained automatically; in another alternative embodiment, the second sample text is converted into the second sample character sequence before it is input into the second neural network set.
Step B2: for a second sample character sequence corresponding to each second sample text, inputting the second sample character sequence to a first input end of a second neural network set; inputting the position of the sequence corresponding to the polyphone contained in the second sample text in the second sample character sequence into a second input end of the second neural network set; and acquiring the predicted pronunciation corresponding to the polyphone contained in the second sample text through the second neural network set.
Step B3: for each second sample character sequence, training the second neural network set based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence, so as to obtain the polyphone disambiguation model and thereby obtain the first classifier.
In conclusion, in the process of training the second neural network set, only the actual pronunciation of the polyphone needs to be labeled in the second sample text; the context information of the polyphone does not need to be labeled, so the labeling is simple and labeling time is saved.
In an alternative embodiment, the specific implementation manner of step B3 includes, but is not limited to, the following two types:
The first manner: for each second sample character sequence, training the third neural network submodel based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence, so as to obtain the polyphone disambiguation model, wherein the third neural network submodel corresponds to the first classifier.
Because the semantic feature extractor has already been trained, it does not need to be trained again while the second neural network set is trained; only the third neural network submodel is trained, to obtain the first classifier.
In an alternative embodiment, since the selector has a simple function, it may be implemented in software or hardware rather than being obtained by training a neural network; in another alternative embodiment, the selector may also be obtained by training a neural network.
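For example, a selector that merely picks out the predicted semantic feature at the polyphone's position can be a few lines of ordinary code; the following is a hedged sketch with assumed names:

```python
def select_polyphone_feature(semantic_feature_sequence, polyphone_position):
    """Pick the predicted semantic feature at the position of the polyphone's sequence."""
    return semantic_feature_sequence[polyphone_position]
```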
The second manner: for each second sample character sequence, training the semantic feature extractor and the third neural network submodel based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence, so as to obtain the polyphone disambiguation model.
Although the semantic feature extractor has already been trained, in the process of training the second neural network set the semantic feature extractor may be trained again together with the third neural network submodel, so as to obtain the final semantic feature extractor and the first classifier.
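The following is a minimal sketch of one fine-tuning step for the second neural network set that covers both manners above: with `train_extractor=False` only the third submodel (the classifier) is updated, and with `train_extractor=True` the semantic feature extractor is updated as well. All names, the optimizer setup and the tensor shapes are assumptions for illustration.

```python
# Hedged sketch of a fine-tuning step for the "second neural network set":
# semantic feature extractor + selector + third submodel (classifier).
import torch
import torch.nn as nn

def finetune_step(extractor, classifier, optimizer, loss_fn,
                  char_ids, polyphone_pos, true_pron_ids, train_extractor=False):
    features = extractor(char_ids)                        # (batch, seq_len, d_model)
    if not train_extractor:
        features = features.detach()                      # first manner: extractor stays fixed
    batch_idx = torch.arange(char_ids.size(0), device=char_ids.device)
    polyphone_feat = features[batch_idx, polyphone_pos]   # selector: feature at the polyphone position
    logits = classifier(polyphone_feat)                   # predicted pronunciation distribution
    loss = loss_fn(logits, true_pron_ids)                 # predicted vs. actual pronunciation
    optimizer.zero_grad()
    loss.backward()                                       # back propagation, as discussed below
    optimizer.step()
    return loss.item()
```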
In the embodiment of the present invention, the parameters in the first neural network set and the second neural network set are updated layer by layer from back to front, so the updating process is also referred to as the back propagation (backpropagation) process of the neural network.
Optionally, the first, second and third neural network submodels may each be a fully-connected neural network (e.g., an MLP, i.e., a multi-layer perceptron), a convolutional neural network (CNN) or another deep neural network such as VGG16, a neural network combining an attention mechanism with a fully-connected network, or a recurrent neural network such as a long short-term memory (LSTM) network; the embodiments of the present application do not limit this.
The second way of training a neural network to obtain the polyphone disambiguation model is described below.
It can be understood that, in the first way of training the neural network to obtain the polyphone disambiguation model, the semantic feature extractor and the first classifier are trained separately and finally combined; in an alternative embodiment, the polyphone disambiguation model may instead be obtained by training one overall neural network, i.e., the third and fourth structures of the polyphone disambiguation model.
This way of training the neural network to obtain the polyphone disambiguation model comprises the following steps:
Step one: acquiring third sample character sequences respectively corresponding to a plurality of third sample texts; wherein each third sample text comprises a plurality of characters, and the third sample character sequence corresponding to a third sample text comprises sequences respectively corresponding to the plurality of characters contained in the third sample text.
For the third sample character sequence corresponding to a third sample text, reference may be made to the description of the character sequence to be detected corresponding to the text to be detected; the two are formed in the same way and are not described again here.
Step two: inputting the third sample character sequence corresponding to each third sample text into a first input end of a third neural network model; inputting the position, in the third sample character sequence, of the sequence corresponding to the polyphone contained in the third sample text into a second input end of the third neural network model; and obtaining, through the third neural network model, the predicted pronunciation corresponding to the polyphone contained in the third sample text.
Step three: for each third sample character sequence, training the third neural network model based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the third sample character sequence, so as to obtain the polyphone disambiguation model.
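A hedged sketch of such an end-to-end training loop is shown below; the data loader format, the model interface and the hyperparameters are assumptions for illustration, not details taken from this application.

```python
# Illustrative end-to-end training loop for the second way: one overall model
# is trained directly on pronunciation labels, with no separate pretraining stage.
import torch
import torch.nn as nn

def train_end_to_end(model, data_loader, num_epochs=3, lr=1e-4):
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for char_ids, polyphone_pos, true_pron_ids in data_loader:
            logits = model(char_ids, polyphone_pos)   # predicted pronunciation logits
            loss = loss_fn(logits, true_pron_ids)     # predicted vs. actual pronunciation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```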
The method has been described in detail in the embodiments disclosed above. Since the method of the present application can be implemented by various types of apparatuses, the present application also discloses an apparatus; specific embodiments are described in detail below.
Fig. 8 is a block diagram of an implementation manner of an apparatus for determining the pronunciation of a polyphone provided in an embodiment of the present application. The apparatus includes:
a first obtaining module 81, configured to obtain a character sequence to be detected corresponding to a text to be detected containing a target polyphone; the text to be detected comprises a plurality of characters, and the character sequence to be detected comprises sequences respectively corresponding to the plurality of characters;
a second obtaining module 82, configured to input the character sequence to be detected into a first input end of a pre-constructed polyphone disambiguation model, and to input the position, in the character sequence to be detected, of the sequence corresponding to the target polyphone into a second input end of the pre-constructed polyphone disambiguation model;
a third obtaining module 83, configured to obtain, through the polyphone disambiguation model, a predicted pronunciation corresponding to the target polyphone;
wherein the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone tend towards the actual pronunciation of the target polyphone.
Optionally, the polyphone disambiguation model includes a semantic feature extractor, a selector, and a first classifier, where the first input end is an input end of the semantic feature extractor, and the second input end is an input end of the selector; the third acquisition module includes:
the first acquisition unit is used for acquiring a predicted semantic feature sequence corresponding to the character sequence to be detected through the semantic feature extractor; the predicted semantic feature sequence comprises predicted semantic features corresponding to the characters respectively, and the predicted semantic feature corresponding to one character is used for representing the meaning of the character in the text to be tested;
the semantic feature extractor has the capability of making the predicted semantic feature sequence corresponding to the character sequence to be detected tend towards the accurate semantic feature sequence corresponding to the character sequence to be detected;
the second acquisition unit is used for acquiring the predicted semantic features corresponding to the target polyphones from the predicted semantic feature sequence output by the semantic feature extractor based on the position of the sequence corresponding to the target polyphones in the character sequence to be detected through the selector, and inputting the predicted semantic features to the first classifier;
a third obtaining unit, configured to obtain, by the first classifier, a predicted pronunciation of the target polyphone based on a predicted semantic feature corresponding to the target polyphone;
wherein the first classifier has the capability of making the predicted pronunciation of the target polyphone tend towards the actual pronunciation of the target polyphone.
Optionally, the first classifier at least includes an unshared output layer corresponding to each polyphone, and the apparatus further includes:
the first determining module is used for determining a target unshared output layer corresponding to the target polyphone from unshared output layers corresponding to the polyphones respectively through the selector; the unshared output layer corresponding to the polyphone comprises a plurality of nodes, and one node corresponds to one pronunciation corresponding to the polyphone; a node for predicting the probability that the pronunciation of the polyphonic character is the pronunciation corresponding to the node;
the third obtaining unit includes:
a first obtaining subunit, configured to obtain, at least through the target unshared output layer in the first classifier, a predicted pronunciation of the target polyphone.
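The following sketch illustrates a first classifier built from one unshared output layer per polyphone, where each node of a layer corresponds to one candidate pronunciation of that polyphone and the selector picks the layer for the target polyphone. The class and variable names, and the example pronunciation table, are assumptions for illustration only.

```python
# Hedged sketch of per-polyphone unshared output layers.
import torch.nn as nn

class FirstClassifier(nn.Module):
    def __init__(self, d_model: int, pronunciations_per_polyphone: dict):
        super().__init__()
        # Hypothetical table, e.g. {"行": ["xing2", "hang2"], "乐": ["le4", "yue4"]}
        self.heads = nn.ModuleDict({
            polyphone: nn.Linear(d_model, len(prons))     # one node per candidate pronunciation
            for polyphone, prons in pronunciations_per_polyphone.items()
        })
        self.pron_names = pronunciations_per_polyphone

    def forward(self, polyphone: str, semantic_feature):
        # semantic_feature: (batch, d_model), the feature picked out by the selector
        logits = self.heads[polyphone](semantic_feature)  # unshared output layer for this polyphone
        best = logits.argmax(dim=-1)                      # node with the highest predicted probability
        return [self.pron_names[polyphone][i] for i in best.tolist()]
```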
Optionally, the first obtaining module includes:
a fourth obtaining unit, configured to obtain a character vector to be detected based on the text to be detected, where the character vector to be detected includes character vectors corresponding to the multiple characters, respectively;
a fifth obtaining unit, configured to obtain position vectors that the multiple characters are located in the text to be tested, based on positions that the multiple characters are located in the text to be tested, respectively;
and the sixth obtaining unit is used for obtaining the character sequence to be detected based on the character vector to be detected and the position vectors of the characters in the text to be detected respectively.
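A minimal sketch of how the character sequence to be detected could be assembled from character vectors and position vectors is given below; the lookup tables and the element-wise addition of the two vectors are assumptions for illustration.

```python
# Sketch: each character is mapped to a character vector, a position vector is
# derived from its position, and the two are combined to form the character
# sequence to be detected.
def build_test_sequence(text, char_vectors, position_vectors):
    sequence = []
    for position, character in enumerate(text):
        char_vec = char_vectors[character]        # character vector for this character
        pos_vec = position_vectors[position]      # position vector for this position
        sequence.append([c + p for c, p in zip(char_vec, pos_vec)])
    return sequence
```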
Optionally, the method further includes:
the fourth obtaining module is used for obtaining first sample character sequences respectively corresponding to a plurality of first sample texts; wherein each first sample text lacks at least one character, and the first sample character sequence corresponding to a first sample text comprises sequences respectively corresponding to the plurality of characters contained in the first sample text;
the fifth obtaining module is used for respectively inputting the plurality of first sample character sequences into a first neural network set, and obtaining predicted incomplete characters corresponding to the plurality of first sample character sequences through the first neural network set;
wherein the first set of neural networks comprises: the first neural network submodel is used for predicting semantic features corresponding to all characters contained in the first sample character sequence based on the first sample character sequence; the second neural network submodel is used for obtaining a predicted incomplete character in the first sample character sequence based on the predicted semantic features corresponding to the characters contained in the first sample character sequence output by the first neural network submodel;
and the first training module is used for training the first neural network set at least based on the predicted incomplete character and the real incomplete character corresponding to each first sample character sequence so as to obtain the semantic feature extractor corresponding to the first neural network submodel.
Optionally, each first sample text includes at least two sentences; the apparatus further comprises:
a sixth obtaining module, configured to obtain, for each first sample text, prediction association information that represents whether at least two sentences contained in the first sample text are associated contexts through the first neural network set;
the first training module comprising:
and the first training unit is used for training the first neural network set based on the prediction association information and the actual association information corresponding to the first sample character sequence, and the predicted incomplete character and the actual incomplete character corresponding to the first sample character sequence.
Optionally, the method further includes:
a seventh obtaining module, configured to obtain second sample character sequences corresponding to the multiple second sample texts, respectively; wherein each second sample text comprises a plurality of characters; the second sample character sequence corresponding to a second sample text comprises sequences corresponding to a plurality of characters contained in the second sample text respectively;
the eighth obtaining module is configured to input the second sample character sequence to the first input end of the second neural network set for the second sample character sequence corresponding to each second sample text; inputting the position of the sequence corresponding to the polyphone contained in the second sample text in the second sample character sequence into a second input end of the second neural network set; obtaining a predicted pronunciation corresponding to the polyphone contained in the second sample text through the second neural network set;
and the second training module is used for training the second neural network set based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to each second sample character sequence so as to obtain the polyphone disambiguation model.
Optionally, the second set of neural networks includes: the semantic feature extractor, the selector and a third neural network submodel; wherein the first input end of the second neural network set is the input end of the semantic feature extractor; a second input of the second set of neural networks is an input of the selector; the second training module includes:
the second training unit is used for training the third neural network submodel based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence to obtain the polyphone disambiguation model, wherein the third neural network submodel corresponds to the first classifier;
or,
and the third training unit is used for training the semantic feature extractor and the third neural network submodel based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence so as to obtain the polyphone disambiguation model.
Fig. 9 is a structural diagram of an implementation manner of an electronic device provided in an embodiment of the present application. The electronic device includes:
a memory 91 for storing a program;
a processor 92 configured to execute the program, the program being specifically configured to:
acquiring a character sequence to be detected corresponding to a text to be detected containing a target polyphone; the text to be tested comprises a plurality of characters, wherein the character sequence to be tested comprises sequences corresponding to the characters respectively;
inputting the character sequence to be detected into a first input end of a pre-constructed polyphone disambiguation model; inputting the position, in the character sequence to be detected, of the sequence corresponding to the target polyphone into a second input end of the pre-constructed polyphone disambiguation model;
obtaining a predicted pronunciation corresponding to the target polyphone through the polyphone disambiguation model;
wherein the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone tend towards the actual pronunciation of the target polyphone.
The processor 92 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The electronic device may further comprise a communication interface 93 and a communication bus 94, wherein the memory 91, the processor 92 and the communication interface 93 communicate with each other via the communication bus 94.
Alternatively, the communication interface may be an interface of a communication module, such as an interface of a GSM module.
Embodiments of the present invention further provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps included in any of the above-mentioned embodiments of the method for determining pronunciations of polyphones.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device or system type embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of determining the pronunciation of a polyphone, comprising:
acquiring a character sequence to be detected corresponding to a text to be detected containing a target polyphone; the text to be tested comprises a plurality of characters, wherein the character sequence to be tested comprises sequences corresponding to the characters respectively;
inputting the character sequence to be detected into a first input end of a pre-constructed polyphone disambiguation model; inputting the position, in the character sequence to be detected, of the sequence corresponding to the target polyphone into a second input end of the pre-constructed polyphone disambiguation model; the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone approach the actual pronunciation of the target polyphone; the polyphone disambiguation model comprises a semantic feature extractor, a selector and a first classifier, wherein the first input end is the input end of the semantic feature extractor, and the second input end is the input end of the selector;
obtaining a predicted semantic feature sequence corresponding to the character sequence to be detected through the semantic feature extractor; the predicted semantic feature sequence comprises predicted semantic features corresponding to the characters respectively, and the predicted semantic feature corresponding to one character is used for representing the meaning of the character in the text to be tested;
acquiring a predicted semantic feature corresponding to the target polyphone from a predicted semantic feature sequence output by the semantic feature extractor based on the position of the sequence corresponding to the target polyphone at the character sequence to be detected by the selector, and inputting the predicted semantic feature to the first classifier;
and obtaining the predicted pronunciation of the target polyphone by the first classifier based on the predicted semantic features corresponding to the target polyphone.
2. The method of claim 1, wherein the first classifier comprises at least one unshared output layer corresponding to each polyphone; further comprising:
determining a target unshared output layer corresponding to the target polyphone from unshared output layers corresponding to the polyphones respectively through the selector; the unshared output layer corresponding to the polyphone comprises a plurality of nodes, and one node corresponds to one pronunciation corresponding to the polyphone; a node for predicting the probability that the pronunciation of the polyphonic character is the pronunciation corresponding to the node;
the obtaining, by the first classifier, a predicted pronunciation of the target polyphone based on the predicted semantic features corresponding to the target polyphone includes:
obtaining a predicted pronunciation of the target polyphone at least through the target unshared output layer in the first classifier.
3. The method of claim 1, wherein the obtaining the test character sequence corresponding to the test text containing the target polyphone comprises:
obtaining a character vector to be detected based on the text to be detected, wherein the character vector to be detected comprises character vectors corresponding to the characters respectively;
obtaining position vectors of the characters in the text to be detected respectively based on the positions of the characters in the text to be detected respectively;
and obtaining the character sequence to be detected based on the character vector to be detected and the position vectors of the characters in the text to be detected respectively.
4. The method of determining polyphonic pronunciations as in claim 1, further comprising:
obtaining a plurality of first sample character sequences respectively corresponding to the first samples; wherein each first sample text at least lacks one character, and the character sequence of the first sample corresponding to one first sample text comprises the sequences corresponding to a plurality of characters contained in the first sample text respectively;
respectively inputting the plurality of first sample character sequences into a first neural network set, and obtaining predicted incomplete characters corresponding to the plurality of first sample character sequences through the first neural network set;
wherein the first set of neural networks comprises: the first neural network submodel is used for predicting semantic features corresponding to all characters contained in the first sample character sequence based on the first sample character sequence; the second neural network submodel is used for obtaining a predicted incomplete character in the first sample character sequence based on the predicted semantic features corresponding to the characters contained in the first sample character sequence output by the first neural network submodel;
and for each first sample character sequence, training the first neural network set at least based on the predicted incomplete character and the real incomplete character corresponding to the first sample character sequence, so as to obtain the semantic feature extractor corresponding to the first neural network submodel.
5. The method of claim 4, wherein each first sample text comprises at least two sentences; further comprising:
for each first sample text, obtaining prediction association information which characterizes whether at least two sentences contained in the first sample text are associated contexts through the first neural network set;
the training of the first neural network set based on at least the predicted incomplete character and the real incomplete character corresponding to the first sample character sequence includes:
and training the first neural network set based on the predicted associated information and the actual associated information corresponding to the first sample character sequence, and the predicted incomplete character and the actual incomplete character corresponding to the first sample character sequence.
6. The method for determining polyphonic pronunciations according to claim 1, 4 or 5, further comprising:
acquiring second sample character sequences respectively corresponding to the plurality of second sample texts; wherein each second sample text comprises a plurality of characters; the second sample character sequence corresponding to a second sample text comprises sequences corresponding to a plurality of characters contained in the second sample text respectively;
for a second sample character sequence corresponding to each second sample text, inputting the second sample character sequence to a first input end of a second neural network set; inputting the position of the sequence corresponding to the polyphone contained in the second sample text in the second sample character sequence into a second input end of the second neural network set; obtaining a predicted pronunciation corresponding to the polyphone contained in the second sample text through the second neural network set;
and for each second sample character sequence, training the second neural network set based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence, so as to obtain the polyphone disambiguation model.
7. The method of determining polyphonic pronunciations of claim 6, wherein the second set of neural networks comprises: the semantic feature extractor, the selector and a third neural network submodel; wherein the first input end of the second neural network set is the input end of the semantic feature extractor; a second input of the second set of neural networks is an input of the selector;
the training of the second neural network set based on the predicted pronunciation and the actual pronunciation of the polyphonic character corresponding to the second sample character sequence to obtain the polyphonic character disambiguation model includes any one of the following:
training the third neural network submodel based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence to obtain the polyphone disambiguation model, wherein the third neural network submodel corresponds to the first classifier;
or,
and training the semantic feature extractor and the third neural network sub-model based on the predicted pronunciation and the actual pronunciation of the polyphone corresponding to the second sample character sequence to obtain the polyphone disambiguation model.
8. An apparatus for determining the pronunciation of a polyphone, comprising:
the first acquisition module is used for acquiring a character sequence to be detected corresponding to a text to be detected containing a target polyphone; the text to be tested comprises a plurality of characters, wherein the character sequence to be tested comprises sequences corresponding to the characters respectively;
the second acquisition module is used for inputting the character sequence to be detected into a first input end of a pre-constructed polyphone disambiguation model, and for inputting the position, in the character sequence to be detected, of the sequence corresponding to the target polyphone into a second input end of the pre-constructed polyphone disambiguation model;
the third acquisition module is used for acquiring the predicted pronunciation corresponding to the target polyphone through the polyphone disambiguation model;
wherein the polyphone disambiguation model has the capability of making the predicted pronunciation corresponding to the target polyphone tend towards the actual pronunciation of the target polyphone;
the polyphone disambiguation model comprises a semantic feature extractor, a selector and a first classifier, wherein the first input end is the input end of the semantic feature extractor, and the second input end is the input end of the selector; the third acquisition module includes:
the first acquisition unit is used for acquiring a predicted semantic feature sequence corresponding to the character sequence to be detected through the semantic feature extractor; the predicted semantic feature sequence comprises predicted semantic features corresponding to the characters respectively, and the predicted semantic feature corresponding to one character is used for representing the meaning of the character in the text to be tested;
the second acquisition unit is used for acquiring the predicted semantic features corresponding to the target polyphones from the predicted semantic feature sequence output by the semantic feature extractor based on the position of the sequence corresponding to the target polyphones in the character sequence to be detected through the selector, and inputting the predicted semantic features to the first classifier;
and the third acquisition unit is used for acquiring the predicted pronunciation of the target polyphone based on the predicted semantic features corresponding to the target polyphone through the first classifier.
9. An electronic device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the method of determining a polyphonic pronunciation as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of determining a polyphonic pronunciation according to any one of claims 1 to 7.
CN201910555855.3A 2019-06-25 2019-06-25 Method and device for determining polyphone pronunciation Active CN110277085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910555855.3A CN110277085B (en) 2019-06-25 2019-06-25 Method and device for determining polyphone pronunciation

Publications (2)

Publication Number Publication Date
CN110277085A CN110277085A (en) 2019-09-24
CN110277085B true CN110277085B (en) 2021-08-24

Family

ID=67962501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910555855.3A Active CN110277085B (en) 2019-06-25 2019-06-25 Method and device for determining polyphone pronunciation

Country Status (1)

Country Link
CN (1) CN110277085B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767212B (en) * 2019-10-24 2022-04-26 百度在线网络技术(北京)有限公司 Voice processing method and device and electronic equipment
CN110807331B (en) * 2019-10-24 2022-07-01 百度在线网络技术(北京)有限公司 Polyphone pronunciation prediction method and device and electronic equipment
CN111061868B (en) * 2019-11-05 2023-05-23 百度在线网络技术(北京)有限公司 Reading method prediction model acquisition and reading method prediction method, device and storage medium
CN110909879A (en) * 2019-12-09 2020-03-24 北京爱数智慧科技有限公司 Auto-regressive neural network disambiguation model, training and using method, device and system
CN113302683B (en) * 2019-12-24 2023-08-04 深圳市优必选科技股份有限公司 Multi-tone word prediction method, disambiguation method, device, apparatus, and computer-readable storage medium
CN111144110B (en) * 2019-12-27 2024-06-04 科大讯飞股份有限公司 Pinyin labeling method, device, server and storage medium
CN111611810B (en) * 2020-05-29 2023-08-04 河北数云堂智能科技有限公司 Multi-tone word pronunciation disambiguation device and method
CN111798834B (en) * 2020-07-03 2022-03-15 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
CN111599340A (en) * 2020-07-27 2020-08-28 南京硅基智能科技有限公司 Polyphone pronunciation prediction method and device and computer readable storage medium
CN111737957B (en) * 2020-08-25 2021-06-01 北京世纪好未来教育科技有限公司 Chinese character pinyin conversion method and device, electronic equipment and storage medium
CN112016325A (en) * 2020-09-04 2020-12-01 北京声智科技有限公司 Speech synthesis method and electronic equipment
CN111967260A (en) * 2020-10-20 2020-11-20 北京金山数字娱乐科技有限公司 Polyphone processing method and device and model training method and device
CN112989821B (en) * 2021-04-13 2021-08-13 北京世纪好未来教育科技有限公司 Phonetic notation method for polyphone and computer storage medium
CN113380223B (en) * 2021-05-26 2022-08-09 标贝(北京)科技有限公司 Method, device, system and storage medium for disambiguating polyphone
CN113486672A (en) * 2021-07-27 2021-10-08 腾讯音乐娱乐科技(深圳)有限公司 Method for disambiguating polyphone, electronic device and computer readable storage medium
CN114999450A (en) * 2022-05-24 2022-09-02 网易有道信息技术(北京)有限公司 Homomorphic and heteromorphic word recognition method and device, electronic equipment and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9076175B2 (en) * 2005-09-14 2015-07-07 Millennial Media, Inc. Mobile comparison shopping
CN105336322B (en) * 2015-09-30 2017-05-10 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN107402933A (en) * 2016-05-20 2017-11-28 富士通株式会社 Entity polyphone disambiguation method and entity polyphone disambiguation equipment
CN107515850A (en) * 2016-06-15 2017-12-26 阿里巴巴集团控股有限公司 Determine the methods, devices and systems of polyphone pronunciation
CN106710585A (en) * 2016-12-22 2017-05-24 上海语知义信息技术有限公司 Method and system for broadcasting polyphonic characters in voice interaction process
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Polyphonic Word Disambiguation with Machine Learning Approaches;Jinke Liu et al.;《 2010 Fourth International Conference on Genetic and Evolutionary Computing》;20110217;第244-245页 *

Also Published As

Publication number Publication date
CN110277085A (en) 2019-09-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant