CN114550691A - Multi-tone word disambiguation method and device, electronic equipment and readable storage medium - Google Patents

Multi-tone word disambiguation method and device, electronic equipment and readable storage medium

Info

Publication number
CN114550691A
CN114550691A
Authority
CN
China
Prior art keywords
target
polyphonic
characters
disambiguation
character
Prior art date
Legal status
Pending
Application number
CN202210086347.7A
Other languages
Chinese (zh)
Inventor
李睿端
李健
武卫东
陈明
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202210086347.7A
Publication of CN114550691A
Status: Pending

Classifications

    • G10L13/02 Methods for producing synthetic speech; speech synthesisers
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

The invention relates to a polyphone disambiguation method and device, an electronic device, and a readable storage medium, in the technical field of speech processing. The method comprises the following steps: dividing a text to be processed into a plurality of characters, the plurality of characters comprising target polyphonic characters and non-target polyphonic characters; acquiring, for each character, a first identifier corresponding to the character; and inputting the characters and the first identifiers corresponding to the characters into a pre-generated target polyphone disambiguation model, and determining the pronunciation of the target polyphonic character according to the output of the target polyphone disambiguation model. Applied to polyphone disambiguation in a speech synthesis system, the method uses the target polyphone disambiguation model to disambiguate the target polyphonic characters in the text to be processed, which increases the prediction speed of polyphone disambiguation in such scenarios and thereby improves the disambiguation effect.

Description

Multi-tone word disambiguation method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and an apparatus for disambiguating polyphonic characters, an electronic device, and a readable storage medium.
Background
Text To Speech (TTS) is a technology that uses a computer to convert arbitrary text into speech. The input text must first be converted into its corresponding pronunciations, and whether polyphonic characters are converted correctly greatly affects how well users understand the synthesized speech; a wrong conversion markedly degrades the synthesis. Polyphone disambiguation is therefore an important task in a speech synthesis system.
Existing polyphone disambiguation methods are based on decision trees, the maximum entropy algorithm, or expert knowledge (large sets of rules). A decision-tree method presets a series of questions and assigns a final probability value to every candidate pronunciation according to those questions and preset probabilities. The maximum entropy model is a classifier designed on the maximum entropy principle and demands a large amount of data, yet a large sample set also inflates the computation, which limits its applicability. Methods based on expert knowledge are laborious to maintain, and the rules easily conflict or interact with one another. The prior-art methods therefore suffer from a poor polyphone disambiguation effect, and a polyphone disambiguation method is needed to solve this problem.
Disclosure of Invention
To overcome the problems in the related art, the present application provides a polyphone disambiguation method and apparatus, an electronic device, and a readable storage medium.
According to a first aspect of embodiments herein, there is provided a method of polyphonic disambiguation, the method comprising:
dividing a text to be processed into a plurality of characters, wherein the plurality of characters comprise target polyphone characters and non-target polyphone characters;
for each character, acquiring a first identifier corresponding to the character;
inputting the characters and the first identifiers corresponding to the characters into a pre-generated target polyphone disambiguation model, and determining the pronunciation of the target polyphonic character according to the output of the target polyphone disambiguation model.
Optionally, before the step of dividing the text to be processed into a plurality of characters, where the plurality of characters include a target polyphonic character and a non-target polyphonic character, the method further includes:
generating a target polyphone disambiguation model in advance;
and acquiring a text to be processed.
Optionally, pre-generating the target polyphone disambiguation model comprises:
acquiring a training sample, wherein the training sample comprises a plurality of sample texts and labeling information of the target polyphonic characters in the plurality of sample texts, the labeling information indicating the pronunciation of the target polyphonic character in each sample text and a second identifier corresponding to that pronunciation;
and training a preset initial model by taking the plurality of sample texts as input and the labeling information of the target polyphonic characters in the plurality of sample texts as the output target, and determining the trained model as the target polyphone disambiguation model.
Optionally, determining the pronunciation of the target polyphonic character from the output of the target polyphone disambiguation model comprises:
obtaining the labeling information of the target polyphonic character according to the output of the target polyphone disambiguation model;
and determining the pronunciation of the target polyphonic character according to the labeling information of the target polyphonic character.
According to a second aspect of embodiments of the present application, there is provided an apparatus for polyphonic disambiguation, the apparatus comprising:
the device comprises a dividing module, a processing module and a processing module, wherein the dividing module is used for dividing a text to be processed into a plurality of characters, and the plurality of characters comprise target polyphonic characters and non-target polyphonic characters;
the first identifier acquisition module is used for acquiring, for each character, a first identifier corresponding to the character;
and the polyphone disambiguation module is used for inputting the characters and the first identifiers corresponding to the characters into a pre-generated target polyphone disambiguation model and determining the pronunciation of the target polyphonic characters according to the output of the target polyphone disambiguation model.
Optionally, the apparatus further comprises:
the polyphone disambiguation model training module is used for generating a target polyphone disambiguation model in advance;
and the to-be-processed text acquisition module is used for acquiring the to-be-processed text.
Optionally, the polyphonic disambiguation model training module comprises:
the training sample acquisition unit is used for acquiring a training sample, wherein the training sample comprises a plurality of sample texts and marking information of target polyphonic characters in the plurality of sample texts, and the marking information is used for indicating pronunciation of the target polyphonic characters in the sample texts and second identification corresponding to the pronunciation of the target polyphonic characters in the sample texts;
and the polyphone disambiguation model training unit is used for training a preset initial model by taking the plurality of sample texts as input and the labeling information of the target polyphonic characters in the plurality of sample texts as the output target, and determining the trained model as the target polyphone disambiguation model.
Optionally, the polyphonic disambiguation module further comprises:
the label information acquisition unit is used for acquiring label information of the target polyphonic characters according to the output of the target polyphonic disambiguation model;
and the pronunciation acquisition unit of the target polyphonic character is used for determining the pronunciation of the target polyphonic character according to the labeling information of the target polyphonic character.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of polyphonic disambiguation.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of polyphonic disambiguation.
The technical solution provided by the embodiments of the present application can have the following beneficial effects:
the method comprises the steps of dividing a text to be processed into a plurality of characters, wherein the plurality of characters comprise target polyphone characters and non-target polyphone characters; aiming at each character, acquiring a first identifier corresponding to the character; inputting the characters and the first identifications corresponding to the characters into a pre-generated target polyphonic character disambiguation model, and determining the pronunciation of the target polyphonic character according to the output of the target polyphonic character disambiguation model. According to the technical scheme provided by the embodiment of the application, the target polyphone disambiguation model is utilized to perform polyphone disambiguation on the target polyphone character in the text to be processed, so that the speed of predicting the polyphone disambiguation is increased, and further, the effect of the polyphone disambiguation is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of polyphonic disambiguation in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating another method of polyphonic disambiguation in accordance with an exemplary embodiment;
FIG. 3 is a flowchart of step 201 of the method of polyphonic disambiguation shown in FIG. 2, according to an exemplary embodiment;
FIG. 4 is a flowchart of step 103 of the method of polyphonic disambiguation shown in FIG. 1, according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating an apparatus for polyphonic disambiguation in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating another apparatus for polyphonic disambiguation in accordance with an exemplary embodiment;
FIG. 7 is a block diagram of the polyphonic disambiguation model training module 601 of the apparatus shown in FIG. 6, according to an exemplary embodiment;
FIG. 8 is a block diagram of the polyphonic disambiguation module 503 of the apparatus shown in FIG. 5, according to an exemplary embodiment;
FIG. 9 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
It should be noted that, in the embodiment of the present application, the target polyphone disambiguation model is preferably the fully convolutional model Unet. Unet is adopted because it can replace the recurrent neural network (RNN) model used in the prior art while achieving a better polyphone disambiguation effect. Unet is most commonly used in the field of image segmentation; its name derives from its symmetrical U-shaped structure, with convolution layers on the left and upsampling layers on the right. The feature map produced by each convolution layer is concatenated to the corresponding upsampling layer, which ensures that the features obtained at every level are used in subsequent computation. This gives Unet the ability to combine features across layers, improving the model's grasp of the overall features and thereby the disambiguation effect. Although the embodiments of the present application use Unet as the target polyphone disambiguation model, other fully convolutional models, including but not limited to IDCNN, may also be applied to the technical solution of the present application.
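To make the structure concrete, the following is a minimal sketch, not the patent's disclosed network, of a 1-D Unet-style tagger over character sequences; the application discloses no layer sizes, so the class name Unet1D, the embedding width, channel counts, and kernel sizes are all illustrative assumptions.
```python
# Illustrative sketch only: the patent does not disclose the network's
# dimensions, so every size and name below is an assumption.
import torch
import torch.nn as nn

class Unet1D(nn.Module):
    """1-D Unet-style tagger: a convolutional encoder on the left, an
    upsampling decoder on the right, with the encoder feature map
    concatenated (skip connection) into the decoder."""
    def __init__(self, vocab_size: int, num_labels: int,
                 emb: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.enc1 = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
        self.down = nn.MaxPool1d(2)                 # halve the sequence length
        self.enc2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = nn.Conv1d(hidden * 2, hidden, kernel_size=3, padding=1)
        self.out = nn.Conv1d(hidden, num_labels, kernel_size=1)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len); seq_len assumed even (pad if needed)
        x = self.embed(char_ids).transpose(1, 2)    # (batch, emb, seq_len)
        e1 = torch.relu(self.enc1(x))               # skip-connection source
        e2 = torch.relu(self.enc2(self.down(e1)))
        d1 = torch.relu(self.dec1(torch.cat([self.up(e2), e1], dim=1)))
        return self.out(d1).transpose(1, 2)         # (batch, seq_len, num_labels)
```
Because every position is computed by convolutions rather than by a recurrent state, all characters in a sequence are predicted in parallel, which is the speed property the paragraph above attributes to Unet.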
FIG. 1 is a flowchart illustrating a method of polyphonic disambiguation according to an exemplary embodiment; as shown in FIG. 1, the method includes the following steps.
Step 101, dividing a text to be processed into a plurality of characters, wherein the plurality of characters comprise target polyphonic characters and non-target polyphonic characters.
It should be noted that, in the embodiment of the present application, viewed as a whole, polyphone disambiguation is a sequence-to-sequence labeling task; viewed character by character, it is a classification task performed for each character. For example, the pronunciation options of the polyphonic character 行 ("line") are xing2 and hang2, where 2 means the character is pronounced in the yangping (second) tone; the pronunciation options of 转 ("turn") are zhuan2, zhuan3, and zhuan4, where 3 denotes the shangsheng (third) tone and 4 the qusheng (fourth) tone.
Therefore, the text to be processed may first be divided into single characters. Specifically, for example, taking 登陆人行征信系统 ("log in to the People's Bank credit-reporting system") as the text to be processed and dividing it into single characters yields 登, 陆, 人, 行, 征, 信, 系, 统.
Further, in the embodiment of the present application, fig. 2 is a flowchart illustrating another method for disambiguating polyphonic words according to an exemplary embodiment, and as shown in fig. 2, the following steps may be further included before step 101.
Step 201, generating a target polyphone disambiguation model in advance.
Further, in the embodiment of the present application, fig. 3 is a flowchart of step 201 in the flowchart of another polyphonic disambiguation method shown in fig. 2 according to an exemplary embodiment, and as shown in fig. 3, step 201 may further include the following steps.
Step 301, a training sample is acquired, where the training sample includes a plurality of sample texts and labeling information of the target polyphonic characters in the plurality of sample texts, the labeling information indicating the pronunciation of the target polyphonic character in each sample text and a second identifier corresponding to that pronunciation.
Step 302, a preset initial model is trained by taking the plurality of sample texts as input and the labeling information of the target polyphonic characters in the plurality of sample texts as the output target, and the trained model is determined as the target polyphone disambiguation model.
It should be noted that, in the embodiment of the present application, a preset initial model may be trained on the training samples, and the trained model is the target polyphone disambiguation model, which has the fully convolutional Unet structure. Specifically, the plurality of sample texts contained in the acquired training samples are used as the input of the initial model, and the labeling information of the target polyphonic characters in those sample texts is used as the output target. For example, for the pronunciation hang2 of the polyphonic character 行, sentences containing 行 are collected, and these sentences should cover the character's various pronunciations in natural language, such as hang2 and xing2. In a text in which 行 is read hang2, every character other than the target polyphone is given the label 0, while the target character 行 obtains its labeling information from a preset polyphone list, for example hang2_4, where 2 indicates the yangping (second) tone and 4 is the identifier of this pronunciation of the character in the polyphone list. Taking, for example, ten thousand sentences that contain the polyphonic character 行 read as hang2 as the input of the initial model and the corresponding labeling information hang2_4 as the output target, the initial model is trained, and the trained model is taken as the target polyphone disambiguation model. Neither the form of the labeling information nor the specific numerical values of the labels are specifically limited in the present application.
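As an illustration of the labeling scheme just described, the sketch below builds per-character labels; POLYPHONE_LIST and its numeric identifiers are hypothetical stand-ins, since the patent does not publish its preset polyphone list.
```python
# Hypothetical polyphone list: (character, pronunciation) -> identifier.
# The real list and its identifiers are not disclosed in the patent.
POLYPHONE_LIST = {("行", "hang2"): 4, ("行", "xing2"): 8}

def label_sample(chars, target_char, target_pinyin):
    """Give every non-target character the label '0' and the target
    polyphone the label '<pinyin>_<identifier>' from the polyphone list."""
    ident = POLYPHONE_LIST[(target_char, target_pinyin)]
    return [f"{target_pinyin}_{ident}" if c == target_char else "0"
            for c in chars]

print(label_sample(list("登陆人行征信系统"), "行", "hang2"))
# ['0', '0', '0', 'hang2_4', '0', '0', '0', '0']
```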
By training the target polyphone disambiguation model, the fully convolutional model Unet, in advance, the trained Unet can be applied in the polyphone disambiguation method, which improves the efficiency of polyphone disambiguation.
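Under the same assumptions, and reusing the Unet1D sketch given earlier, one training step might look like the following; batching, padding, and the integer encoding of characters and labels are simplified away, and both size arguments are assumptions.
```python
import torch
import torch.nn as nn

# Assumed sizes; the patent specifies neither the character vocabulary
# nor the number of pronunciation labels. Unet1D is the sketch above.
model = Unet1D(vocab_size=6000, num_labels=300)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(char_ids: torch.Tensor, label_ids: torch.Tensor) -> float:
    """One gradient step on a batch of (batch, seq_len) id tensors,
    treating each character position as an independent classification."""
    logits = model(char_ids)                      # (batch, seq_len, num_labels)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), label_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```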
Further, in the embodiment of the present application, FIG. 2 is a flowchart illustrating another method of polyphonic disambiguation according to an exemplary embodiment, and as shown in FIG. 2, the following steps may be further included before step 101.
Step 202, obtaining a text to be processed.
It should be noted that, in the embodiment of the present application, the text to be processed may be any text that includes at least one target polyphonic character. The text to be processed may be obtained in any manner, and the manner of obtaining it is not specifically limited in the present application.
Step 102, for each character, acquiring a first identifier corresponding to the character.
It should be noted that, in the embodiment of the present application, a first identifier is added to each character divided from the text to be processed. Specifically, for example, adding a first identifier to each of the divided characters 登, 陆, 人, 行, 征, 信, 系, 统 yields 登_1, 陆_2, 人_3, 行_4, 征_5, 信_6, 系_7, 统_8. Neither the specific values of the first identifiers nor the form in which they are added is specifically limited.
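A sketch of steps 101 and 102 together; the 1-based positional identifiers mirror the 登_1 ... 统_8 example above but are otherwise an assumption, since the application leaves the identifier values open.
```python
# Step 101: split the text to be processed into single characters.
text = "登陆人行征信系统"
chars = list(text)

# Step 102: attach a first identifier to each character (1-based here).
tagged = [f"{c}_{i}" for i, c in enumerate(chars, start=1)]
print(tagged)
# ['登_1', '陆_2', '人_3', '行_4', '征_5', '信_6', '系_7', '统_8']
```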
Step 103, inputting the characters and the first identifiers corresponding to the characters into a pre-generated target polyphone disambiguation model, and determining the pronunciation of the target polyphonic character according to the output of the target polyphone disambiguation model.
Further, in the embodiment of the present application, FIG. 4 is a flowchart of step 103 of the method of polyphonic disambiguation shown in FIG. 1 according to an exemplary embodiment, and as shown in FIG. 4, step 103 may further include the following steps.
Step 401, obtaining the labeling information of the target polyphone character according to the output of the target polyphone disambiguation model.
Step 402, determining the pronunciation of the target polyphonic character according to the labeling information of the target polyphonic character.
It should be noted that, in the embodiment of the present application, the labeling information of the target polyphonic character is obtained according to the output of the target polyphone disambiguation model, i.e., the fully convolutional model Unet. The labeling information includes the pronunciation of the target polyphonic character and the second identifier corresponding to that pronunciation, from which the pronunciation of the target polyphonic character can be determined. For example, if the labeling information of the polyphonic character 行 is found to be xing2_8, then the identifier of this pronunciation in the polyphone list is 8 and the character is pronounced xing2, in the yangping (second) tone.
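A sketch of steps 401 and 402: mapping a predicted label such as xing2_8 back to a pronunciation. ID_TO_PRONUNCIATION inverts the hypothetical polyphone list used earlier and is likewise an assumption.
```python
# Hypothetical inverse of the polyphone list: identifier -> pronunciation.
ID_TO_PRONUNCIATION = {4: "hang2", 8: "xing2"}

def decode_label(label: str) -> str:
    """Split a predicted label like 'xing2_8' into its pronunciation and
    second identifier, cross-checking the identifier against the list."""
    pinyin, ident = label.rsplit("_", 1)
    assert ID_TO_PRONUNCIATION[int(ident)] == pinyin
    return pinyin

print(decode_label("xing2_8"))   # xing2, i.e. yangping (second) tone
```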
To sum up, the text to be processed is divided into a plurality of characters comprising target polyphonic characters and non-target polyphonic characters; for each character, a first identifier corresponding to the character is acquired; and the characters together with their first identifiers are input into the pre-generated target polyphone disambiguation model, the pronunciation of the target polyphonic character being determined from the model's output. In the technical solution provided by the embodiments of the present application, the target polyphone disambiguation model, i.e., the fully convolutional model Unet, performs polyphone disambiguation on the target polyphonic characters in the text to be processed. Because Unet is easily parallelized and needs no state to be kept and updated, its outputs have no sequential dependence on one another, so it increases the prediction speed of polyphone disambiguation while still capturing context information well, which in turn improves the disambiguation effect. Moreover, because the target polyphone disambiguation model, the fully convolutional model Unet, is generated in advance, the pre-trained Unet can be applied directly in the polyphone disambiguation method, further improving the processing efficiency and prediction accuracy of polyphone disambiguation.
Fig. 5 is a block diagram illustrating an apparatus for polyphonic disambiguation according to an example embodiment, and referring to fig. 5, the apparatus includes a partitioning module 501, a first identification obtaining module 502, and a polyphonic disambiguation module 503.
A dividing module 501, configured to divide a text to be processed into a plurality of characters, where the plurality of characters include a target polyphone character and a non-target polyphone character.
A first identifier obtaining module 502, configured to, for each character, obtain a first identifier corresponding to the character.
The polyphonic character disambiguation module 503 is configured to input the character and the first identifier corresponding to the character into a pre-generated target polyphonic character disambiguation model, and determine the pronunciation of the target polyphonic character according to the output of the target polyphonic character disambiguation model.
Alternatively, FIG. 6 is a block diagram illustrating another apparatus for polyphonic disambiguation according to an example embodiment. Referring to fig. 6, the apparatus includes a polyphonic disambiguation model training module 601 and a pending text acquisition module 602.
The polyphonic disambiguation model training module 601 is configured to generate a target polyphonic disambiguation model in advance.
A pending text obtaining module 602, configured to obtain a pending text.
Alternatively, FIG. 7 is an apparatus block diagram of a polyphonic disambiguation model training module 601 of the apparatus block diagram of FIG. 6 illustrating another polyphonic disambiguation according to an example embodiment. Referring to fig. 7, the apparatus includes a training sample acquisition unit 701, a polyphonic disambiguation model training unit 702.
The training sample obtaining unit 701 is configured to obtain a training sample, where the training sample includes a plurality of sample texts and labeling information of a target polyphonic character in the plurality of sample texts, and the labeling information is used to indicate a pronunciation of the target polyphonic character in the sample texts and a second identifier corresponding to the pronunciation of the target polyphonic character in the sample texts.
And the polyphone disambiguation model training unit 702 is configured to train a preset initial model by using the plurality of sample texts as input and using the label information of the target polyphone characters in the plurality of sample texts as an output target, and determine the trained model as a target polyphone disambiguation model.
Alternatively, fig. 8 is an apparatus block diagram of the polyphonic disambiguation module 503 in the apparatus block diagram of the polyphonic disambiguation apparatus shown in fig. 5 according to an example embodiment. Referring to fig. 8, the apparatus includes a label information acquisition unit 801, a pronunciation acquisition unit 802 of a target polyphonic character.
And a label information obtaining unit 801, configured to obtain label information of the target polyphonic character according to the output of the target polyphonic disambiguation model.
A pronunciation obtaining unit 802 for the target polyphonic character, configured to determine the pronunciation of the target polyphonic character according to the label information of the target polyphonic character.
As with the method embodiments, the apparatus divides the text to be processed into a plurality of characters comprising target polyphonic characters and non-target polyphonic characters, acquires for each character a corresponding first identifier, inputs the characters and their first identifiers into the pre-generated target polyphone disambiguation model, and determines the pronunciation of the target polyphonic character from the model's output. Because the target polyphone disambiguation model, the fully convolutional model Unet, is easily parallelized, needs no state to be kept and updated, and is generated in advance, the apparatus likewise increases the prediction speed of polyphone disambiguation, captures context information well, and improves the disambiguation effect as well as the processing efficiency and prediction accuracy.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a block diagram illustrating an electronic device 900 in accordance with an example embodiment. For example, the electronic device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, electronic device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the electronic device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 900.
The multimedia components 908 include a screen that provides an output interface between the electronic device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
Input/output interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the electronic device 900. For example, the sensor component 914 may detect the open/closed state of the electronic device 900 and the relative positioning of components such as its display and keypad, and may also detect a change in the position of the electronic device 900 or of one of its components, the presence or absence of user contact with the device, the orientation or acceleration/deceleration of the device, and a change in its temperature. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the electronic device 900 and other devices. The electronic device 900 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the electronic device 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method of polyphonic disambiguation, the method comprising:
dividing a text to be processed into a plurality of characters, wherein the plurality of characters comprise target polyphone characters and non-target polyphone characters;
for each character, acquiring a first identifier corresponding to the character;
inputting the characters and the first identifiers corresponding to the characters into a pre-generated target polyphone disambiguation model, and determining the pronunciation of the target polyphonic character according to the output of the target polyphone disambiguation model.
2. The method of claim 1, wherein before the step of dividing the text to be processed into a plurality of characters, the plurality of characters comprising a target polyphonic character and a non-target polyphonic character, the method further comprises:
generating a target polyphone disambiguation model in advance;
and acquiring a text to be processed.
3. The method of claim 2, wherein pre-generating the target polyphone disambiguation model comprises:
acquiring a training sample, wherein the training sample comprises a plurality of sample texts and labeling information of the target polyphonic characters in the plurality of sample texts, the labeling information indicating the pronunciation of the target polyphonic character in each sample text and a second identifier corresponding to that pronunciation;
and training a preset initial model by taking the plurality of sample texts as input and the labeling information of the target polyphonic characters in the plurality of sample texts as the output target, and determining the trained model as the target polyphone disambiguation model.
4. The method of claim 1, wherein determining the pronunciation of the target polyphonic character from the output of the target polyphone disambiguation model comprises:
obtaining the labeling information of the target polyphonic character according to the output of the target polyphone disambiguation model;
and determining the pronunciation of the target polyphonic character according to the labeling information of the target polyphonic character.
5. An apparatus for polyphonic disambiguation, the apparatus comprising:
the device comprises a dividing module, a processing module and a processing module, wherein the dividing module is used for dividing a text to be processed into a plurality of characters, and the plurality of characters comprise target polyphonic characters and non-target polyphonic characters;
the first identifier acquisition module is used for acquiring a first identifier corresponding to each character;
and the polyphone disambiguation module is used for inputting the characters and the first identifiers corresponding to the characters into a pre-generated target polyphone disambiguation model and determining the pronunciation of the target polyphonic characters according to the output of the target polyphone disambiguation model.
6. The apparatus of claim 5, further comprising:
the polyphone disambiguation model training module is used for generating a target polyphone disambiguation model in advance;
and the to-be-processed text acquisition module is used for acquiring the to-be-processed text.
7. The apparatus of claim 6, wherein the polyphonic disambiguation model training module comprises:
the training sample acquisition unit is used for acquiring a training sample, wherein the training sample comprises a plurality of sample texts and labeling information of the target polyphonic characters in the plurality of sample texts, the labeling information indicating the pronunciation of the target polyphonic character in each sample text and a second identifier corresponding to that pronunciation;
and the polyphone disambiguation model training unit is used for training a preset initial model by taking the plurality of sample texts as input and the labeling information of the target polyphonic characters in the plurality of sample texts as the output target, and determining the trained model as the target polyphone disambiguation model.
8. The apparatus of claim 5, wherein the polyphonic disambiguation module further comprises:
the label information acquisition unit is used for acquiring label information of the target polyphonic characters according to the output of the target polyphonic disambiguation model;
and the pronunciation acquisition unit of the target polyphonic character is used for determining the pronunciation of the target polyphonic character according to the labeling information of the target polyphonic character.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a method of polyphonic disambiguation as claimed in any of claims 1 to 4.
10. A computer-readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of polyphonic disambiguation as claimed in any of claims 1 to 4.
CN202210086347.7A, filed 2022-01-25 (priority 2022-01-25): Multi-tone word disambiguation method and device, electronic equipment and readable storage medium. Published as CN114550691A; status: Pending.

Priority Applications (1)

CN202210086347.7A (priority date 2022-01-25, filing date 2022-01-25): Multi-tone word disambiguation method and device, electronic equipment and readable storage medium

Publications (1)

CN114550691A, published 2022-05-27

Family

ID=81670657

Family Applications (1)

CN202210086347.7A (pending, published as CN114550691A): Multi-tone word disambiguation method and device, electronic equipment and readable storage medium

Country Status (1)

CN: CN114550691A (en)

Cited By (1)

CN115273809A (北京市商汤科技开发有限公司; priority date 2022-06-22, publication date 2022-11-01): Training method of polyphone pronunciation prediction network, and speech generation method and device


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination