CN112257420B - Text processing method and device - Google Patents

Info

Publication number
CN112257420B
Authority
CN
China
Prior art keywords
text
pinyin
initial
polyphone
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011133952.2A
Other languages
Chinese (zh)
Other versions
CN112257420A (en)
Inventor
蒋荣正
夏龙
马楠
杨明祺
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd
Priority to CN202011133952.2A
Publication of CN112257420A
Application granted
Publication of CN112257420B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This specification provides a text processing method and a text processing device. The text processing method comprises the following steps: acquiring an initial text carrying a polyphone identifier, wherein the initial text contains at least one polyphone; determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1; determining a phrase pinyin sequence of the element phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence; when the element phrase and the reference phrase are inconsistent, incrementing i by 1 and returning to the step of determining the i-th pinyin sequence corresponding to the initial text; and when the element phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.

Description

Text processing method and device
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a text processing method and apparatus.
Background
With the development of internet technology, more and more application scenarios place increasingly high requirements on the quantity and quality of data, and different scenarios use different data. In the field of machine learning, different models are built for different usage requirements, and different models must be trained with different sample data. For example, models applied in image processing scenarios are trained with image data, models applied in audio processing scenarios are trained with audio data, and models applied in text processing scenarios are trained with text data. To train a model that meets the usage requirement, the sample data must be preprocessed in the data preparation stage, for example by labeling it or by constructing sample pairs. These preparation operations directly affect the accuracy of the trained model. In the prior art, sample data is labeled manually, which is inefficient; moreover, manual labeling cannot guarantee accuracy and easily introduces errors into model training. An effective solution to these problems is therefore needed.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a text processing method. The present specification also relates to a text processing apparatus, a computing device, and a computer-readable storage medium, so as to overcome the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided a text processing method, including:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the element phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
when the element phrase is inconsistent with the reference phrase, incrementing i by 1 and returning to the step of determining the i-th pinyin sequence corresponding to the initial text;
and when the element phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.
Optionally, before the step of acquiring the initial text carrying the polyphone identifier is performed, the method further includes:
collecting a text to be processed, and normalizing the text to be processed to obtain a standard text;
determining a standard polyphone in the standard text based on a preset polyphone dictionary, and marking the standard polyphone;
and obtaining a standard text carrying the polyphone identifier according to the marking result, and writing the standard text carrying the polyphone identifier into a standard text library.
Optionally, acquiring the initial text carrying the polyphone identifier includes:
when an update request for updating the polyphone text library is received, extracting the initial text carrying the polyphone identifier from the standard text library based on the update request, wherein the polyphone identifier marks the character position of at least one polyphone contained in the initial text.
Optionally, determining the i-th pinyin sequence corresponding to the initial text includes:
inputting the initial text into a pinyin generation module for processing to obtain the i-th pinyin sequence corresponding to the initial text output by the pinyin generation module, wherein i is a positive integer starting from 1.
Optionally, constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text includes:
determining the character position of the polyphone in the initial text based on the polyphone identifier;
determining adjacent character positions next to that character position through a preset selection strategy, and determining the adjacent characters at the adjacent character positions according to the initial text;
and constructing at least one element phrase consisting of the adjacent characters and the polyphone, in the order in which the adjacent characters and the polyphone appear in the initial text.
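The phrase-construction step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the "preset selection strategy" is assumed here to be a fixed window of `window` characters on each side of the polyphone, and the function name `build_element_phrases` is hypothetical.

```python
from typing import List, Sequence, Tuple


def build_element_phrases(
    text: str, polyphone_positions: Sequence[int], window: int = 1
) -> List[Tuple[int, str]]:
    """Return one (start_index, phrase) pair per marked polyphone.

    Assumption: the selection strategy takes up to `window` neighbouring
    characters on each side of the polyphone, clipped to the text bounds.
    Positions are zero-based character indices.
    """
    phrases = []
    for pos in polyphone_positions:
        start = max(0, pos - window)
        end = min(len(text), pos + window + 1)
        # Slicing keeps the characters in their original order in the text.
        phrases.append((start, text[start:end]))
    return phrases


phrases = build_element_phrases("我了解你", [1])
```

Returning the start index together with the phrase makes it easy to locate the same span later in a per-character pinyin sequence.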
Optionally, determining the phrase pinyin sequence of the element phrase according to the i-th pinyin sequence includes:
preprocessing the initial text to obtain a plurality of initial characters, and preprocessing the element phrase to obtain a plurality of element characters;
determining the pinyin of each of the plurality of initial characters according to the i-th pinyin sequence;
determining the pinyin of each of the plurality of element characters based on the pinyin of each of the plurality of initial characters;
and generating the phrase pinyin sequence according to the pinyin of each of the plurality of element characters.
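A minimal sketch of this mapping, assuming the pinyin sequence holds exactly one tone-numbered syllable per character of the initial text, so that the phrase's pinyin is simply the matching slice. The function name and the `start` argument are hypothetical.

```python
from typing import List, Sequence


def phrase_pinyin_sequence(
    text: str, pinyin_sequence: Sequence[str], start: int, phrase: str
) -> List[str]:
    """Pick out the pinyin of the phrase's characters from the text's pinyin.

    Assumption: pinyin_sequence[k] is the pinyin of text[k].
    """
    if len(text) != len(pinyin_sequence):
        raise ValueError("expected one pinyin syllable per character")
    end = start + len(phrase)
    if text[start:end] != phrase:
        raise ValueError("phrase does not occur at the given position")
    return list(pinyin_sequence[start:end])


phrase_pinyin = phrase_pinyin_sequence(
    "我了解你", ["wo3", "liao3", "jie3", "ni3"], 0, "我了解"
)
```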
Optionally, after i is incremented by 1 and the step of determining the i-th pinyin sequence corresponding to the initial text is executed because the element phrase and the reference phrase are inconsistent, the method further includes:
detecting whether the (i+1)-th pinyin sequence is consistent with the i-th pinyin sequence;
if not, executing the step of constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text;
and if so, writing the initial text into a non-standard text library.
Optionally, creating a text pinyin group based on the polyphone identifier, the initial text, and the i-th pinyin sequence includes:
determining, based on the polyphone identifier, the pinyin position in the i-th pinyin sequence of the pinyin corresponding to the polyphone;
extracting the pinyin corresponding to the polyphone from the i-th pinyin sequence according to the pinyin position;
and integrating the initial text, the polyphone identifier and the pinyin corresponding to the polyphone to obtain the text pinyin group.
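The three steps above can be sketched as follows. The record layout (a dictionary with `text`, `positions` and `pinyin` keys) is an assumption made for illustration; the specification does not fix a storage format. The sketch also assumes that the pinyin position equals the character position, that is, one syllable per character.

```python
from typing import Dict, List, Sequence


def create_text_pinyin_group(
    text: str, positions: Sequence[int], pinyin_sequence: Sequence[str]
) -> Dict[str, object]:
    """Bundle the text, the polyphone positions and the polyphones' pinyin."""
    # Extract only the pinyin of the marked polyphones from the verified sequence.
    polyphone_pinyin: List[str] = [pinyin_sequence[p] for p in positions]
    return {"text": text, "positions": list(positions), "pinyin": polyphone_pinyin}


group = create_text_pinyin_group("我了解", (1,), ["wo3", "liao3", "jie3"])
```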
Optionally, after the step of creating a text pinyin group based on the polyphone identifier, the initial text, and the i-th pinyin sequence and writing the text pinyin group into the polyphone text library is performed, the method further includes:
when a read request submitted for the polyphone text library is received, reading a training text pinyin group from the polyphone text library according to the read request;
parsing the training text pinyin group to obtain a training initial text and a training pinyin sequence;
and training an initial pinyin labeling model based on the training initial text and the training pinyin sequence to obtain a target pinyin labeling model.
Optionally, the initial text is an initial Chinese text, and each pinyin contained in the i-th pinyin sequence carries a tone.
According to a second aspect of embodiments of the present specification, there is provided a text processing apparatus comprising:
an acquisition module configured to acquire an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
a determining module configured to determine an i-th pinyin sequence corresponding to the initial text, and to construct at least one element phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
a processing module configured to determine a phrase pinyin sequence of the element phrase according to the i-th pinyin sequence, and to input the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
wherein, when the element phrase is inconsistent with the reference phrase, i is incremented by 1 and the determining module is run again;
and, when the element phrase is consistent with the reference phrase, a writing module is run, the writing module being configured to create a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence and to write the text pinyin group into a polyphone text library.
According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to perform:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the element phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
when the element phrase is inconsistent with the reference phrase, incrementing i by 1 and returning to the step of determining the i-th pinyin sequence corresponding to the initial text;
and when the element phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.
According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the text processing method.
According to the text processing method provided by this specification, after the initial text containing the polyphone is obtained, the pinyin sequence of the initial text is determined, and at least one element phrase containing the polyphone is constructed based on the polyphone identifier carried by the initial text. The phrase pinyin sequence of the element phrase is then determined from the obtained pinyin sequence, and a reference phrase is generated from the phrase pinyin sequence. The correctness of the pinyin sequence is checked by comparing the reference phrase with the element phrase. If the two are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until they are consistent, at which point the correct pinyin of the polyphone in the initial text is known. The pinyin sequence for which the check succeeded, the polyphone identifier and the initial text are then integrated into a text pinyin group and written into the polyphone text library. In this way, when the polyphones in the initial text are labeled with pinyin, the correct pinyin is determined through verification, which saves manpower and material resources and effectively ensures the accuracy of the text pinyin group finally created. A high-quality polyphone text library can thus be built, which in turn improves the quality of the downstream services that use it.
Drawings
FIG. 1 is a flow chart of a text processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a text processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a polyphone dictionary in a text processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of text to be processed in a text processing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of normalization processing in a text processing method according to an embodiment of the present disclosure;
FIG. 6 is a process flow diagram of a model training process provided in an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a text processing device according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a computing device according to one embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. The description can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the description is therefore not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
First, terms related to one or more embodiments of the present invention will be explained.
Tone: refers to the variation in the pitch of a sound. In modern Chinese phonetics, tone is an inherent property of a Chinese syllable and distinguishes meaning through the level and contour of the pitch. Chinese has five tones: yin ping (first tone, marked ˉ), yang ping (second tone, marked ˊ), shang sheng (third tone, marked ˇ), qu sheng (fourth tone, marked ˋ) and the neutral (light) tone. For example, the pinyin of 妈 (mother) is mā, and its tone is yin ping; the pinyin of 麻 (hemp) is má, and its tone is yang ping; the pinyin of 马 (horse) is mǎ, and its tone is shang sheng; the pinyin of 骂 (curse) is mà, and its tone is qu sheng; and the pinyin of 吗 (a question particle) is ma, and its tone is the neutral tone.
The present specification provides a text processing method and also relates to a text processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
In practical applications, a polyphone is pronounced differently in different texts, so its pinyin can only be labeled correctly by taking the contextual semantics into account. In the prior art, when the pinyin of a polyphone in a text is labeled, the correct pronunciation of the polyphone is usually determined by manual verification and the pinyin is then labeled. This process is time-consuming and labor-intensive, and it requires reviewers with strong language skills; otherwise the correct pinyin of the polyphone may not be labeled. As a result, neither the quality nor the quantity of the data written into the polyphone text library can be guaranteed, and improving the efficiency of updating or constructing the polyphone text library is a problem to be solved.
To ensure the accuracy of polyphone pinyin labeling and to improve labeling efficiency, the text processing method provided by this specification proceeds as follows. After an initial text containing a polyphone is acquired, the pinyin sequence of the initial text is determined, and at least one element phrase containing the polyphone is constructed based on the polyphone identifier carried by the initial text. The phrase pinyin sequence of the element phrase is then determined from the obtained pinyin sequence, and a reference phrase is generated from the phrase pinyin sequence. The correctness of the pinyin sequence is checked by comparing the reference phrase with the element phrase. If the two are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until they are consistent, at which point the correct pinyin of the polyphone in the initial text is known. The pinyin sequence for which the check succeeded, the polyphone identifier and the initial text are then integrated into a text pinyin group and written into the polyphone text library. In this way, when the polyphones in the initial text are labeled with pinyin, the correct pinyin is determined through verification, which saves manpower and material resources and effectively ensures the accuracy of the text pinyin group finally created. A high-quality polyphone text library can thus be built, which in turn improves the quality of the downstream services that use it.
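The verification loop described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the patented implementation: `candidates` stands in for the successive outputs of the pinyin generation module, `generate_text` stands in for the text generation module (here a toy lookup table), and the phrase window of one neighbouring character per side is an assumed selection strategy. All names are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def verify_pinyin(
    text: str,
    positions: Sequence[int],
    candidates: Iterable[Sequence[str]],
    generate_text: Callable[[Sequence[str]], str],
    window: int = 1,
) -> Optional[List[str]]:
    """Return the first candidate whose phrase pinyin regenerates every phrase.

    Returns None when a candidate repeats the previous one or when the
    candidates run out, i.e. the text cannot be verified.
    """
    spans = [
        (max(0, p - window), min(len(text), p + window + 1)) for p in positions
    ]
    previous = None
    for candidate in candidates:
        sequence = list(candidate)
        if sequence == previous:
            return None  # the (i+1)-th sequence equals the i-th: give up
        # Compare each reference phrase with the element phrase it should match.
        if all(generate_text(sequence[s:e]) == text[s:e] for s, e in spans):
            return sequence
        previous = sequence
    return None


# Toy stand-in for the text generation module (assumption for illustration).
TOY_LEXICON = {("wo3", "liao3", "jie3"): "我了解"}


def toy_generate_text(pinyin: Sequence[str]) -> str:
    return TOY_LEXICON.get(tuple(pinyin), "")


verified = verify_pinyin(
    "我了解",
    [1],
    [["wo3", "le5", "jie3"], ["wo3", "liao3", "jie3"]],
    toy_generate_text,
)
```

In this toy run the first candidate reads 了 as le5, the regenerated phrase does not match, and the second candidate is accepted. A text for which `verify_pinyin` returns None would go to the non-standard text library rather than the polyphone text library.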
Fig. 1 shows a flowchart of a text processing method according to an embodiment of the present disclosure, which specifically includes the following steps:
Step S102, an initial text carrying a polyphone identifier is obtained, wherein the initial text contains at least one polyphone.
In a specific implementation, a polyphone is a Chinese character that has more than one pronunciation. For example, 了 is read liǎo (tone 3, shang sheng) in the phrase 了解 ("to know, to understand") and le (tone 5, the neutral tone) in the phrase 好了 ("all right"). Correspondingly, the initial text is a text containing at least one polyphone, and the polyphone identifier is an identifier that marks the position of the polyphone in the initial text. In addition, because the tone cannot be marked directly when the pinyin is labeled, the correct pronunciation of each polyphone is expressed as pinyin combined with a digit; for example, liǎo is written as liao3, where 3 denotes tone 3 (shang sheng). In this embodiment, for convenience of description, pinyin in the yin ping tone is combined with the digit 1, so that mā (mother) is written as ma1; pinyin in the yang ping tone is combined with the digit 2, so that má (hemp) is written as ma2; pinyin in the shang sheng tone is combined with the digit 3, so that mǎ (horse) is written as ma3; pinyin in the qu sheng tone is combined with the digit 4, so that mà (curse) is written as ma4; and pinyin in the neutral tone is combined with the digit 5, so that ma is written as ma5.
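The digit notation can be derived mechanically from tone-marked pinyin. The sketch below is one possible conversion, assuming the input is a single syllable written with standard Unicode tone marks; the helper name `to_numbered` is hypothetical. It decomposes the syllable with `unicodedata.normalize`, maps the combining tone mark to a digit, and falls back to 5 for the neutral tone.

```python
import unicodedata

# Combining marks used for the four marked tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def to_numbered(syllable: str) -> str:
    """Convert one tone-marked pinyin syllable to pinyin plus a tone digit."""
    tone = "5"  # no tone mark means the neutral tone
    base = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base.append(ch)  # keeps letters and the diaeresis of "ü"
    return unicodedata.normalize("NFC", "".join(base)) + tone


numbered = [to_numbered(s) for s in ["mā", "má", "mǎ", "mà", "ma", "liǎo"]]
```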
In practical applications, other combinations may also be chosen to express the correct tone of each pinyin in detail, for example combining the pinyin with a symbol such as #; this embodiment places no limitation on the specific form of the combined expression.
This embodiment describes the text processing method using the example of an initial Chinese text that contains one polyphone. When the initial text contains two or more polyphones, the corresponding description in this embodiment applies and is not repeated here. It should be noted that, when the initial text contains two or more polyphones, the text pinyin group is likewise created and written into the polyphone text library only after the pinyin of every polyphone has been determined to be correct, so as to serve downstream services.
Referring to the schematic diagram of the text processing method shown in FIG. 2, an initial text carrying polyphone identifiers is first obtained from the standard text library. The initial text is expressed as (1, 2) ab …, where 1 and 2 are the polyphone identifiers that give the positions of the polyphones in the text; that is, the second and third Chinese characters from the left are polyphones and, for convenience of expression, a denotes the second Chinese character and b denotes the third. A plurality of pinyin sequences corresponding to the initial text is then determined, and a text generation module generates a reference text from one of these pinyin sequences. Comparing the reference text with the initial text checks whether that pinyin sequence correctly labels the pinyin of each Chinese character in the initial text. If the two are consistent, the pinyin labeling is correct, and a text pinyin group is created directly and written into the polyphone text library. If they are inconsistent, the next pinyin sequence is selected to generate a new reference text and the check is repeated until a pinyin sequence that yields a consistent result is obtained, after which the text pinyin group is formed and written into the polyphone text library.
In practical applications, it may happen that the reference texts generated from all of the pinyin sequences are inconsistent with the initial text. This indicates that the initial text cannot be correctly labeled with pinyin, so the initial text can be deleted from the standard text library to prevent useless data from occupying storage space and wasting storage resources.
Based on this, in order to obtain initial texts carrying polyphones and polyphone identifiers, a large number of texts to be processed must be collected so that a standard text library can be constructed first and then used to build up the polyphone text library. In this embodiment, the specific implementation is as follows:
collecting a text to be processed, and normalizing the text to be processed to obtain a standard text;
determining a standard polyphone in the standard text based on a preset polyphone dictionary, and marking the standard polyphone;
and obtaining a standard text carrying the polyphone identifier according to the marking result, and writing the standard text carrying the polyphone identifier into a standard text library.
Specifically, the text to be processed is a Chinese text captured through big data, and the standard text is the text obtained by normalizing the text to be processed. The polyphone dictionary is a dictionary that stores a large number of polyphones and their corresponding pinyin. Referring to the schematic diagram of the polyphone dictionary shown in FIG. 3, the dictionary expresses the pinyin of each polyphone as a mapping from a Chinese character to its pinyin; for example, one polyphone entry maps to the pinyin zhao2, zhuo, zhao1 and zhe5. It should be noted that, in practical applications, the entries written into the polyphone dictionary shown in FIG. 3 can be set according to actual requirements. The standard polyphones are the polyphones contained in the standard text.
The polyphone dictionary can only identify which characters in a text are polyphones; it cannot determine the correct pinyin of a polyphone in context. Therefore, after the text to be processed is acquired and normalized, the polyphone dictionary is used to mark the standard text, so that the standard text carrying the polyphone identifier is obtained and written into the standard text library. When the polyphone text library is later updated or constructed, initial texts that meet the pinyin labeling requirement can be extracted from the standard text library.
In this process, not every text can be used directly for polyphone marking; see the texts in FIG. 4, such as the second and third texts. If polyphones were marked on such texts directly, the reference phrase and the element phrase could not be compared correctly during subsequent verification, and the update or construction of the polyphone text library could not be completed. Therefore, so that texts can be used effectively in subsequent processing, the text to be processed is normalized after it is obtained to yield the standard text, and polyphone marking is then performed on the standard text.
In the normalization process, non-Chinese characters or symbols in the text to be processed are converted into Chinese characters; that is, number normalization, symbol normalization, unit normalization and translation normalization are performed. For example, the digit 1 in the text to be processed is converted into the Chinese numeral 一 (one); an English punctuation mark such as "." is converted into the corresponding Chinese punctuation mark "。"; the unit kg is converted into the corresponding Chinese unit word; and the English word "hi" is converted into the corresponding Chinese greeting ("hello"). This yields a Chinese text that meets the labeling requirement, on which the polyphones are then marked.
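A toy version of the number, symbol and unit normalization can be written with lookup tables. This is a sketch only: the tables are small illustrative assumptions, digits are converted one by one rather than read as whole numbers, the choice of 千克 for kg is an assumption, and translation normalization (for example of English words) is left out.

```python
# Illustrative lookup tables (assumptions, not taken from the specification).
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))
PUNCTUATION = {",": "，", ".": "。", "?": "？", "!": "！"}
UNITS = {"kg": "千克"}


def normalize(text: str) -> str:
    """Replace units, digits and English punctuation with Chinese forms."""
    # Units first, so multi-letter unit names are replaced as a whole.
    for latin, hanzi in UNITS.items():
        text = text.replace(latin, hanzi)
    # Then map the remaining characters one by one.
    return "".join(DIGITS.get(ch, PUNCTUATION.get(ch, ch)) for ch in text)


standard_text = normalize("0的倒数是多少?")
```

A production normalizer would need context-aware rules, for instance to read multi-digit numbers and to tell a decimal point from a full stop.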
Based on this, referring to the schematic diagram of the normalization process shown in FIG. 5, after the text to be processed is acquired, it is normalized to obtain the corresponding standard text. The standard polyphones in the standard text are then determined based on the preset polyphone dictionary and marked, that is, the positions of the polyphones in the text are marked. The standard text carrying the polyphone identifier is obtained from the marking result and is finally written into the standard text library. A standard text library that meets the usage requirement is thereby created for use when the polyphone text library is updated or constructed.
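Marking can be sketched as a dictionary lookup over the characters of the standard text. The dictionary contents below are illustrative assumptions rather than the contents of FIG. 3, and the output shape, a tuple of zero-based positions paired with the text, is likewise only one possible encoding of the polyphone identifier.

```python
from typing import Dict, List, Tuple

# Illustrative polyphone dictionary: character -> candidate pinyin (assumed entries).
POLYPHONE_DICT: Dict[str, List[str]] = {
    "着": ["zhao2", "zhuo2", "zhao1", "zhe5"],
    "了": ["liao3", "le5"],
}


def mark_polyphones(
    text: str, dictionary: Dict[str, List[str]]
) -> Tuple[Tuple[int, ...], str]:
    """Return (positions of polyphones, text) for a normalized text."""
    positions = tuple(i for i, ch in enumerate(text) if ch in dictionary)
    return positions, text


marked = mark_polyphones("我了解", POLYPHONE_DICT)
```

Note that the lookup only says where polyphones occur; choosing among the candidate pinyin is left to the verification procedure.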
Further, when the polyphone text library needs to be updated, the initial text carrying the polyphone identifier is extracted from the standard text library according to the received update request and is used for the subsequent text processing, so that the polyphone text library is updated to meet downstream usage requirements. The polyphone identifier is used to mark the character position of at least one polyphone contained in the initial text.
For example, the collected texts to be processed are shown in fig. 4: "sit against the sun", "what the circumference of one O has to be known", "reciprocal 0", and so on. It is determined that the second and third texts need to be normalized, which yields the corresponding standard texts, in which "O" is replaced by "circle" and "0" by "zero". The polyphones in each standard text are then marked using the preset polyphone dictionary, and the texts written into the standard text library, as shown in fig. 5, are obtained from the marking result. For instance, the initial text carrying the polyphone identifier that corresponds to the first text is [(1, 2, 5) - facing the sun sitting vehicle], and so on.
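The marking step of this example can be sketched as follows. The Chinese sentence 迎着朝阳坐车 is an assumed reconstruction of the translated example text, the three-character polyphone dictionary is hypothetical, and the zero-based position convention is an assumption chosen because it reproduces the identifier (1, 2, 5).

```python
from typing import Set, Tuple

# Hypothetical polyphone dictionary, reduced to the characters needed here.
POLYPHONE_DICT: Set[str] = {"着", "朝", "车"}


def mark_polyphones(
    standard_text: str, dictionary: Set[str] = POLYPHONE_DICT
) -> Tuple[Tuple[int, ...], str]:
    """Return (polyphone identifier, standard text).

    The identifier is the tuple of zero-based positions of every character
    of the text that appears in the polyphone dictionary.
    """
    positions = tuple(
        index for index, ch in enumerate(standard_text) if ch in dictionary
    )
    return positions, standard_text


# Assumed reconstruction of the example sentence; written to a stand-in
# for the standard text library.
standard_library = [mark_polyphones("迎着朝阳坐车")]
```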
In summary, to facilitate the subsequent updating or construction of the polyphone text library, the standard text library is constructed before the text processing: the collected text to be processed is normalized, its polyphones are marked, and the marked standard text is written into the standard text library. The polyphone text library can then be updated or constructed from well-formed initial texts, which ensures the data quality of the polyphone text library, improves the efficiency of updating or constructing it, and in turn improves the efficiency of downstream services.
Step S104, determining an ith pinyin sequence corresponding to the initial text, and constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1.
Specifically, once the initial text carrying the polyphone identifier is obtained, an ith pinyin sequence corresponding to the initial text is generated. The ith pinyin sequence is a pinyin sequence generated according to one pronunciation of the polyphone in the initial text. The subsequent verification is performed on this pinyin sequence. When the verification result meets the condition for creating a text pinyin group, the pinyin sequence is written into the polyphone text library; when it does not, the next pinyin sequence is generated according to the next pronunciation of the polyphone in the initial text and is verified in the same way, until the creation condition is met or no new pinyin sequence can be generated. It should be noted that the pinyin contained in the ith pinyin sequence carries tones.
Based on this, after the ith pinyin sequence is generated for the initial text (i is a positive integer starting from 1, and its maximum value is the number of pronunciations of the polyphone), the generated sequence needs to be checked, that is, it must be determined whether the pinyin assigned to the polyphone in the ith pinyin sequence is the correct pronunciation of the polyphone in the initial text. To improve the accuracy of this check, at least one element phrase containing the polyphone is constructed according to the polyphone identifier and the initial text. The element phrase is later used to generate a reference phrase, and the pinyin of the polyphone is checked by comparing the reference phrase with the element phrase. An element phrase is a phrase that contains the polyphone and whose other characters all come from the initial text and are adjacent to the polyphone. Such a phrase tends to be meaningful and to preserve the semantics of the initial text, which further improves the accuracy of the check.
Further, the ith pinyin sequence corresponding to the initial text may be determined by a preset pinyin generation module, which may be a pinyin generation model or a pinyin generation tool (for example, one that generates the pinyin of each Chinese character by querying a dictionary). The initial text is input to the pinyin generation module, and the ith pinyin sequence corresponding to the initial text is obtained from its output. It should be noted that the pinyin generation module may output one or more pinyin sequences, in which case one of them is selected as the ith pinyin sequence corresponding to the initial text.
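A dictionary-based pinyin generation tool of the kind mentioned above might be sketched as follows. Both dictionaries are hypothetical and contain only the readings that appear in the worked example of this embodiment; the Chinese sentence is an assumed reconstruction of the translated example text.

```python
from typing import Dict, Iterator, List

# Hypothetical dictionaries holding only the readings used in the example.
DEFAULT_READING: Dict[str, str] = {
    "迎": "ying2", "着": "zhao2", "朝": "chao1",
    "阳": "yang2", "坐": "zuo4", "车": "che1",
}
POLYPHONE_READINGS: Dict[str, List[str]] = {"着": ["zhao2", "zhe5"]}


def pinyin_sequences(text: str, position: int) -> Iterator[List[str]]:
    """Yield one toned pinyin sequence per reading of the polyphone.

    The i-th yielded sequence (i starting from 1) uses the i-th reading of
    the polyphone at `position`; all other characters keep their default
    reading.
    """
    base = [DEFAULT_READING[ch] for ch in text]
    for reading in POLYPHONE_READINGS[text[position]]:
        candidate = list(base)
        candidate[position] = reading
        yield candidate


sequences = list(pinyin_sequences("迎着朝阳坐车", 1))
```

Because the sequences are produced lazily, the next candidate is only generated when verification of the previous one fails.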
Further, after the ith pinyin sequence is determined, at least one element phrase containing the polyphone can be constructed by combining the polyphone identifier and the initial text, so as to improve the accuracy of checking the pinyin of the polyphone. In this embodiment, the specific implementation is as follows:
determining the character position of the polyphone in the initial text based on the polyphone identifier;
determining adjacent character positions adjacent to the character position through a preset selection strategy, and determining the adjacent characters corresponding to the adjacent character positions according to the initial text;
and constructing at least one element phrase consisting of the adjacent characters and the polyphone according to the order in which the adjacent characters and the polyphone are arranged in the initial text.
Specifically, the character position is the position of the polyphone in the initial text; the selection strategy is the rule for generating element phrases; an adjacent character position is the position of a character that precedes or follows the polyphone; and an adjacent character is a character adjacent to the polyphone.
Based on this, the character position of the polyphone in the initial text is first determined according to the polyphone identifier. The adjacent character positions are then determined according to the preset selection strategy, for example by selecting the positions of the 5 characters before and after the polyphone. The adjacent characters of the polyphone in the initial text are determined from these positions, and finally at least one element phrase consisting of the adjacent characters and the polyphone is constructed according to the order in which they are arranged in the initial text.
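One possible selection strategy is a symmetric window that grows by one character on each side and is clipped at the boundaries of the initial text. The sketch below is an assumption about how such a strategy could look; the function name, the `max_window` parameter and the decision to skip duplicate windows are illustrative, and the Chinese sentence is an assumed reconstruction of the example text used in this embodiment.

```python
from typing import List, Tuple


def build_element_phrases(
    text: str, position: int, max_window: int
) -> List[Tuple[int, str]]:
    """Return (start offset, element phrase) pairs around the polyphone.

    For window sizes 1..max_window, take up to `window` characters before
    and after the polyphone, clipped to the text, keeping the original
    character order. Windows that clip to an identical span are skipped.
    """
    phrases: List[Tuple[int, str]] = []
    seen = set()
    for window in range(1, max_window + 1):
        start = max(0, position - window)
        end = min(len(text), position + window + 1)
        if (start, end) not in seen:
            seen.add((start, end))
            phrases.append((start, text[start:end]))
    return phrases


element_phrases = build_element_phrases("迎着朝阳坐车", 1, 4)
```

The start offset is returned with each phrase so that the position of the polyphone inside the phrase (`position - start`) remains known for the later comparison.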
It should be noted that, because the pronunciation of a polyphone may differ from phrase to phrase, several element phrases may be created in order to analyze the correctness of the current ith pinyin sequence accurately, and each element phrase is then checked in turn. As long as any element phrase is consistent with its reference phrase, the ith pinyin sequence may be considered correct, that is, the pinyin of the polyphone is correct, and the subsequent text processing can be performed. Alternatively, the ith pinyin sequence may be verified by ratio analysis: if the proportion of element phrases that are consistent with their reference phrases is higher than a certain threshold, the ith pinyin sequence may be considered correct. In practical applications, the specific verification policy may be set according to actual requirements, and this embodiment imposes no limitation. In this embodiment, an element phrase being consistent with a reference phrase may mean that the two are completely identical, that they are partially identical, or that every element phrase is identical to its corresponding reference phrase.
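The two verification policies described above can be expressed over a list of per-phrase comparison results. This is a minimal sketch; the function names and the strict "greater than" reading of the threshold are assumptions.

```python
from typing import Sequence


def any_match(results: Sequence[bool]) -> bool:
    """Policy 1: the sequence is accepted if any element phrase is consistent."""
    return any(results)


def ratio_match(results: Sequence[bool], threshold: float) -> bool:
    """Policy 2: accepted if the share of consistent phrases exceeds the threshold."""
    if not results:
        return False
    return sum(results) / len(results) > threshold
```

For example, with four element phrases of which one is consistent, `any_match` accepts the sequence while `ratio_match` with a threshold of 0.5 rejects it.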
Continuing the above example, when the initial text [(1, 2, 5) - facing the sun sitting vehicle] is extracted from the standard text library, the polyphones it contains can be determined through the polyphone identifier carried in the initial text. This example describes only the processing of the polyphone at position 1; the other polyphones are processed in the same way, and the description is illustrative rather than limiting.
Based on this, the initial text to be processed is [(1) - facing the sun sitting vehicle]. The initial text is input to the pinyin generation module, which outputs several pinyin sequences, namely a first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1} and a second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1}. The first pinyin sequence is selected first to check the pinyin of the polyphone. According to the polyphone identifier, the polyphone is the second Chinese character of the initial text. Taking one character before and one character after the polyphone gives the first element phrase; taking two characters on each side gives the second element phrase; taking three characters on each side gives the third element phrase; and taking four characters on each side gives the fourth element phrase. Because only one character precedes the polyphone, the four element phrases consist of the first three, four, five and six characters of the initial text, respectively.
In summary, so that the pinyin of the polyphone can be verified, at least one element phrase is generated from the initial text for the subsequent analysis. This ensures that the constructed element phrases do not deviate from the meaning expressed by the initial text, which improves the accuracy of the verification.
Step S106, determining the phrase pinyin sequence of the element phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence.
Specifically, once the ith pinyin sequence corresponding to the initial text and the element phrase are determined, a reference phrase for checking the correctness of the polyphone's pinyin is created according to the ith pinyin sequence. The reference phrase is the phrase that is compared with the element phrase: if the character at the polyphone's position in the reference phrase is the same as the polyphone in the element phrase, the pinyin of the polyphone in the ith pinyin sequence is correct; otherwise it is incorrect. In other words, the reference phrase is the standard against which the pinyin of the polyphone is judged right or wrong. Before the reference phrase is generated, the phrase pinyin sequence of the element phrase must be determined; the reference phrase is then generated from the phrase pinyin sequence, so that the element phrase and the reference phrase can be compared.
Based on the above, since the correctness of the polyphone's pinyin in the element phrase needs to be checked, the check can be performed by generating a reference phrase from the phrase pinyin sequence of the element phrase and comparing the two. If the element phrase and the reference phrase are consistent, the pinyin of the polyphone in the phrase pinyin sequence is correct, which means that the ith pinyin sequence of the initial text is correct and can be used for the subsequent generation of the text pinyin group.
In this process, the pinyin of the polyphone in the initial text can be verified only if the reference phrase that the text generation module generates from the phrase pinyin sequence is reliable. For this reason, the text generation module provided in this embodiment is implemented using a cloud input method.
In addition, in practical applications the text generation module may also be implemented using a text processing model from the machine learning field. It should be noted that such a module should be used only if the accuracy of the reference phrases it generates from phrase pinyin sequences can be ensured, so that the pinyin of the polyphone in the initial text can be checked accurately.
Further, since the element phrase is constructed from characters in the initial text, the phrase pinyin sequence of the element phrase can be determined according to the ith pinyin sequence. In this embodiment, the specific implementation is as follows:
preprocessing the initial text to obtain a plurality of initial characters, and preprocessing the element phrase to obtain a plurality of meta characters;
determining the pinyin of each initial character in the plurality of initial characters according to the ith pinyin sequence;
determining the pinyin of each meta character in the plurality of meta characters based on the pinyin of each initial character in the plurality of initial characters;
and generating the phrase pinyin sequence according to the pinyin of each meta character in the plurality of meta characters.
Specifically, preprocessing refers to splitting the initial text and the element phrase into individual characters. Based on this, the initial text is first split to obtain a plurality of initial characters, and the element phrase is split to obtain a plurality of meta characters. Second, the pinyin of each initial character is determined according to the ith pinyin sequence. Third, the pinyin of each meta character is determined based on the pinyin of the initial characters. Finally, the phrase pinyin sequence is generated according to the pinyin of each meta character.
Continuing the above example, given the first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1} and the first to fourth element phrases, the initial text "facing the sun sitting vehicle" is first split into its six initial characters (glossed here as facing, landing, toward, sun, sit, car). The element phrases are split in the same way: the first element phrase yields the meta characters (facing, landing, toward), the second element phrase yields (facing, landing, toward, sun), the third element phrase yields (facing, landing, toward, sun, sit), and the fourth element phrase yields (facing, landing, toward, sun, sit, car).
Then, according to the first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1}, the pinyin of each initial character is determined as ("facing" - "ying2", "landing" - "zhao2", "toward" - "chao1", "sun" - "yang2", "sit" - "zuo4", "car" - "che1"). The pinyin of each meta character in each element phrase is then determined from the pinyin of the initial characters: the meta characters of the first element phrase have the pinyin ("ying2", "zhao2", "chao1"), those of the second element phrase ("ying2", "zhao2", "chao1", "yang2"), those of the third element phrase ("ying2", "zhao2", "chao1", "yang2", "zuo4"), and those of the fourth element phrase ("ying2", "zhao2", "chao1", "yang2", "zuo4", "che1").
Finally, based on the pinyin of each meta character in each element phrase, the first phrase pinyin sequence of the first element phrase is {ying2-zhao2-chao1}, the second phrase pinyin sequence of the second element phrase is {ying2-zhao2-chao1-yang2}, the third phrase pinyin sequence of the third element phrase is {ying2-zhao2-chao1-yang2-zuo4}, and the fourth phrase pinyin sequence of the fourth element phrase is {ying2-zhao2-chao1-yang2-zuo4-che1}. The four phrase pinyin sequences are then input into the cloud input method to generate the reference phrases.
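Because every meta character is also an initial character at a known offset, the phrase pinyin sequence is a contiguous slice of the ith pinyin sequence. A minimal sketch follows; the function name and its validation checks are illustrative, and the Chinese sentence is an assumed reconstruction of the translated example text.

```python
from typing import List, Sequence


def phrase_pinyin_sequence(
    text: str, pinyin_sequence: Sequence[str], start: int, phrase: str
) -> List[str]:
    """Return the pinyin of `phrase`, which occupies text[start:start+len(phrase)]."""
    if len(text) != len(pinyin_sequence):
        raise ValueError("pinyin sequence must have one entry per character")
    end = start + len(phrase)
    if text[start:end] != phrase:
        raise ValueError("phrase is not found at the given offset")
    return list(pinyin_sequence[start:end])


first_sequence = ["ying2", "zhao2", "chao1", "yang2", "zuo4", "che1"]
first_phrase_pinyin = phrase_pinyin_sequence("迎着朝阳坐车", first_sequence, 0, "迎着朝")
```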
In conclusion, generating the reference phrase by feeding the phrase pinyin sequence of the element phrase to the text generation module improves the accuracy of the reference phrase, allows the pinyin of the polyphone in the element phrase to be verified, and effectively improves the verification accuracy, which in turn improves the efficiency of subsequently updating or constructing the polyphone text library.
Further, after the reference phrase is obtained, the reference phrase and the element phrase are compared. If the reference phrase is inconsistent with the element phrase, step S108 is executed; if the reference phrase is consistent with the element phrase, step S110 is executed.
It should be noted that comparing the reference phrase and the element phrase actually means checking whether the polyphone of the element phrase appears in the reference phrase and whether it occupies the same position in both phrases, which helps determine whether the pinyin of the polyphone in the ith pinyin sequence is correct.
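The comparison described above can be sketched as a check on a single character position. The function name and the handling of out-of-range offsets are assumptions; the Chinese strings are assumed reconstructions of the translated example (the element phrase and a homophone phrase that an input method might return for the wrong reading).

```python
def is_consistent(element_phrase: str, reference_phrase: str, polyphone_offset: int) -> bool:
    """Check that the polyphone appears in the reference phrase at the same offset.

    `polyphone_offset` is the zero-based position of the polyphone inside
    the element phrase. Other characters are deliberately not compared.
    """
    if not 0 <= polyphone_offset < len(element_phrase):
        raise ValueError("offset is outside the element phrase")
    if polyphone_offset >= len(reference_phrase):
        return False
    return reference_phrase[polyphone_offset] == element_phrase[polyphone_offset]
```

Comparing only the polyphone position tolerates harmless homophone substitutions elsewhere in the reference phrase.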
Step S108, when the element phrase and the reference phrase are inconsistent, incrementing i by 1 and returning to step S104.
Specifically, when the element phrase is inconsistent with the reference phrase, the polyphone in the element phrase differs from the corresponding character in the reference phrase, which indicates that the pinyin of the polyphone in the ith pinyin sequence is wrong. In this case i is incremented by 1, the next pinyin sequence of the initial text (the pinyin sequence generated from another pronunciation of the polyphone) is determined, and step S104 is executed again to repeat the pinyin verification of the polyphone.
Continuing the above example, suppose the first reference phrase is "reflection super", the second reference phrase is "reflection super-poplar", the third reference phrase is "reflection super-poplar", and the fourth reference phrase is "reflection super-poplar sitting car". The first reference phrase is compared with the first element phrase, the second with the second, the third with the third, and the fourth with the fourth. All four comparison results are inconsistent, so the pinyin "zhao2" generated for the polyphone in the first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1} is considered incorrect. The second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1} is then selected for verification; the specific processing is the same as that described for the first pinyin sequence.
In addition, after i is incremented by 1, it may turn out that all pinyin sequences corresponding to the initial text have already been verified, that is, the incremented value of i is greater than the number of pronunciations of the polyphone, so that no further verification can be performed. This indicates that no correct pronunciation was found for the initial text. In this case, the initial text may be deleted from the standard text library and written into a non-standard text library for use in other services. In this embodiment, the specific implementation is as follows:
detecting whether the (i+1)th pinyin sequence is consistent with the ith pinyin sequence;
if not, executing the step of constructing at least one element phrase containing the polyphone according to the polyphone identifier and the initial text;
and if so, writing the initial text into a non-standard text library.
Specifically, the non-standard text library is a text library for temporarily storing initial texts that cannot be used. When the (i+1)th pinyin sequence is detected to be inconsistent with the ith pinyin sequence, the pinyin of the polyphone in the (i+1)th pinyin sequence has not yet been verified, and the verification process is executed again. When the (i+1)th pinyin sequence is detected to be consistent with the ith pinyin sequence, all pinyin sequences corresponding to the initial text have been verified without finding the correct pinyin of the polyphone. The initial text can then be deleted from the standard text library and either added to the non-standard text library or discarded, so that storage resources of the standard text library are released in time and not wasted.
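The loop formed by steps S104 to S108, together with the fallback to the non-standard text library, can be sketched as follows. Several simplifications are assumptions: the end of the candidate list stands in for the "(i+1)th sequence equals the ith sequence" check, the any-match policy is used, and `stub_text_generator` is a hypothetical stand-in for the cloud input method that only knows the worked example (its homophone string is an assumed reconstruction of the translated reference phrases).

```python
from typing import Callable, List, Optional, Sequence


def verify_initial_text(
    text: str,
    position: int,
    candidate_sequences: Sequence[Sequence[str]],
    generate_reference: Callable[[Sequence[str]], str],
    max_window: int = 4,
) -> Optional[List[str]]:
    """Return the first candidate sequence that passes verification, else None."""
    for sequence in candidate_sequences:            # i = 1, 2, ...
        for window in range(1, max_window + 1):     # one element phrase per window
            start = max(0, position - window)
            end = min(len(text), position + window + 1)
            reference = generate_reference(sequence[start:end])
            offset = position - start
            # Consistent if the polyphone sits at the same offset in the reference.
            if offset < len(reference) and reference[offset] == text[position]:
                return list(sequence)
    return None


def stub_text_generator(phrase_pinyin: Sequence[str]) -> str:
    """Hypothetical input-method stub; valid only for phrases starting at offset 0."""
    source = "映照超杨坐车" if "zhao2" in phrase_pinyin else "迎着朝阳坐车"
    return source[: len(phrase_pinyin)]


standard_library = ["迎着朝阳坐车"]
non_standard_library: List[str] = []
candidates = [
    ["ying2", "zhao2", "chao1", "yang2", "zuo4", "che1"],
    ["ying2", "zhe5", "chao1", "yang2", "zuo4", "che1"],
]

text = standard_library[0]
verified = verify_initial_text(text, 1, candidates, stub_text_generator)
if verified is None:
    # No reading could be verified: move the text out of the standard library.
    standard_library.remove(text)
    non_standard_library.append(text)
```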
Step S110, when the element phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into the polyphone text library.
Specifically, when the element phrase is consistent with the reference phrase, the polyphone in the element phrase is identical to the corresponding character in the reference phrase, which indicates that the pinyin of the polyphone in the ith pinyin sequence is correct. A text pinyin group can then be created according to the polyphone identifier, the initial text and the ith pinyin sequence, and written into the polyphone text library. The text pinyin group is a combined expression of the polyphone identifier, the initial text and the pinyin of the polyphone, such as the text pinyin groups in the polyphone text library in fig. 2.
Further, the specific process of generating the text pinyin group is as follows:
determining the pinyin position of the pinyin corresponding to the polyphone in the ith pinyin sequence based on the polyphone identifier;
extracting the pinyin corresponding to the polyphone from the ith pinyin sequence according to the pinyin position;
and integrating the initial text, the polyphone identifier and the pinyin corresponding to the polyphone to obtain the text pinyin group.
Specifically, because the pinyin contained in the ith pinyin sequence corresponds to each character in the initial text, the pinyin position of the pinyin corresponding to the polyphone in the ith pinyin sequence can be determined through the polyphone identification, the pinyin corresponding to the polyphone is extracted from the ith pinyin sequence according to the pinyin position, and finally the initial text, the polyphone identification and the pinyin corresponding to the polyphone are integrated, so that the text pinyin group can be obtained.
Continuing the above example, the second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1} is selected for verification, and the comparison shows that the pinyin "zhe5" generated for the polyphone in this sequence is correct. A text pinyin group can then be generated by combining the polyphone identifier and the initial text: according to the polyphone identifier "1", the pinyin of the polyphone is located at the second position of the second pinyin sequence; the pinyin "zhe5" is extracted from that position; and the polyphone identifier "1", the initial text "facing the sun sitting vehicle" and the pinyin "zhe5" are integrated into a text pinyin group, which is written into the polyphone text library for use by downstream services.
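Creating the text pinyin group amounts to indexing the verified pinyin sequence at the polyphone's position and bundling the result with the identifier and the text. The record layout below (a frozen dataclass with three fields) is a hypothetical choice, since the actual layout is only shown in fig. 2; the Chinese sentence is an assumed reconstruction of the example text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TextPinyinGroup:
    polyphone_position: int   # polyphone identifier (zero-based position)
    initial_text: str
    polyphone_pinyin: str     # toned pinyin of the polyphone


def create_text_pinyin_group(
    text: str, position: int, pinyin_sequence: Sequence[str]
) -> TextPinyinGroup:
    """Extract the polyphone's pinyin by position and build the group."""
    if len(text) != len(pinyin_sequence):
        raise ValueError("pinyin sequence must have one entry per character")
    return TextPinyinGroup(position, text, pinyin_sequence[position])


polyphone_text_library: List[TextPinyinGroup] = []
second_sequence = ["ying2", "zhe5", "chao1", "yang2", "zuo4", "che1"]
polyphone_text_library.append(
    create_text_pinyin_group("迎着朝阳坐车", 1, second_sequence)
)
```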
In summary, the text pinyin group is created by integrating the initial text, the polyphone identifier and the pinyin corresponding to the polyphone, so that the normalization of the text pinyin group can be ensured, and the regularity of the data of the polyphone text library is further ensured, thereby being convenient for quick calling and use when the downstream service is used, and effectively improving the service completion efficiency of the downstream service.
In addition, after the updating or construction of the polyphone text library is completed, the text pinyin groups it contains can be used to improve the efficiency of downstream services. In this embodiment, the downstream service is described by taking a model training service as an example; for the specific implementation, refer to the flow chart of the model training process shown in fig. 6:
Step S1102, when a reading request submitted for the polyphone text library is received, reading a training text pinyin group from the polyphone text library according to the reading request;
Step S1104, parsing the training text pinyin group to obtain a training initial text and a training pinyin sequence;
Step S1106, training the initial pinyin labeling model based on the training initial text and the training pinyin sequence to obtain a target pinyin labeling model.
Specifically, receiving a reading request submitted for the polyphone text library indicates that the text pinyin groups in the polyphone text library are needed to train a model. The reading request is parsed to determine the number of text pinyin groups to be read, and the training text pinyin groups used for training the initial pinyin labeling model are read from the polyphone text library accordingly. The initial pinyin labeling model converts the characters in a text into pinyin; to improve the accuracy of this conversion, the model is trained to rely on semantic analysis, so that the labeled pinyin is the correct pinyin for the text.
Based on the above, after the training text pinyin group is obtained, it is parsed to obtain the training initial text, the training pinyin sequence and the polyphone identifier it contains. The training initial text is used as the input of the initial pinyin labeling model and the training pinyin sequence as its expected output, and the initial pinyin labeling model is trained until a target pinyin labeling model meeting the use requirement is obtained.
In practical application, when the initial pinyin labeling model is trained by training the initial text and training the pinyin sequence, whether to stop training can be determined by monitoring the loss function value or whether to stop training can be determined by monitoring the model output accuracy, so that the target pinyin labeling model meeting the use requirement can be obtained.
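The parsing step and the two stopping criteria can be sketched without committing to any particular model. The dictionary keys of the stored group and the threshold values are hypothetical; the specification does not state how groups are serialized or which thresholds are used.

```python
from typing import Dict, List, Tuple


def to_training_pair(group: Dict[str, object]) -> Tuple[str, List[str]]:
    """Parse a training text pinyin group into (model input, expected output).

    The keys "initial_text" and "pinyin_sequence" are assumed field names.
    """
    text = str(group["initial_text"])
    pinyin = list(group["pinyin_sequence"])  # type: ignore[arg-type]
    if len(text) != len(pinyin):
        raise ValueError("expected one pinyin per character")
    return text, pinyin


def should_stop(
    loss: float, accuracy: float, max_loss: float = 0.05, min_accuracy: float = 0.98
) -> bool:
    """Stop when the loss is low enough or the output accuracy is high enough.

    Both thresholds are illustrative values, not taken from the specification.
    """
    return loss <= max_loss or accuracy >= min_accuracy


pair = to_training_pair({
    "polyphone_identifier": (1,),
    "initial_text": "迎着朝阳坐车",
    "pinyin_sequence": ["ying2", "zhe5", "chao1", "yang2", "zuo4", "che1"],
})
```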
According to the text processing method provided by this specification, after an initial text containing a polyphone is obtained, a pinyin sequence of the initial text is determined, and at least one element phrase containing the polyphone is constructed based on the polyphone identifier carried by the initial text. The phrase pinyin sequence of the element phrase is then determined from the obtained pinyin sequence, a reference phrase is generated from the phrase pinyin sequence, and the correctness of the pinyin sequence is checked by comparing the reference phrase with the element phrase. If the two are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until they are consistent, at which point the correct pinyin of the polyphone in the initial text has been determined. The pinyin sequence, the polyphone identifier and the initial text are then integrated into a text pinyin group, which is written into the polyphone text library. In this way, the correct pinyin of a polyphone in the initial text is determined by verification rather than by manual labeling, which saves manpower and material resources, ensures the accuracy of the created text pinyin groups, and yields a high-quality polyphone text library that further improves the efficiency and quality of downstream services.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a text processing device, and fig. 7 shows a schematic structural diagram of a text processing device according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
An obtaining module 702, configured to obtain an initial text carrying a polyphone identifier, where the initial text includes at least one polyphone;
A determining module 704, configured to determine an i-th pinyin sequence corresponding to the initial text, and construct at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text, where i is a positive integer starting from 1;
A processing module 706, configured to determine a phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence, and input the phrase pinyin sequence to a text generation module for processing, so as to obtain a reference phrase corresponding to the phrase pinyin sequence;
In a case where the meta phrase and the reference phrase are inconsistent, i is increased by 1, and the determining module 704 is run again;
In a case where the meta phrase and the reference phrase are consistent, a writing module 708 is run, the writing module 708 being configured to create a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and write the text pinyin group into a polyphone text library.
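The interaction of the modules above can be sketched as a small verification loop. This is only an illustration under stated assumptions: the names `verify_pinyin`, `toy_generate_pinyin` and `toy_generate_text` are hypothetical, the pinyin generation module and text generation module are replaced by toy lookup functions, each meta phrase is simply the polyphone plus its preceding character, and the `max_rounds` cap is an added safeguard that the specification does not describe.

```python
from typing import Callable, List, Optional, Tuple

PinyinSeq = List[str]


def verify_pinyin(
    text: str,
    polyphone_positions: List[int],
    generate_pinyin: Callable[[str, int], PinyinSeq],
    generate_text: Callable[[PinyinSeq], str],
    max_rounds: int = 5,  # assumption: a cap on retries, not part of the source
) -> Optional[Tuple[str, List[int], PinyinSeq]]:
    """Return (text, positions, pinyin) once every meta phrase round-trips."""
    for i in range(1, max_rounds + 1):
        # i-th pinyin sequence, assumed to hold one syllable per character
        pinyin = generate_pinyin(text, i)
        consistent = True
        for pos in polyphone_positions:
            start = max(0, pos - 1)
            meta_phrase = text[start:pos + 1]        # neighbour + polyphone
            phrase_pinyin = pinyin[start:pos + 1]    # pinyin of the meta phrase
            reference_phrase = generate_text(phrase_pinyin)
            if reference_phrase != meta_phrase:
                consistent = False                   # retry with i + 1
                break
        if consistent:
            return text, polyphone_positions, pinyin
    return None


# Toy stand-ins for the two modules (hypothetical data, for illustration only).
def toy_generate_pinyin(text: str, i: int) -> PinyinSeq:
    candidates = [
        ["yin2", "xing2", "hen3", "da4"],  # first attempt misreads the polyphone
        ["yin2", "hang2", "hen3", "da4"],  # second attempt is correct
    ]
    return candidates[min(i, len(candidates)) - 1]


def toy_generate_text(phrase_pinyin: PinyinSeq) -> str:
    table = {("yin2", "hang2"): "银行"}
    return table.get(tuple(phrase_pinyin), "")


result = verify_pinyin("银行很大", [1], toy_generate_pinyin, toy_generate_text)
```

In this toy run the first pinyin sequence fails the round-trip check, so the loop requests a second sequence, which passes and is returned together with the text and the polyphone positions.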
In an alternative embodiment, the text processing device further includes:
A collection module, configured to collect a text to be processed, and normalize the text to be processed to obtain a standard text;
a marking module, configured to determine standard polyphones in the standard text based on a preset polyphone dictionary, and mark the standard polyphones;
and a standard text library writing module, configured to obtain, according to the marking result, a standard text carrying the polyphone identifier, and write the standard text carrying the polyphone identifier into the standard text library.
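A minimal sketch of the collection, marking and writing steps just listed follows. The dictionary contents, the record layout and the choice of NFKC normalization plus whitespace removal are assumptions made for illustration; the specification does not state what normalization consists of or how the identifier is stored.

```python
import unicodedata
from typing import Dict, List, Set

# Hypothetical preset polyphone dictionary (a tiny sample, not from the source).
POLYPHONE_DICT: Set[str] = {"行", "长", "乐"}


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFKC folding, then dropping all whitespace."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def mark_polyphones(standard_text: str) -> Dict[str, object]:
    """Record the character position of every polyphone found in the text."""
    positions = [idx for idx, ch in enumerate(standard_text) if ch in POLYPHONE_DICT]
    return {"text": standard_text, "polyphone_positions": positions}


# An in-memory list stands in for the standard text library.
standard_text_library: List[Dict[str, object]] = []

record = mark_polyphones(normalize(" 银行 很大 "))
if record["polyphone_positions"]:
    standard_text_library.append(record)
```

Storing positions rather than the characters themselves matches the later statement that the polyphone identifier marks the character position of each polyphone.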
In an alternative embodiment, the obtaining module 702 is further configured to:
In a case where an update request for updating the polyphone text library is received, extract the initial text carrying the polyphone identifier from the standard text library based on the update request, wherein the polyphone identifier marks the character position of at least one polyphone contained in the initial text.
In an alternative embodiment, the determining module 704 is further configured to:
input the initial text to a pinyin generation module for processing, and obtain the i-th pinyin sequence corresponding to the initial text as output by the pinyin generation module, where i is a positive integer starting from 1.
In an alternative embodiment, the determining module 704 includes:
a character position determining unit, configured to determine the character position of the polyphone in the initial text based on the polyphone identifier;
an adjacent character determining unit, configured to determine adjacent character positions adjacent to the character position through a preset selection strategy, and determine the adjacent characters corresponding to the adjacent character positions according to the initial text;
and a phrase forming unit, configured to construct at least one meta phrase composed of the adjacent characters and the polyphone, in the order in which the adjacent characters and the polyphone appear in the initial text.
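The three units above can be illustrated with a short function. The selection strategy shown here (take up to `window` characters on each side of the polyphone and form one phrase per side) is an assumed example; the specification only says that a preset selection strategy is used. The function name and parameter are hypothetical.

```python
from typing import List


def build_meta_phrases(text: str, position: int, window: int = 1) -> List[str]:
    """Build meta phrases from a polyphone and its neighbours, keeping text order."""
    phrases: List[str] = []
    left = max(0, position - window)
    right = min(len(text), position + window + 1)
    if left < position:
        # preceding character(s) followed by the polyphone
        phrases.append(text[left:position + 1])
    if position + 1 < right:
        # the polyphone followed by the next character(s)
        phrases.append(text[position:right])
    return phrases
```

For the text "银行很大" with the polyphone at position 1, this yields the two candidate phrases "银行" and "行很"; a polyphone at the start or end of the text yields only one phrase.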
In an alternative embodiment, the processing module 706 includes:
a preprocessing unit, configured to preprocess the initial text to obtain a plurality of initial characters, and preprocess the meta phrase to obtain a plurality of meta characters;
a first pinyin determining unit, configured to determine the pinyin of each of the plurality of initial characters based on the i-th pinyin sequence;
a second pinyin determining unit, configured to determine the pinyin of each of the plurality of meta characters based on the pinyin of each of the plurality of initial characters;
and a phrase pinyin sequence generating unit, configured to generate the phrase pinyin sequence according to the pinyin of each of the plurality of meta characters.
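As a rough sketch of these four units, the function below splits the initial text into characters, pairs each character with its pinyin, and then reads off the pinyin of the meta characters. It assumes one pinyin syllable per character and that the start offset of the meta phrase in the initial text is known; both are assumptions, and the function name is hypothetical.

```python
from typing import List


def phrase_pinyin_sequence(
    text: str, pinyin_seq: List[str], meta_phrase: str, start: int
) -> List[str]:
    """Derive the pinyin of a meta phrase from the pinyin of the full text."""
    initial_chars = list(text)                 # preprocessing: split into characters
    if len(initial_chars) != len(pinyin_seq):
        raise ValueError("expected one pinyin syllable per character")
    char_pinyin = list(zip(initial_chars, pinyin_seq))
    meta_chars = list(meta_phrase)
    window = char_pinyin[start:start + len(meta_chars)]
    if [ch for ch, _ in window] != meta_chars:
        raise ValueError("meta phrase does not occur at the given offset")
    return [py for _, py in window]
```

Checking that the characters at the offset really match the meta phrase guards against pairing a phrase with the pinyin of a different occurrence of the same characters.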
In an alternative embodiment, the text processing device further includes:
a detection module, configured to detect whether the (i+1)-th pinyin sequence is consistent with the i-th pinyin sequence;
if they are inconsistent, the determining module 704 is run;
and if they are consistent, a text library writing module is run, the text library writing module being configured to write the initial text into a non-standard text library.
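The detection step can be read as a stop condition for the retry loop: if the pinyin generation module returns the same sequence again, another round of verification cannot succeed, so the text is set aside. A minimal sketch follows, with a hypothetical function name and an in-memory list standing in for the non-standard text library.

```python
from typing import List


def handle_retry(
    text: str,
    previous_pinyin: List[str],
    new_pinyin: List[str],
    non_standard_library: List[str],
) -> bool:
    """Return True to keep verifying with new_pinyin, False if the text was set aside."""
    if new_pinyin == previous_pinyin:
        # The generator produced nothing new, so the text is parked for later handling.
        non_standard_library.append(text)
        return False
    return True
```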
In an alternative embodiment, the writing module 708 includes:
a pinyin position determining unit, configured to determine, based on the polyphone identifier, the position in the i-th pinyin sequence of the pinyin corresponding to the polyphone;
a pinyin extraction unit, configured to extract the pinyin corresponding to the polyphone from the i-th pinyin sequence according to the pinyin position;
and an integration unit, configured to integrate the initial text, the polyphone identifier and the pinyin corresponding to the polyphone to obtain the text pinyin group.
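A sketch of how a text pinyin group might be assembled from these three units is given below. It assumes the pinyin sequence holds one syllable per character, so that the pinyin position equals the character position marked by the polyphone identifier; the dictionary field names are hypothetical.

```python
from typing import Dict, List


def create_text_pinyin_group(
    text: str, polyphone_positions: List[int], pinyin_seq: List[str]
) -> Dict[str, object]:
    """Bundle the text, the polyphone positions and the verified polyphone pinyin."""
    if len(pinyin_seq) != len(text):
        raise ValueError("expected one pinyin syllable per character")
    # Pinyin position is taken to be the same index as the character position.
    polyphone_pinyin = [pinyin_seq[pos] for pos in polyphone_positions]
    return {
        "text": text,
        "polyphone_positions": list(polyphone_positions),
        "polyphone_pinyin": polyphone_pinyin,
    }
```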
In an alternative embodiment, the text processing device further includes:
a reading module, configured to, in a case where a read request submitted for the polyphone text library is received, read a training text pinyin group from the polyphone text library according to the read request;
an analysis module, configured to parse the training text pinyin group to obtain a training initial text and a training pinyin sequence;
and a training module, configured to train an initial pinyin annotation model based on the training initial text and the training pinyin sequence to obtain a target pinyin annotation model.
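The reading and analysis steps can be sketched as turning stored groups into (text, label) training pairs; the training step itself depends on the chosen model and is not shown. The record layout and field names are assumptions, and the label format (one entry per character, `None` for characters that are not marked polyphones) is only one possible choice.

```python
from typing import Dict, List, Optional, Tuple

TrainingPair = Tuple[str, List[Optional[str]]]


def parse_training_groups(library: List[Dict[str, object]]) -> List[TrainingPair]:
    """Convert text pinyin groups into per-character label sequences."""
    pairs: List[TrainingPair] = []
    for group in library:
        text = str(group["text"])
        positions = list(group["polyphone_positions"])  # type: ignore[arg-type]
        pinyins = list(group["polyphone_pinyin"])       # type: ignore[arg-type]
        labels: List[Optional[str]] = [None] * len(text)
        for pos, py in zip(positions, pinyins):
            labels[pos] = py                            # label only the polyphones
        pairs.append((text, labels))
    return pairs


# Hypothetical library content for illustration.
toy_library = [
    {"text": "银行很大", "polyphone_positions": [1], "polyphone_pinyin": ["hang2"]},
]
training_pairs = parse_training_groups(toy_library)
```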
In an alternative embodiment, the initial text is an initial Chinese text, and each pinyin contained in the i-th pinyin sequence carries a tone.
According to the text processing device provided in this embodiment, after an initial text containing polyphones is obtained, the pinyin sequence of the initial text is determined, and at least one meta phrase containing a polyphone is constructed based on the polyphone identifier carried by the initial text. The phrase pinyin sequence of the meta phrase is then determined from the obtained pinyin sequence, and a reference phrase is generated from the phrase pinyin sequence. The correctness of the pinyin sequence is checked by comparing the reference phrase with the meta phrase. If the two are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until the check result is consistent, at which point the correct pinyin of the polyphones in the initial text has been determined. When the check result is consistent, the pinyin sequence, the polyphone identifier and the initial text are integrated into a text pinyin group, which is written into the polyphone text library. In this way, when the polyphones in the initial text are annotated with pinyin, the correct pinyin can be determined by verification, which saves manpower and material resources and effectively ensures the accuracy of the resulting text pinyin group, so that a higher-quality polyphone text library can be built for downstream services.
The above is an exemplary scheme of the text processing apparatus of this embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept; for details of the text processing apparatus that are not described here, refer to the description of the technical solution of the text processing method.
Fig. 8 illustrates a block diagram of a computing device 800 provided in accordance with an embodiment of the present specification. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830, and database 850 is used to store data.
Computing device 800 also includes access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein processor 820 is configured to execute computer-executable instructions for:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta phrase is inconsistent with the reference phrase, increasing i by 1 and executing the step of determining the i-th pinyin sequence corresponding to the initial text;
and in a case where the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.
The foregoing is an exemplary scheme of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method belong to the same concept; for details of the computing device that are not described here, refer to the description of the technical solution of the text processing method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, are configured to:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta phrase is inconsistent with the reference phrase, increasing i by 1 and executing the step of determining the i-th pinyin sequence corresponding to the initial text;
and in a case where the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.
The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the text processing method belong to the same concept; for details of the storage medium that are not described here, refer to the description of the technical solution of the text processing method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content covered by the computer-readable medium may be appropriately expanded or reduced according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are each described as a series of combined actions, but those skilled in the art should understand that the present specification is not limited by the described order of actions, because some steps may be performed in another order or simultaneously in accordance with the present specification. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the present specification.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to help explain the present specification. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of the content of this specification. The embodiments were chosen and described in order to best explain the principles and practical application of the present specification, so that those skilled in the art can best understand and use it. This specification is limited only by the claims and their full scope and equivalents.

Claims (13)

1. A text processing method, comprising:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta phrase is inconsistent with the reference phrase, increasing i by 1 and executing the step of determining the i-th pinyin sequence corresponding to the initial text;
and in a case where the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.
2. The text processing method according to claim 1, wherein, before the step of acquiring the initial text carrying the polyphone identifier is performed, the method further comprises:
collecting a text to be processed, and normalizing the text to be processed to obtain a standard text;
determining standard polyphones in the standard text based on a preset polyphone dictionary, and marking the standard polyphones;
and obtaining, according to the marking result, a standard text carrying the polyphone identifier, and writing the standard text carrying the polyphone identifier into a standard text library.
3. The text processing method according to claim 2, wherein the acquiring the initial text carrying the polyphone identifier comprises:
in a case where an update request for updating the polyphone text library is received, extracting the initial text carrying the polyphone identifier from the standard text library based on the update request, wherein the polyphone identifier marks the character position of at least one polyphone contained in the initial text.
4. The text processing method according to claim 1, wherein the determining the i-th pinyin sequence corresponding to the initial text comprises:
inputting the initial text to a pinyin generation module for processing, and obtaining the i-th pinyin sequence corresponding to the initial text as output by the pinyin generation module, wherein i is a positive integer starting from 1.
5. The text processing method according to claim 1, wherein the constructing at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text comprises:
determining the character position of the polyphone in the initial text based on the polyphone identifier;
determining adjacent character positions adjacent to the character position through a preset selection strategy, and determining the adjacent characters corresponding to the adjacent character positions according to the initial text;
and constructing at least one meta phrase composed of the adjacent characters and the polyphone, in the order in which the adjacent characters and the polyphone appear in the initial text.
6. The text processing method according to claim 1, wherein the determining the phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence comprises:
preprocessing the initial text to obtain a plurality of initial characters, and preprocessing the meta phrase to obtain a plurality of meta characters;
determining the pinyin of each of the plurality of initial characters according to the i-th pinyin sequence;
determining the pinyin of each of the plurality of meta characters based on the pinyin of each of the plurality of initial characters;
and generating the phrase pinyin sequence according to the pinyin of each of the plurality of meta characters.
7. The text processing method according to claim 1, wherein, after i is increased by 1 and the step of determining the i-th pinyin sequence corresponding to the initial text is executed in the case where the meta phrase is inconsistent with the reference phrase, the method further comprises:
detecting whether the (i+1)-th pinyin sequence is consistent with the i-th pinyin sequence;
if not, executing the step of constructing at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text;
and if so, writing the initial text into a non-standard text library.
8. The text processing method according to claim 1, wherein the creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence comprises:
determining, based on the polyphone identifier, the position in the i-th pinyin sequence of the pinyin corresponding to the polyphone;
extracting the pinyin corresponding to the polyphone from the i-th pinyin sequence according to the pinyin position;
and integrating the initial text, the polyphone identifier and the pinyin corresponding to the polyphone to obtain the text pinyin group.
9. The text processing method according to claim 1, wherein, after the step of creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence and writing the text pinyin group into a polyphone text library is performed, the method further comprises:
in a case where a read request submitted for the polyphone text library is received, reading a training text pinyin group from the polyphone text library according to the read request;
parsing the training text pinyin group to obtain a training initial text and a training pinyin sequence;
and training an initial pinyin annotation model based on the training initial text and the training pinyin sequence to obtain a target pinyin annotation model.
10. The text processing method according to claim 1, wherein the initial text is an initial Chinese text, and each pinyin contained in the i-th pinyin sequence carries a tone.
11. A text processing apparatus, comprising:
an obtaining module, configured to acquire an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
a determining module, configured to determine an i-th pinyin sequence corresponding to the initial text, and construct at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
a processing module, configured to determine a phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence, and input the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta phrase is inconsistent with the reference phrase, i is increased by 1 and the determining module is run;
and in a case where the meta phrase is consistent with the reference phrase, a writing module is run, the writing module being configured to create a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and write the text pinyin group into a polyphone text library.
12. A computing device, comprising:
a memory and a processor;
The memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method of:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an i-th pinyin sequence corresponding to the initial text, and constructing at least one meta phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the meta phrase according to the i-th pinyin sequence, and inputting the phrase pinyin sequence into a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta phrase is inconsistent with the reference phrase, increasing i by 1 and executing the step of determining the i-th pinyin sequence corresponding to the initial text;
and in a case where the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the i-th pinyin sequence, and writing the text pinyin group into a polyphone text library.
13. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the text processing method of any one of claims 1 to 10.
CN202011133952.2A 2020-10-21 2020-10-21 Text processing method and device Active CN112257420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011133952.2A CN112257420B (en) 2020-10-21 2020-10-21 Text processing method and device

Publications (2)

Publication Number Publication Date
CN112257420A (en) 2021-01-22
CN112257420B (en) 2024-06-18

Family

ID=74264493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011133952.2A Active CN112257420B (en) 2020-10-21 2020-10-21 Text processing method and device

Country Status (1)

Country Link
CN (1) CN112257420B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000010964A (en) * 1998-06-17 2000-01-14 Toshiba Corp Chinese input conversion processor, chinese input conversion processing method and recording medium recording chinese input conversion processing program
US7478033B2 (en) * 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters
CN105404621B (en) * 2015-09-25 2018-07-10 中国科学院计算技术研究所 A kind of method and system that Chinese character is read for blind person
CN109977361A (en) * 2019-03-01 2019-07-05 广州多益网络股份有限公司 A kind of Chinese phonetic alphabet mask method, device and storage medium based on similar word
CN111667810B (en) * 2020-06-08 2021-10-15 北京有竹居网络技术有限公司 Method and device for acquiring polyphone corpus, readable medium and electronic equipment
CN111798834B (en) * 2020-07-03 2022-03-15 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
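Research and implementation of the maximum entropy algorithm in Chinese pinyin annotation; Zhang Liqing; Shou Yongxi; Ma Zhiqiang; Microelectronics & Computer; 2012-08-05 (08); full text *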

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant