CN111090970A - Text standardization processing method after speech recognition - Google Patents


Info

Publication number
CN111090970A
Authority
CN
China
Prior art keywords
text
badcase
processed
rule set
conversion
Prior art date
Legal status
Granted
Application number
CN201911417452.9A
Other languages
Chinese (zh)
Other versions
CN111090970B (en)
Inventor
邱瑾
时猛
Current Assignee
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN201911417452.9A priority Critical patent/CN111090970B/en
Publication of CN111090970A publication Critical patent/CN111090970A/en
Application granted granted Critical
Publication of CN111090970B publication Critical patent/CN111090970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a text standardization processing method for use after speech recognition. The method comprises: setting a text conversion matching rule set in a badcase module according to the ITN errors, fed back by customer service, that the badcase module collects; inputting the plain language text to be standardized after speech recognition into the badcase module, and outputting the text with inverse text marks; replacing the marked characters in the badcase module's output with a corresponding number of special symbols, the special symbols being chosen from symbols that the neural network model cannot convert; inputting this first processed text into a binary-classification neural network model, outputting a 0/1 label sequence, and determining a confidence that the model can convert the first processed text; selecting different rule sets for processing according to a confidence threshold; and replacing the special symbols in the processed text with the cached words to determine the text standardization result of the plain language text. The embodiment improves the speed and precision of text standardization and is suitable for large-scale data processing.

Description

Text standardization processing method after speech recognition
Technical Field
The invention relates to the field of text processing, and in particular to a text standardization processing method applied after speech recognition.
Background
In speech recognition output, inverse text normalization is typically performed so that spoken forms are written correctly, for example so that a date read out digit by digit is rendered as 2018-08-08 rather than left as "two zero one eight...", while words that should not be converted are left unchanged. Inverse text normalization turns the content that needs conversion into a tagging problem, using a small set of simple rules and some hand-written grammars. For the tagging problem, a compact bidirectional LSTM (long short-term memory) network is used: a label is assigned to each input token in spoken form to obtain the corresponding written-form fragment and the start and end positions for subsequent processing; a written-form string is generated by applying certain edits; and the marked region is processed with a post-processing grammar.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
an inverse text normalization system built entirely from manual rules requires substantial linguistic expertise, the rules cannot exploit the semantic information of the text, and a reasonably accurate rule-based system needs a large, complex, language-specific rule file to be built and maintained, which makes the approach unsuitable for large-scale data. Hybrid methods that combine manual grammars with a statistical model parallelize poorly, may have insufficient grammar coverage, and suffer in both processing quality and time efficiency.
Disclosure of Invention
The invention aims to solve the following problems in the prior art when determining which words need inverse text normalization during text standardization: an inverse text normalization system built entirely from manual rules is unsuitable for large-scale data, while the combination of manual grammars and a statistical model parallelizes poorly, may have insufficient grammar coverage, and yields poor processing quality and low efficiency.
In a first aspect, an embodiment of the present invention provides a text normalization processing method after speech recognition, including:
setting a text conversion matching rule set in the badcase module according to the ITN errors, fed back by customer service, that are collected in the badcase module;
inputting the plain language text to be standardized into the badcase module, caching at least one word of the plain language text when it hits a matching rule in the set, and outputting the text after inverse text marking;
replacing the at least one inverse-text-marked character in the text output by the badcase module with a corresponding number of special symbols to obtain a first processed text, wherein the special symbols are chosen from symbols that the neural network model cannot convert;
inputting the first processed text into a binary-classification neural network model, outputting a 0/1 label sequence, and determining a confidence that the model can convert the first processed text, wherein 0 represents characters not to convert and 1 represents characters to convert;
when the confidence is greater than or equal to a preset threshold, matching the label sequence against a first rule set, and performing text standardization conversion on the characters corresponding to the label 1 to obtain a second processed text;
when the confidence is smaller than the preset threshold, matching the first processed text against a second rule set, and performing text standardization conversion on the plain language text to obtain the second processed text, wherein the number of rules in the first rule set is smaller than that in the second rule set;
and replacing the special symbols in the second processed text with the cached at least one word, and determining a text standardization result of the plain language text.
In a second aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the text standardization processing method after speech recognition according to any embodiment of the invention.
In a third aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the steps of the text normalization processing method after speech recognition according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: without modifying the existing inverse text standardization method, the badcase module can quickly set a highly targeted, case-specific matching rule set, which is better suited to large-scale engineering data processing; the trained neural network model can likewise process large-scale data, learn more information, and improve processing speed and precision.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a text normalization processing method after speech recognition according to an embodiment of the present invention;
fig. 2 is a system flow diagram of a text normalization processing method after speech recognition according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a text normalization processing method after speech recognition according to an embodiment of the present invention, which includes the following steps:
S11: setting a text conversion matching rule set in the badcase module according to the ITN errors, fed back by customer service, that are collected in the badcase module;
S12: inputting the plain language text to be standardized into the badcase module, caching at least one word of the plain language text when it hits a matching rule in the set, and outputting the text after inverse text marking;
S13: replacing the at least one inverse-text-marked character in the text output by the badcase module with a corresponding number of special symbols to obtain a first processed text, wherein the special symbols are chosen from symbols that the neural network model cannot convert;
S14: inputting the first processed text into a binary-classification neural network model, outputting a 0/1 label sequence, and determining a confidence that the model can convert the first processed text, wherein 0 represents characters not to convert and 1 represents characters to convert;
S15: when the confidence is greater than or equal to a preset threshold, matching the label sequence against a first rule set, and performing text standardization conversion on the characters corresponding to the label 1 to obtain a second processed text;
when the confidence is smaller than the preset threshold, matching the first processed text against a second rule set, and performing text standardization conversion on the plain language text to obtain the second processed text, wherein the number of rules in the first rule set is smaller than that in the second rule set;
S16: replacing the special symbols in the second processed text with the cached at least one word, and determining a text standardization result of the plain language text.
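Steps S11 through S16 can be sketched end to end as follows. This is a hypothetical illustration, not the patent's implementation: the two rule sets and the tagging model are passed in as stand-in callables, and "□" (U+25A1) stands in for the special symbol, which the patent shows only as an image. The confidence formula is one interpretation of claim 7, chosen so that a mostly-zero label sequence can still score a high confidence, as in the worked example.

```python
# Hypothetical sketch of steps S11-S16; all names are illustrative.
PLACEHOLDER = "\u25a1"  # stand-in for the patent's special symbol

def mask_badcase(text, rules):
    """S12/S13: cache each phrase that hits a badcase matching rule and
    replace it with one placeholder character per original character."""
    cached = []
    for phrase in rules:
        if phrase in text:
            cached.append(phrase)
            text = text.replace(phrase, PLACEHOLDER * len(phrase))
    return text, cached

def normalize(text, badcase_rules, tag_model, threshold,
              simple_convert, complex_convert):
    masked, cached = mask_badcase(text, badcase_rules)
    probs = tag_model(masked)                        # S14: one prob per char
    labels = [1 if p > 0.5 else 0 for p in probs]
    # Confidence as mean per-character certainty (one reading of claim 7).
    confidence = sum(max(p, 1 - p) for p in probs) / len(probs)
    if confidence >= threshold:                      # S15: small, fast rule set
        out = simple_convert(masked, labels)
    else:                                            # S15: large, slow rule set
        out = complex_convert(masked)
    for phrase in cached:                            # S16: restore cached words
        out = out.replace(PLACEHOLDER * len(phrase), phrase, 1)
    return out
```

A caller would supply the real badcase rules, the trained tagger, and the two conversion rule sets in place of the stubs.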
For step S11: in the text normalization process after speech recognition, ITN (Inverse Text Normalization) is introduced to prevent some words from being rendered in the wrong written form. However, the ITN system may erroneously convert words that the existing rules cannot match, as well as newly emerging words, into other forms of text. When the system is in use, the customer-service side feeds back these ITN errors; the overall flow is shown in fig. 2. For example, song titles released by stars can be idiosyncratic: when the text entered by the user is recognized as containing the song "Ersan" ("two three"), an erroneous ITN conversion may turn it into the song "23", which is a clear error and unacceptable to the user.
Based on the ITN errors continuously collected from customer-service feedback, the badcase module can set these highly targeted, case-specific matching rules. Because the rules in the badcase module's text conversion matching rule set are all quite specific, their number is far smaller than the number of rules in existing methods.
For step S12, the plain language text to be standardized is determined. For example, in an interaction between a user and a smart speaker, speech recognition yields "I want to listen to a Wu-also song Ersan at three in the afternoon on December thirtieth, two zero one nine". The text is input to the badcase module, "Wu-also song Ersan" hits a matching rule in the text conversion matching rule set, and the phrase is cached for later use. The phrase is then given an inverse text mark:
I want to listen to a [Wu-also song Ersan] at three in the afternoon on December thirtieth, two zero one nine
In this embodiment, the plain language text includes at least: pure Chinese text without Arabic numerals, pure English text without Arabic numerals, and Chinese-English mixed text without Arabic numerals.
That is, the method can process pure-Chinese text, pure-English text, and mixed Chinese-English text; as long as a rule matches, the text can be processed.
For step S13, the inverse-text-marked characters in the badcase module's output are replaced with a corresponding number of special symbols. The special symbols are chosen from symbols that the neural network model cannot convert; the patent shows the symbol only as an image, so □ is used as a stand-in here. Such a symbol ensures that the downstream model is not affected. The replaced first processed text reads "I want to listen to a □…□ at three in the afternoon on December thirtieth, two zero one nine", where each character of the cached phrase has been replaced by one □.
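Because each marked character is replaced by exactly one symbol, the text keeps its length, so the character-level 0/1 label sequence produced later stays aligned with the original positions. A minimal sketch of this length-preserving replacement (□ is an assumed stand-in, since the patent shows the symbol only as an image):

```python
PLACEHOLDER = "\u25a1"  # assumed stand-in for the patent's special symbol

def mask_span(text, start, end):
    """Replace text[start:end] with one placeholder per character, keeping
    the overall length (and thus all character indices) unchanged."""
    return text[:start] + PLACEHOLDER * (end - start) + text[end:]
```
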
For step S14, the neural network model must first be trained. Compared with rule matching, a neural network has learning ability, so rules can be learned automatically from large amounts of data; this spares the manual work of analyzing data and writing rules. For example, the model may be a Transformer used as a binary classifier: the input is a sequence of word vectors, and for each character the model outputs a value between 0 and 1 representing the probability that the character should be converted; a probability above 0.5 labels the character 1, otherwise 0. There are no matching rules inside the neural network model; because the rules are learned automatically from large amounts of data, the model can handle the common matching patterns.
The vectors of the first processed text are determined and input into the model. For example, assume 200-dimensional word vectors and a maximum sentence length of 50 characters: shorter sentences are zero-padded, longer sentences are truncated, and the truncated part is input separately. The input is then a 50 x 200 vector matrix and the output a 50 x 1 vector, each element of which is the probability that the corresponding character is converted into an Arabic numeral.
The confidence measures whether the model can be trusted. It is derived from the per-character probabilities and compared with a preset threshold: if the confidence is greater than the threshold, the model is confident it can convert the text correctly; if it is smaller than the threshold, the model is less likely to be correct.
The conversion then proceeds through the steps above: for "I want to listen to a □…□ at three in the afternoon on December thirtieth, two zero one nine", the output label sequence is "00011110110110001000000000000" and the determined confidence is 0.98.
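Note that most labels in this example are 0 yet the confidence is 0.98, so the confidence cannot simply be the mean of the raw probabilities. One reading that reconciles the per-character probabilities of claim 7 with the worked example (an assumption, not stated verbatim in the patent) is the mean per-character certainty:

```python
def labels_and_confidence(probs):
    """Per-character probabilities -> 0/1 label string plus a confidence,
    taken here as the mean certainty max(p, 1 - p). Under this reading a
    mostly-zero label sequence can still yield a high confidence, matching
    the worked example."""
    labels = "".join("1" if p > 0.5 else "0" for p in probs)
    confidence = sum(max(p, 1 - p) for p in probs) / len(probs)
    return labels, confidence
```
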
For step S15, whether the model can be trusted to convert the text correctly is judged against the preset threshold. The determined confidence 0.98 is greater than the threshold 0.9, indicating the model is sufficiently confident. Pairing each character with its label gives: "i-0", "want-0", "two-1", "zero-1", "one-1", "nine-1", "year-0", "ten-1", "two-1", "month-0", "three-1", "ten-1", "day-0", "afternoon-0", "three-1", "point-0", "listen-0", "one-0", "head-0", and "□-0" for each special symbol. Collecting the consecutive label-1 runs, the corresponding texts are "two zero one nine", "twelve", "thirty" and "three". As an embodiment, the text standardization conversion comprises: converting the textual numeric characters corresponding to label 1 into Arabic numeral characters.
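Collecting the label-1 runs can be sketched with a small helper (illustrative, not from the patent): each maximal run of characters labeled "1" becomes one span for the first rule set to convert.

```python
def label_spans(text, labels):
    """Collect maximal runs of characters labeled '1'; each run is one
    span to be converted by the first rule set."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "1" and start is None:
            start = i                      # run begins
        elif lab == "0" and start is not None:
            spans.append(text[start:i])    # run ends
            start = None
    if start is not None:                  # run reaches end of text
        spans.append(text[start:])
    return spans
```
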
In the present embodiment, "two zero one nine" → "2019", "twelve" → "12", "thirty" → "30", and "three" → "3". Because the conversion rules in the first rule set are simple (their only function is converting numerals into Arabic digits), this part computes quickly and costs little time.
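The simple digit conversion performed by the first rule set can be sketched as follows. The helper is hypothetical and deliberately narrow: it covers only the digit-by-digit year reading and the simple 十 (ten) forms that appear in the example, not general Chinese number parsing.

```python
# Digit-by-digit Chinese numerals -> Arabic digits (illustrative sketch).
DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
          "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

def to_arabic(num):
    """二零一九 -> 2019 (digit by digit); 十二 -> 12, 三十 -> 30,
    二十三 -> 23 (simple tens forms). Larger units are out of scope."""
    if "十" in num:
        tens, _, units = num.partition("十")
        # A bare leading 十 means 1x (十二 = 12); a bare tail means x0 (三十 = 30).
        return DIGITS.get(tens, "1") + DIGITS.get(units, "0")
    return "".join(DIGITS[c] for c in num)
```
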
This yields "I want to listen to a □…□ at 3 pm on December 30, 2019".
If instead the confidence is 0.7, it is below the preset threshold 0.9, meaning the model is not confident it can convert the text correctly, so the neural network output is not used. In that case, "I want to listen to a □…□ at three in the afternoon on December thirtieth, two zero one nine" is input into the second rule set for matching.
There are many complex rules in the second rule set, such as:
date9=(${year}+)(${month}+)(${day_3})(${xiaoshi}?)(${minute}?)(${second}?);
_date9=(${date9})=>(_badcase,a0="$1",a1="$2",a2="$3",a3="$4",a4="$5",a5="$6");
export__date9=(${_date9})=>(_array,a0="$1");
With these more complex rules, a first processed text that the model was not confident enough to convert can be converted more reliably; the processing is relatively slow because the rules are more complex.
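The date9 rule above chains sub-patterns (${year}, ${month}, ${day_3}, plus optional time parts) into one date matcher. A rough Python-regex analog is sketched below; the sub-pattern definitions are simplified stand-ins, since the excerpt does not give the real definitions of ${year}, ${month}, ${day_3}, or ${xiaoshi}.

```python
import re

# Simplified stand-ins for the patent's ${year}, ${month}, ${day_3}, ${xiaoshi}:
PARTS = [
    r"(?P<year>[零一二三四五六七八九]{4}年)",          # e.g. 二零一九年
    r"(?P<month>十?[一二三四五六七八九]?月)",          # e.g. 十二月
    r"(?P<day>[一二三]?十?[一二三四五六七八九]?日)",   # e.g. 三十日
    r"(?P<hour>[一二三四五六七八九十]+点)?",           # optional, e.g. 三点
]
DATE9 = re.compile("".join(PARTS))
```

A hit on 二零一九年十二月三十日三点 captures the year/month/day/hour groups, which the rule's right-hand side (_badcase, a0="$1" ... a5="$6") would then hand to the conversion routine.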
This yields "I want to listen to 1 □…□ at 3 pm on December 30, 2019".
For step S16, the special symbols in the converted text are finally replaced with the cached words, giving the text standardization result of the plain language text. For example, "I want to listen to a □…□ at 3 pm on December 30, 2019" is replaced back to "I want to listen to a Wu-also song Ersan at 3 pm on December 30, 2019".
With this embodiment, without modifying the existing inverse text standardization method, the badcase module can quickly set a highly targeted, case-specific matching rule set, making the approach better suited to large-scale engineering data processing.
As an implementation manner, in this embodiment, after the determining the text normalization result of the plain text, the method further includes:
and when an ITN error fed back by the customer service to the text standardization result is received, extracting a new matching rule corresponding to the ITN error, and storing the new matching rule to the text conversion matching rule set so as to update the text conversion matching rule set.
In this embodiment, after the text standardization result is obtained by the above method, the user may evaluate it again; if an ITN error occurs at this point, a new conversion matching rule can be determined from the user's feedback and used to update the text conversion matching rule set.
Through this embodiment, by openly receiving users' error feedback, the text conversion matching rule set can be continuously updated and made more accurate.
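The feedback loop of this embodiment can be sketched as follows; the class and method names are illustrative only, not the patent's implementation.

```python
class BadcaseModule:
    """Holds the text conversion matching rule set and grows it from
    customer-service ITN error reports (illustrative sketch)."""

    def __init__(self, rules=()):
        self.rules = set(rules)

    def report_itn_error(self, wrong_phrase):
        """Extract a new matching rule from a reported ITN error and store
        it, updating the text conversion matching rule set."""
        self.rules.add(wrong_phrase)

    def hits(self, text):
        """Return the rules that match the given text."""
        return [r for r in self.rules if r in text]
```
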
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the text standardization processing method after the voice recognition in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
setting a text conversion matching rule set in the badcase module according to the ITN errors, fed back by customer service, that are collected in the badcase module;
inputting the plain language text to be standardized into the badcase module, caching at least one word of the plain language text when it hits a matching rule in the set, and outputting the text after inverse text marking;
replacing the at least one inverse-text-marked character in the text output by the badcase module with a corresponding number of special symbols to obtain a first processed text, wherein the special symbols are chosen from symbols that the neural network model cannot convert;
inputting the first processed text into a binary-classification neural network model, outputting a 0/1 label sequence, and determining a confidence that the model can convert the first processed text, wherein 0 represents characters not to convert and 1 represents characters to convert;
when the confidence is greater than or equal to a preset threshold, matching the label sequence against a first rule set, and performing text standardization conversion on the characters corresponding to the label 1 to obtain a second processed text;
when the confidence is smaller than the preset threshold, matching the first processed text against a second rule set, and performing text standardization conversion on the plain language text to obtain the second processed text, wherein the number of rules in the first rule set is smaller than that in the second rule set;
and replacing the special symbols in the second processed text with the cached at least one word, and determining a text standardization result of the plain language text.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-transitory computer-readable storage medium and, when executed by a processor, perform the text standardization processing method after speech recognition in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the text normalization processing method comprises at least one processor and a memory which is in communication connection with the at least one processor, wherein the memory stores instructions which can be executed by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the steps of the text normalization processing method after voice recognition according to any embodiment of the invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra-mobile personal computing devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, for example tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A text standardization processing method after speech recognition comprises the following steps:
setting a text conversion matching rule set in the badcase module according to the ITN errors, fed back by customer service, that are collected in the badcase module;
inputting the plain language text to be standardized after speech recognition into the badcase module, caching at least one word of the plain language text when it hits a matching rule in the set, and outputting the text after inverse text marking;
replacing the at least one inverse-text-marked character in the text output by the badcase module with a corresponding number of special symbols to obtain a first processed text, wherein the special symbols are chosen from symbols that the neural network model cannot convert;
inputting the first processed text into a binary-classification neural network model, outputting a 0/1 label sequence, and determining a confidence that the model can convert the first processed text, wherein 0 represents characters not to convert and 1 represents characters to convert;
when the confidence is greater than or equal to a preset threshold, matching the label sequence against a first rule set, and performing text standardization conversion on the characters corresponding to the label 1 to obtain a second processed text;
when the confidence is smaller than the preset threshold, matching the first processed text against a second rule set, and performing text standardization conversion on the plain language text to obtain the second processed text, wherein the number of rules in the first rule set is smaller than that in the second rule set;
and replacing the special symbols in the second processed text with the cached at least one word, and determining a text standardization result of the plain language text.
2. The method of claim 1, wherein performing the text standardization conversion on the characters labeled 1 comprises: converting the textual numeric characters labeled 1 into Arabic numeric characters.
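A minimal sketch of the conversion named in claim 2, turning Chinese textual numerals into Arabic digits. The value table and the positional handling of 十/百/千 are illustrative assumptions, not the claimed rule set:

```python
# Convert a Chinese numeral string (up to the thousands) into Arabic digits.
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def cn_to_arabic(text):
    total, digit = 0, 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            total += (digit or 1) * UNITS[ch]   # a bare 十 means 10
            digit = 0
    return str(total + digit)
```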
3. The method of claim 2, wherein the plain language text comprises at least: pure Chinese text without Arabic numerals, pure English text without Arabic numerals, and Chinese-English mixed text without Arabic numerals.
4. The method of claim 1, wherein after the determining the text standardization result of the plain language text, the method further comprises:
when an ITN error fed back by customer service on the text standardization result is received, extracting a new matching rule corresponding to the ITN error and storing it in the text conversion matching rule set, so as to update the text conversion matching rule set.
5. The method of claim 1, wherein the inverse text mark comprises: brackets [ ].
6. The method of claim 1, wherein the special symbol comprises the symbol shown in Figure FDA0002351561560000021.
7. The method of claim 1, wherein the outputting a 0/1 label sequence comprises: when the probability value output by the model for a character is greater than a preset label threshold, the character's label is 1; otherwise, the label is 0;
the outputting a 0/1 label sequence and determining a confidence level that the model can convert the first processed text comprises:
determining the probability that each character in the first processed text can be converted;
determining the confidence level of the first processed text based on the mean of the probabilities that the respective characters can be converted.
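The labeling and confidence rules of claim 7 reduce to a few lines. The probability vector and the 0.5 label threshold below are illustrative assumptions:

```python
# Claim 7 as code: label 1 when a character's probability exceeds the label
# threshold, and take the mean of all per-character probabilities as the
# confidence level for the whole first processed text.
def label_and_confidence(probs, label_threshold=0.5):
    labels = [1 if p > label_threshold else 0 for p in probs]
    confidence = sum(probs) / len(probs)
    return labels, confidence
```

Using the mean rather than, say, the minimum means a single uncertain character does not by itself force the text onto the slower second rule set; only broadly low model certainty does.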
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
9. A storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN201911417452.9A 2019-12-31 2019-12-31 Text standardization processing method after voice recognition Active CN111090970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911417452.9A CN111090970B (en) 2019-12-31 2019-12-31 Text standardization processing method after voice recognition

Publications (2)

Publication Number Publication Date
CN111090970A true CN111090970A (en) 2020-05-01
CN111090970B CN111090970B (en) 2023-05-12

Family

ID=70398660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911417452.9A Active CN111090970B (en) 2019-12-31 2019-12-31 Text standardization processing method after voice recognition

Country Status (1)

Country Link
CN (1) CN111090970B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489652A (en) * 2020-12-10 2021-03-12 北京有竹居网络技术有限公司 Text acquisition method and device for voice information and storage medium
CN112687265A (en) * 2020-12-28 2021-04-20 苏州思必驰信息科技有限公司 Method and system for standardizing reverse text

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147767A (en) * 2018-08-16 2019-01-04 平安科技(深圳)有限公司 Digit recognition method, device, computer equipment and storage medium in voice
CN110223675A (en) * 2019-06-13 2019-09-10 苏州思必驰信息科技有限公司 The screening technique and system of training text data for speech recognition

Also Published As

Publication number Publication date
CN111090970B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN110110041B (en) Wrong word correcting method, wrong word correcting device, computer device and storage medium
JP6909832B2 (en) Methods, devices, equipment and media for recognizing important words in audio
CN110795938B (en) Text sequence word segmentation method, device and storage medium
Hyvärinen Information theory for systems engineers
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN101669116A (en) Recognition architecture for generating asian characters
CN108932218B (en) Instance extension method, device, equipment and medium
CN112861527A (en) Event extraction method, device, equipment and storage medium
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN111859919A (en) Text error correction model training method and device, electronic equipment and storage medium
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN112686051B (en) Semantic recognition model training method, recognition method, electronic device and storage medium
CN111090970A (en) Text standardization processing method after speech recognition
CN114218945A (en) Entity identification method, device, server and storage medium
CN111723207B (en) Intention identification method and system
CN111128122B (en) Method and system for optimizing rhythm prediction model
CN113486178A (en) Text recognition model training method, text recognition device and medium
CN111401012A (en) Text error correction method, electronic device and computer readable storage medium
CN114492396A (en) Text error correction method for automobile proper nouns and readable storage medium
CN112749544B (en) Training method and system of paragraph segmentation model
Yessenbayev et al. KazNLP: A pipeline for automated processing of texts written in Kazakh language
CN111462734B (en) Semantic slot filling model training method and system
CN110750967B (en) Pronunciation labeling method and device, computer equipment and storage medium
CN111310473A (en) Text error correction method and model training method and device thereof
CN114842982B (en) Knowledge expression method, device and system for medical information system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant