CN115809641A - ASR text error correction method, model, device, electronic equipment and storage medium - Google Patents

ASR text error correction method, model, device, electronic equipment and storage medium

Info

Publication number
CN115809641A
CN115809641A (application number CN202211524451.6A)
Authority
CN
China
Prior art keywords
text
error correction
unit
pinyin
unit text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211524451.6A
Other languages
Chinese (zh)
Inventor
杨稷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Changan Automobile Co Ltd
Original Assignee
Chongqing Changan Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Changan Automobile Co Ltd filed Critical Chongqing Changan Automobile Co Ltd
Priority to CN202211524451.6A priority Critical patent/CN115809641A/en
Publication of CN115809641A publication Critical patent/CN115809641A/en
Pending legal-status Critical Current


Landscapes

  • Machine Translation (AREA)

Abstract

The application provides an ASR text error correction method, which comprises: obtaining a speech recognition text; inputting the speech recognition text into a trained error correction model for error correction prediction to obtain an error correction prediction result for each unit text; acquiring the set of unit texts with Top-k scores as the error correction candidate set of the unit text; obtaining the error correction result of each unit text; and synthesizing the error correction results of all unit texts to obtain the error correction result of the speech recognition text. By performing error correction prediction on the speech recognition text to obtain a prediction result for each unit text, then correcting each unit text individually, and synthesizing the unit-text error correction results, the method alleviates the tendency of the BERT model to over-correct. By adopting an end-to-end ASR (Automatic Speech Recognition) error correction method based on a BERT (Bidirectional Encoder Representations from Transformers) model, it also avoids the problem that errors in a pipeline architecture are amplified stage by stage and degrade the final error correction effect.

Description

ASR text error correction method, model, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of text error correction technologies, and in particular, to an ASR text error correction method, model, apparatus, electronic device, and storage medium.
Background
Text error correction is one of the basic tasks of natural language processing; its main content is the detection and correction of spelling errors in text. Its application scenarios are very wide, such as input method error correction, text proofreading, ASR (Automatic Speech Recognition) text error correction, and the like. Moreover, text error correction generally serves as an upstream task, so its performance directly influences the final effect of downstream tasks. Common types of text error correction tasks include harmonic (homophone) words, confusable words, similar-shape word errors, word completion, and the like; for different application scenarios, these error types do not necessarily all occur together. Because of its importance, text error correction remains a research hot spot today. Early work mainly performed text error correction with a pipeline architecture of error detection, candidate recall and candidate ranking; for example, Chinese patents CN201510767379.3, CN201610976879.2 and CN201710817047.0 are text error correction methods based on this architecture. This approach is intuitive to implement, strongly interpretable, and its modular structure makes modules easy to upgrade and replace, but the pipeline structure amplifies errors stage by stage, affecting the final effect of the error correction method, and the longer the serial chain, the more time is consumed. Later, with the rise of deep learning, attention turned to end-to-end text error correction methods. These improve the text error correction effect to a certain extent, but deep learning models need a large amount of training data to learn model parameters, labeling data is time-consuming and labor-intensive with extremely high labor cost, and, compared with traditional text error correction methods, such models are prone to over-correction.
With the rapid development in recent years of voice products such as smart speakers and voice-controlled home appliances, speech recognition technology has also improved rapidly and the recognition effect keeps getting better; however, the complexity of the usage scenarios of voice products still easily leads to misrecognition, whose main error types are harmonic words and confusable-sound words. ASR error correction is therefore a subset of text error correction: the solution approach is consistent with text error correction, and the difficulties faced are similar. A new speech recognition error correction method is urgently needed to address these pain points.
Disclosure of Invention
In view of the shortcomings of the prior art described above, the present application provides a solution to these technical problems.
The application provides an ASR text error correction method, which comprises: obtaining a speech recognition text;
inputting the speech recognition text into a trained error correction model for error correction prediction to obtain an error correction prediction result for each unit text;
acquiring the set of unit texts with Top-k scores as the error correction candidate set of the unit text;
if the error correction prediction result of the unit text exists in the error correction candidate set of the unit text, taking the prediction result of the unit text as the error correction result of the unit text; if the error correction prediction result of the unit text does not exist in the candidate error correction set of the unit text, selecting the unit text with the highest score in the candidate error correction set of the unit text as the error correction result of the unit text;
and synthesizing the error correction results of all the unit texts to obtain the error correction result of the voice recognition text.
In an embodiment of the present application, inputting the speech recognition text into a trained error correction model for error correction prediction to obtain an error correction prediction result for each unit text includes: encoding the speech recognition text to obtain speech recognition text features, wherein the features comprise Chinese character information and pinyin information; detecting unit text errors and outputting unit text error probabilities to guide error correction of the unit texts; integrating and masking the speech recognition text features and the unit text error probabilities, and sending the result to a predictor; and obtaining the error correction prediction result of the unit text according to the integrated information.
In an embodiment of the present application, the training of the error correction model includes: obtaining training data, wherein the training data comprises characters and pinyin corresponding to the characters, and the pinyin comprises initials, finals and tones; inputting the training data to a pinyin BERT to train the pinyin BERT; the character BERT and the trained pinyin BERT are combined to train the detector so as to improve the accuracy of the detector for detecting unit text errors.
In an embodiment of the application, inputting the training data to the pinyin BERT to train the pinyin BERT includes: inputting pinyin of a text to predict Chinese characters of the text; inputting the correct pinyin of the text, and comparing the similarity between the correct pinyin of the text and the predicted pinyin of the text.
In an embodiment of the present application, the loss function of the pinyin BERT is:
L_total = L_a + L_b + L_similarity
where L_a denotes the cross-entropy loss of predicting the Chinese characters of sentence a from its pinyin; L_b denotes the cross-entropy loss of predicting the Chinese characters of sentence b from its pinyin; and L_similarity denotes the cross-entropy loss on the similarity of the pinyin representations of sentence a and sentence b. The cross entropy is calculated as:
L = -Σ_i y_i · log(ŷ_i)
where y denotes the true label and ŷ denotes the prediction result.
In an embodiment of the present application, the training of the error correction model uses the loss:
L_all = α·L_correct + (1-α)·L_contrast
where L_correct denotes the error correction loss (cross-entropy loss is used here); L_contrast denotes the contrast loss; and α denotes the weight of the error correction loss, with 1-α the weight of the contrast loss, a hyperparameter with value range (0, 1). L_contrast is calculated as:
L_contrast = (1/K) · Σ_{j=1..K} max(0, margin - (p⁺ - p_j⁻))
where K is the number of negative samples compared against the positive sample; p⁺ denotes the prediction probability of the positive sample; p_j⁻ is the prediction probability of the j-th negative sample; and margin is the required separation between positive and negative samples, with value range [0, 1]; the larger the value, the higher the discrimination between positive and negative samples.
The present application further provides an ASR text correction model, the model comprising:
the encoder encodes the text to obtain text characteristics; the encoder comprises a Chinese character BERT and a pinyin BERT, the Chinese character BERT and the pinyin BERT respectively encode Chinese character information and pinyin information of a text, and the two parts are added to be used as output of the encoder;
a detector, which detects the positions where errors occur and guides the predictor to correct them; the output of the detector is the probability that each word is a wrong word, and the output vector dimension is 1 × N_in, where N_in denotes the number of input words;
a masker, which integrates the outputs of the encoder and the detector by the following formula:
E_soft-masked = (1-p)·E_encode + p·E_mask
where p is the output of the detector, representing the probability that each word is a wrong word; E_encode is the output of the encoder, a matrix representing the result of encoding each word with context and pinyin; and E_mask is a vector marking a word as a wrong word;
a predictor, which predicts the correct word corresponding to each word; the predictor uses E_soft-masked and E_encode as inputs, and its output dimension is N_in × N_vocab, where N_vocab denotes the size of the dictionary.
The application also provides an ASR text error correction device, which comprises:
the voice recognition text acquisition module acquires the voice recognition text;
the text error correction module is used for carrying out error correction prediction on the voice recognition text to obtain an error correction prediction result of each unit text; comparing the error correction prediction result of the unit text with the error correction candidate set of the unit text, and if the error correction prediction result of the unit text exists in the error correction candidate set of the unit text, taking the prediction result of the unit text as the error correction result of the unit text; if the error correction prediction result of the unit text does not exist in the candidate error correction set of the unit text, selecting the unit text with the highest score in the candidate error correction set of the unit text as the error correction result of the unit text; and integrating the error correction results of all the unit texts to obtain the error correction result of the voice recognition text.
The present application further provides an electronic device, the electronic device including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the ASR text correction method as in any one of the above.
The present application also provides a computer-readable storage medium, having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform an ASR text correction method as claimed in any one of the above.
The beneficial effects of this application: by performing error correction prediction on the speech recognition text to obtain a prediction result for each unit text, then correcting each unit text individually and synthesizing the unit-text error correction results, the method alleviates the tendency of the BERT model to over-correct. By adopting an end-to-end ASR error correction method based on a BERT model, it also avoids the problem that errors in a pipeline architecture are amplified stage by stage and degrade the final error correction effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic diagram of an error correction model shown in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of an ASR text error correction method shown in an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary embodiment of step S220 in the embodiment of FIG. 2;
FIG. 4 is a diagram illustrating model training in accordance with an exemplary embodiment;
FIG. 5 is a diagram illustrating Pinyin BERT training in accordance with an exemplary embodiment of the present application;
FIG. 6 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the disclosure herein, wherein the embodiments of the present application will be described in detail with reference to the accompanying drawings and preferred embodiments. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It should be understood that the preferred embodiments are for purposes of illustration only and are not intended to limit the scope of the present disclosure.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present application, and the drawings only show the components related to the present application and are not drawn according to the number, shape and size of the components in actual implementation, and the type, number and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In the following description, numerous details are set forth to provide a more thorough explanation of the embodiments of the present application, however, it will be apparent to one skilled in the art that the embodiments of the present application may be practiced without these specific details, and in other embodiments, well-known structures and devices are shown in block diagram form rather than in detail in order to avoid obscuring the embodiments of the present application.
Many Chinese-related production scenarios involve text error correction, such as speech or text dialogue with various robots, scanning PDFs or pictures with a mobile phone, or typing with an input method while chatting with someone; errors may occur in the speech information recognized by ASR, in the picture information recognized by OCR (Optical Character Recognition), or in the text a user actually enters through the input method. These errors affect the readability of the text, hinder understanding by both people and machines, and, if left unprocessed, propagate to subsequent links and affect the effect of downstream tasks. Common Chinese error types include: a) similar pronunciation (harmonic words); b) similar character shape; c) reversed word order; d) full pinyin or pinyin abbreviations in place of characters; e) words that do not fit the context.
The proportions of these error types differ across scenes, but any of them degrades the quality of the text and hinders reading comprehension by people or machines. Most models are trained on high-quality Chinese data in which these error types are absent from the training corpus, so if such errors exist in the input text at inference time, the model's understanding is affected to a certain extent and its inference effect is greatly reduced.
ASR refers to Automatic Speech Recognition, a technology for converting human speech into text. Speech recognition is a multidisciplinary field tightly connected to many disciplines, such as acoustics, phonetics, linguistics, digital signal processing theory, information theory and computer science. Due to the diversity and complexity of speech signals, speech recognition systems can only achieve satisfactory performance under certain constraints, or can only be used in certain specific situations. The performance of a speech recognition system depends roughly on several types of factors: the size of the recognition vocabulary and the complexity of the speech; the quality of the speech signal; whether there is a single speaker or multiple speakers; and the hardware. With the rapid development of voice products such as smart speakers and voice-controlled home appliances in recent years, speech recognition technology has also advanced rapidly and the recognition effect keeps improving, but misrecognition still occurs easily because of the complexity of the usage scenarios of voice products.
Referring to FIG. 2, FIG. 2 is a schematic diagram of an ASR text error correction method according to an exemplary embodiment of the present application, which is described in detail below;
and step S210, obtaining a voice recognition text.
The speech recognition text of the user is collected by a speech collection device.
Step S220, performing error correction prediction on the speech recognition text with the trained error correction model to obtain an error correction prediction result for each unit text.
Before error correction prediction is performed, an error correction model needs to be constructed through preprocessed data. In an exemplary embodiment, the data preprocessing comprises two parts of data annotation and data enhancement. The data marking needs to mark all pinyins corresponding to the training texts, wherein the pinyins comprise initials, finals and tones. The wrong word in the text is marked as 1 and the correct word is marked as 0. In one embodiment, the method of harmonic word and confusion word replacement is used for data enhancement, so that a data set is enlarged, the generalization capability of a model is enhanced, overfitting of the model is avoided, and finally, original data and enhanced data are integrated and randomly segmented into a training set and a verification set.
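For illustration, the following Python sketch implements the annotation and enhancement just described; the use of the pypinyin library and the confusion_set lookup table are assumptions of the sketch, not part of the original disclosure.

    import random
    from pypinyin import pinyin, Style  # assumed library for pinyin annotation

    def annotate(text, wrong_positions=()):
        # Label each character with initial, final and tone, plus a 0/1 tag:
        # wrong words are marked 1, correct words 0.
        initials = pinyin(text, style=Style.INITIALS, strict=False)
        finals = pinyin(text, style=Style.FINALS, strict=False)
        tones = pinyin(text, style=Style.TONE3)
        rows = []
        for i, ch in enumerate(text):
            t = tones[i][0]
            rows.append({
                "char": ch,
                "initial": initials[i][0],
                "final": finals[i][0],
                "tone": t[-1] if t[-1].isdigit() else "5",  # 5 = neutral tone
                "label": 1 if i in wrong_positions else 0,
            })
        return rows

    def augment(text, confusion_set, p=0.15):
        # Enlarge the data set by replacing characters with harmonic or
        # confusable characters; returns the new text and the wrong positions.
        chars, wrong = list(text), []
        for i, ch in enumerate(chars):
            if ch in confusion_set and random.random() < p:
                chars[i] = random.choice(confusion_set[ch])
                wrong.append(i)
        return "".join(chars), wrong

The original and augmented samples would then be merged and randomly split into the training and validation sets, as described above.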
As shown in fig. 1, fig. 1 is a schematic diagram of an error correction model according to an exemplary embodiment of the present application; the model introduces pinyin information to optimize the ability to detect harmonic words and confusable words. The ASR text correction model shown in fig. 1 mainly consists of an encoder, a detector, a masker and a predictor. The encoder encodes the sentence to obtain sentence features; it comprises a Chinese character BERT and a pinyin BERT, which encode the character information and the pinyin information of the sentence respectively, and the two parts are added to serve as the output of the encoder.
The detector detects the positions where errors occur, thereby guiding the predictor to correct them and further alleviating the BERT over-correction problem. The detector can be a fully connected neural network, an LSTM, a GRU or another network, provided that the final output vector dimension is 1 × N_in, where N_in denotes the number of input words. The output of the detector is the probability that each word is a wrong word; the closer to 1, the more likely the word is a wrong word.
The outputs of the encoder and the detector are then integrated by the masker module according to the following formula:
E_soft-masked = (1-p)·E_encode + p·E_mask
where p is the output of the detector, a vector representing the probability that each word is a wrong word; E_encode is the output of the encoder, a matrix representing the result of encoding each word with context and pinyin; and E_mask is a vector that marks a word as a wrong word.
The predictor predicts the correct word corresponding to each word; this module uses E_soft-masked and E_encode as inputs and outputs a matrix of dimension N_in × N_vocab, where N_vocab denotes the size of the dictionary. Like the detector, the predictor can be composed of different network architectures, as long as the input-output design is satisfied.
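For illustration, the following PyTorch sketch wires the four modules together under the stated input/output constraints; the GRU detector, the residual combination feeding the predictor, and all sizes are assumptions of the sketch, not fixed by the text above.

    import torch
    import torch.nn as nn

    class SoftMaskedCorrector(nn.Module):
        # A sketch only: hanzi_bert and pinyin_bert are assumed to be
        # pretrained encoders mapping id sequences to (batch, N_in, hidden)
        # representations; hidden sizes and the GRU detector are illustrative.
        def __init__(self, hanzi_bert, pinyin_bert, hidden, vocab_size):
            super().__init__()
            self.hanzi_bert = hanzi_bert
            self.pinyin_bert = pinyin_bert
            self.detector = nn.GRU(hidden, hidden // 2,
                                   batch_first=True, bidirectional=True)
            self.detect_head = nn.Linear(hidden, 1)  # 1 x N_in error probabilities
            self.e_mask = nn.Parameter(torch.randn(hidden))  # wrong-word marker vector
            self.predictor = nn.Linear(hidden, vocab_size)   # N_in x N_vocab output

        def forward(self, char_ids, pinyin_ids):
            # Encoder: add the Chinese character encoding and the pinyin encoding.
            e_encode = self.hanzi_bert(char_ids) + self.pinyin_bert(pinyin_ids)
            # Detector: probability that each word is a wrong word (near 1 = wrong).
            h, _ = self.detector(e_encode)
            p = torch.sigmoid(self.detect_head(h))           # (batch, N_in, 1)
            # Masker: E_soft-masked = (1 - p) * E_encode + p * E_mask
            e_soft = (1 - p) * e_encode + p * self.e_mask
            # Predictor: uses both E_soft-masked and E_encode (a residual sum
            # is one plausible way to feed both, and is an assumption here).
            logits = self.predictor(e_soft + e_encode)       # (batch, N_in, N_vocab)
            return p.squeeze(-1), logits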
Referring to fig. 3, fig. 3 is a schematic diagram of an exemplary embodiment of step S220 in the embodiment shown in fig. 2, and shows an exemplary error correction prediction result flow, which is described in detail as follows.
Step S310, encoding the speech recognition text to obtain speech recognition text features, wherein the features comprise Chinese character information and pinyin information.
The encoder encodes the speech recognition text to obtain sentence characteristics, including encoding word information and pinyin information of the sentence, and adding the two parts to serve as the output of the encoder.
Step S320, detecting unit text errors and outputting unit text error probabilities to guide error correction of the unit texts.
The detector detects the positions of errors, thereby guiding the predictor to correct them and alleviating the BERT over-correction problem. The detector outputs, for each unit text (i.e. each word), the probability that it is a wrong word; the closer the probability is to 1, the more likely the word is wrong.
Step S330, integrating and masking the speech recognition text features and the unit text error probabilities, and sending the result to the predictor.
The outputs of the encoder and the detector are integrated by the masker and then output.
Step S340, obtaining the error correction prediction result of the unit text according to the integrated information.
The predictor uses E_soft-masked and E_encode as inputs and outputs a matrix of dimension N_in × N_vocab, where N_vocab denotes the size of the dictionary.
After the model is built, it needs to be trained. The model includes a pinyin BERT that encodes the pinyin of a sentence to help the model capture pronunciation features. First, the pinyin BERT is trained separately on the training data with a multi-task method comprising two training tasks: one inputs the pinyin of a sentence to predict the sentence's Chinese characters; the other inputs the pinyin of a sentence pair to calculate the similarity of the two sentences. Note that the second task uses BERT's sentence-level output. The loss function of the pinyin BERT is therefore given by the following formula:
L_total = L_a + L_b + L_similarity
where L_a denotes the cross-entropy loss of predicting the Chinese characters of sentence a from its pinyin; L_b denotes the cross-entropy loss of predicting the Chinese characters of sentence b from its pinyin; and L_similarity denotes the cross-entropy loss on the similarity of the pinyin representations of sentence a and sentence b. The cross entropy is calculated as:
L = -Σ_i y_i · log(ŷ_i)
where y denotes the true label and ŷ denotes the prediction result.
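For illustration, a minimal sketch of assembling L_total in PyTorch, assuming the two character-prediction tasks emit (batch, N_in, N_vocab) logits and the similarity task is scored as a binary classification on BERT's sentence-level output; these shapes are assumptions of the sketch.

    import torch.nn.functional as F

    def pinyin_bert_loss(logits_a, chars_a, logits_b, chars_b,
                         sim_logits, sim_labels):
        # L_a, L_b: cross entropy of predicting Chinese characters from pinyin
        # for sentence a and sentence b; cross_entropy expects (batch, C, N_in).
        l_a = F.cross_entropy(logits_a.transpose(1, 2), chars_a)
        l_b = F.cross_entropy(logits_b.transpose(1, 2), chars_b)
        # L_similarity: cross entropy on whether the pinyin representations
        # of the two sentences are similar (treated here as two classes).
        l_sim = F.cross_entropy(sim_logits, sim_labels)
        return l_a + l_b + l_sim  # L_total = L_a + L_b + L_similarity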
Then the pre-trained Chinese character BERT and the trained pinyin BERT are combined as the encoder to train the detector. When the detector is trained, the bottom-layer parameters of the pinyin BERT are frozen and receive no learning updates. The loss function used for detector training is also cross entropy.
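The freezing step can be expressed in a few lines against the SoftMaskedCorrector sketch above; treating all pinyin BERT parameters as the frozen bottom layer is a simplification, since the exact layer cut-off is not specified here.

    # Freeze the pinyin BERT so it receives no learning updates while
    # the detector is trained.
    for param in model.pinyin_bert.parameters():
        param.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5)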
Finally, the error correction model is trained. Since BERT-based error correction models tend to focus attention on high-frequency words, contrastive learning is introduced to alleviate this problem; contrastive learning also accelerates model fitting. In addition to the traditional one-to-one error correction loss, the training method adds a contrast loss, calculated as follows:
L_all = α·L_correct + (1-α)·L_contrast
where L_correct denotes the error correction loss (cross-entropy loss is used here); L_contrast denotes the contrast loss; and α denotes the weight of the error correction loss, with 1-α the weight of the contrast loss, a hyperparameter with value range (0, 1). L_contrast is calculated as:
L_contrast = (1/K) · Σ_{j=1..K} max(0, margin - (p⁺ - p_j⁻))
where K is the number of negative samples compared against the positive sample; p⁺ denotes the prediction probability of the positive sample; p_j⁻ is the prediction probability of the j-th negative sample; and margin is the required separation between positive and negative samples, with value range [0, 1]; the larger the value, the higher the discrimination between positive and negative samples. The sampling rule for negative samples is: apart from the positive sample, the other samples among the Top-K probability scores.
To obtain the optimal model parameters, the model is trained with the training data and tested with the validation set, and the model that performs best on the validation set is saved.
Step S230, acquiring the set of unit texts with Top-k scores as the error correction candidate set of the unit text.
Step S240, judging whether the error correction prediction result of the unit text is in the error correction candidate set of the unit text.
By obtaining the error correction prediction result for each word, a word-by-word error correction is subsequently performed.
Step S250, if the error correction prediction result of the unit text exists in the error correction candidate set of the unit text, the prediction result of the unit text is used as the error correction result of the unit text; and if the error correction prediction result of the unit text does not exist in the candidate error correction set of the unit text, selecting the unit text with the highest score in the candidate error correction set of the unit text as the error correction result of the unit text.
Step S260, integrating the error correction results of all unit texts to obtain the error correction result of the speech recognition text.
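Steps S230 to S260 amount to a per-word gating rule, sketched below; how the Top-k candidate scores are produced (for example, restricted to harmonic and confusable words) is an assumption left open by the text above.

    def correct_unit(pred_id, cand_scores):
        # pred_id: the model's error correction prediction for one unit text.
        # cand_scores: {candidate_id: score}, the Top-k error correction
        # candidate set for this unit text.
        if pred_id in cand_scores:
            return pred_id                            # prediction confirmed by candidates
        return max(cand_scores, key=cand_scores.get)  # fall back to best candidate

    def correct_sentence(pred_ids, cand_sets):
        # Apply the rule unit text by unit text, then synthesize the result.
        return [correct_unit(p, c) for p, c in zip(pred_ids, cand_sets)]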
The ASR error correction method adopted by the application avoids the drawbacks of a pipeline architecture and alleviates the BERT model's tendency to over-correct. Meanwhile, the contrastive-learning training method optimizes BERT error correction, which otherwise tends toward high-frequency words and requires a large amount of training data to avoid overfitting. The application also introduces pinyin information for targeted optimization of the speech recognition text error correction scenario, so that attention is focused on the candidate set of harmonic words and confusable-sound words during error correction, improving error correction accuracy.
The application also provides an ASR text error correction device, which comprises: the voice recognition text acquisition module acquires the voice recognition text;
the text error correction module is used for carrying out error correction prediction on the voice recognition text to obtain an error correction prediction result of each unit text; comparing the error correction prediction result of the unit text with the error correction candidate set of the unit text, and if the error correction prediction result of the unit text exists in the error correction candidate set of the unit text, taking the prediction result of the unit text as the error correction result of the unit text; if the error correction prediction result of the unit text does not exist in the candidate error correction set of the unit text, selecting the unit text with the highest score in the candidate error correction set of the unit text as the error correction result of the unit text; and synthesizing the error correction results of all the unit texts to obtain the error correction result of the voice recognition text.
It should be noted that the ASR text error correction apparatus provided in the foregoing embodiment and the ASR text error correction method provided in the foregoing embodiment belong to the same concept, and the specific ways in which the respective modules and units perform operations have been described in detail in the method embodiment and are not repeated here. In practical applications, the ASR text error correction apparatus provided in the above embodiment may distribute the above functions among different functional modules as needed, that is, divide the internal structure of the apparatus into different functional modules to complete all or part of the functions described above, which is not limited herein.
An embodiment of the present application further provides an electronic device, including: one or more processors; a storage device, configured to store one or more programs, which when executed by the one or more processors, cause the electronic device to implement the ASR text correction method provided in the various embodiments described above.
FIG. 6 illustrates a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application. It should be noted that the computer system 600 of the electronic device shown in fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various suitable actions and processes, such as executing the method described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 602 or a program loaded from a storage portion 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for system operation are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An Input/Output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the Central Processing Unit (CPU) 601, the various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may comprise a propagated data signal with a computer-readable computer program embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
Another aspect of the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor of a computer, causes the computer to perform the ASR text error correction method as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist separately without being incorporated in the electronic device.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the ASR text error correction method provided in the various embodiments described above.
The above-described embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (10)

1. An ASR text correction method, the method comprising:
obtaining a voice recognition text;
inputting the voice recognition text into a trained error correction model for error correction prediction to obtain an error correction prediction result of each unit text;
acquiring the set of unit texts with Top-k scores as the error correction candidate set of the unit text;
if the error correction prediction result of the unit text exists in the error correction candidate set of the unit text, taking the prediction result of the unit text as the error correction result of the unit text; if the error correction prediction result of the unit text does not exist in the candidate error correction set of the unit text, selecting the unit text with the highest score in the candidate error correction set of the unit text as the error correction result of the unit text;
and synthesizing the error correction results of all the unit texts to obtain the error correction result of the voice recognition text.
2. The ASR text correction method according to claim 1, wherein inputting the speech recognition text into a trained correction model for correction prediction to obtain a correction prediction result for each unit text, comprises:
coding a voice recognition text to obtain voice recognition text characteristics, wherein the voice recognition text characteristics comprise Chinese character information and pinyin information;
detecting unit text errors, and outputting unit text error probability to guide error correction of the unit texts;
performing integrated masking on the speech recognition text features and the unit text error probability, and sending the integrated masking to a predictor;
and obtaining the error correction prediction result of the unit text according to the integration information.
3. The ASR text correction method of claim 2, wherein the training of the correction model comprises:
obtaining training data, wherein the training data comprises characters and pinyin corresponding to the characters, and the pinyin comprises initials, finals and tones;
inputting the training data to a pinyin BERT to train the pinyin BERT;
the character BERT and the trained pinyin BERT are combined to train the detector so as to improve the accuracy of the detector for detecting unit text errors.
4. The ASR text correction method of claim 3 wherein inputting the training data to a pinyin BERT to train the pinyin BERT comprises:
inputting pinyin of a text to predict Chinese characters of the text;
inputting the correct pinyin of the text, and comparing the similarity between the correct pinyin of the text and the predicted pinyin of the text.
5. The ASR text correction method of claim 4, wherein the loss function of the pinyin BERT is:
L_total = L_a + L_b + L_similarity
wherein L_a denotes the cross-entropy loss of predicting the Chinese characters of sentence a from its pinyin; L_b denotes the cross-entropy loss of predicting the Chinese characters of sentence b from its pinyin; and L_similarity denotes the cross-entropy loss on the similarity of the pinyin representations of sentence a and sentence b; the cross entropy is calculated as:
L = -Σ_i y_i · log(ŷ_i)
wherein y denotes the true label and ŷ denotes the prediction result.
6. The ASR text correction method of claim 3, wherein the training of the correction model comprises:
L_all = α·L_correct + (1-α)·L_contrast
wherein L_correct denotes the error correction loss (cross-entropy loss is used here); L_contrast denotes the contrast loss; and α denotes the weight of the error correction loss, with 1-α the weight of the contrast loss, a hyperparameter with value range (0, 1); L_contrast is calculated as:
L_contrast = (1/K) · Σ_{j=1..K} max(0, margin - (p⁺ - p_j⁻))
wherein K is the number of negative samples compared against the positive sample; p⁺ denotes the prediction probability of the positive sample; p_j⁻ is the prediction probability of the j-th negative sample; and margin is the required separation between positive and negative samples, with value range [0, 1]; the larger the value, the higher the discrimination between positive and negative samples.
7. An ASR text error correction model, the model comprising:
the encoder encodes the text to obtain text characteristics; the encoder comprises a Chinese character BERT and a pinyin BERT, the Chinese character BERT and the pinyin BERT respectively encode Chinese character information and pinyin information of a text, and the two parts are added to be used as output of the encoder;
a detector, which detects the positions where errors occur and guides the predictor to correct them; the output of the detector is the probability that each word is a wrong word, and the output vector dimension is 1 × N_in, wherein N_in denotes the number of input words;
a masker, which integrates the outputs of the encoder and the detector by the following formula:
E_soft-masked = (1-p)·E_encode + p·E_mask
wherein p is the output of the detector, representing the probability that each word is a wrong word; E_encode is the output of the encoder, a matrix representing the result of encoding each word with context and pinyin; and E_mask is a vector marking a word as a wrong word;
a predictor, which predicts the correct word corresponding to each word; the predictor uses E_soft-masked and E_encode as inputs, and its output dimension is N_in × N_vocab, wherein N_vocab denotes the size of the dictionary.
8. An ASR text correction apparatus, comprising:
the voice recognition text acquisition module acquires the voice recognition text;
the text error correction module is used for carrying out error correction prediction on the voice recognition text to obtain an error correction prediction result of each unit text; comparing the error correction prediction result of the unit text with the error correction candidate set of the unit text, and if the error correction prediction result of the unit text exists in the error correction candidate set of the unit text, taking the prediction result of the unit text as the error correction result of the unit text; if the error correction prediction result of the unit text does not exist in the candidate error correction set of the unit text, selecting the unit text with the highest score in the candidate error correction set of the unit text as the error correction result of the unit text; and synthesizing the error correction results of all the unit texts to obtain the error correction result of the voice recognition text.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the ASR text correction method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the ASR text correction method of any of claims 1-7.
CN202211524451.6A 2022-11-30 2022-11-30 ASR text error correction method, model, device, electronic equipment and storage medium Pending CN115809641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211524451.6A CN115809641A (en) 2022-11-30 2022-11-30 ASR text error correction method, model, device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115809641A 2023-03-17

Family

ID=85484596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211524451.6A Pending CN115809641A (en) 2022-11-30 2022-11-30 ASR text error correction method, model, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115809641A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination