CN115101072A - Voice recognition processing method and device - Google Patents

Voice recognition processing method and device

Info

Publication number
CN115101072A
CN115101072A
Authority
CN
China
Prior art keywords
transcription
characters
error
character
candidate character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210633839.3A
Other languages
Chinese (zh)
Inventor
李清涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210633839.3A priority Critical patent/CN115101072A/en
Publication of CN115101072A publication Critical patent/CN115101072A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition processing method and device, relating to the field of computer technology. One embodiment of the method comprises: in response to detecting that voice information has been input, performing speech recognition transcription on the voice information to obtain a transcription result, and invoking an error detector to detect erroneous characters in the transcription result; invoking a pre-trained language model, masking the positions of the erroneous characters in the transcription result, predicting candidate characters at the masked positions, and calculating the probability of each candidate character appearing at the masked positions; calculating the pronunciation-feature distance between each candidate character and the erroneous character, and combining it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character; and selecting the target candidate character with the largest selection probability value, replacing the erroneous character with the target candidate character, obtaining the corrected transcribed text, and returning it for display. The embodiment takes both context semantics and pronunciation features into account, achieving effective error correction of speech recognition transcription results.

Description

Voice recognition processing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for speech recognition processing.
Background
The quality of speech recognition is usually measured by the recognition error rate, i.e. the word error rate (WER). Although automatic speech recognition (ASR) technology keeps advancing and the amount of available data keeps growing, the transcription results of ASR systems still contain many errors involving similar pronunciation and semantics. Post-processing of the speech recognition system is therefore very important, and error correction can effectively address such phonetic-semantic recognition errors.
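For reference only (this sketch is illustrative and not part of the original disclosure), WER is conventionally computed as the token-level edit distance between the recognized text and the reference, divided by the reference length; for Chinese, tokens are usually characters:

```python
# Minimal WER sketch: Levenshtein distance between hypothesis and reference
# token sequences divided by the reference length.
def wer(reference: list, hypothesis: list) -> float:
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(m, 1)

# one substituted character out of seven -> WER of about 0.143
print(wer(list("很高兴为您服务"), list("很高行为您服务")))
```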
Current error correction methods fall mainly into two categories: 1) error correction based on grammar rules, which mainly uses hand-written grammar rules and corrects the ASR transcription result by rule matching; 2) semantic error correction using a large-scale pre-trained model from natural language processing, for example correcting the ASR transcription result with BERT (Bidirectional Encoder Representations from Transformers, introduced in "Pre-training of Deep Bidirectional Transformers for Language Understanding"), an open-source pre-trained language model built on large-scale corpora.
In the process of implementing the invention, the inventor found that the prior art has at least the following problems: method 1 can only correct errors that match the grammar rules and does not consider phonetic-semantic information; method 2 ignores phonetic information and cannot effectively correct words with identical or similar pronunciations.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech recognition processing method and apparatus, which can at least solve the problem in the prior art of ineffective error correction caused by failing to consider phonetic and semantic features.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a speech recognition processing method including:
in response to detecting that voice information has been input, performing speech recognition transcription on the voice information to obtain a transcription result, and invoking an error detector to detect erroneous characters in the transcription result;
invoking a pre-trained language model, masking the positions of the erroneous characters in the transcription result, predicting candidate characters at the masked positions, and calculating the probability of each candidate character appearing at the masked positions;
calculating the pronunciation-feature distance between each candidate character and the erroneous character, and combining it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character;
and selecting the target candidate character with the largest selection probability value and replacing the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
Optionally, before the invoking an error detector to detect erroneous characters in the transcription result, the method further includes:
obtaining a sample speech information set, and invoking a speech recognition program to perform speech recognition transcription on each piece of sample speech information to obtain a plurality of sample transcription results;
receiving labeling operations on the erroneous characters in each sample transcription result, and fine-tuning the pre-trained language model with the sample speech information and the correspondingly labeled erroneous characters as input to obtain the error detector;
wherein fine-tuning the pre-trained language model comprises: adding a linear-layer neural network on top of the trained pre-trained language model, and adjusting only the parameters of the linear-layer neural network during training.
Optionally, the calculating the pronunciation-feature distance between each candidate character and the erroneous character includes:
obtaining the first initial, first final, and first tone of the erroneous character in Hanyu Pinyin, and the second initial, second final, and second tone of the candidate character in Hanyu Pinyin;
and calculating the Euclidean distance between the first initial and the second initial, the Euclidean distance between the first final and the second final, and the pronunciation tone distance between the first tone and the second tone, and accumulating them to obtain the pronunciation-feature distance between the candidate character and the erroneous character.
Optionally, the obtaining the corrected transcribed text and returning it for display further includes:
invoking a rule-transcription program, and looking up the characters of the transcribed text in a rule-transcription dictionary, the rule-transcription dictionary containing characters to be transcribed as defined by custom grammar rules;
and in response to the lookup result being a hit, determining the replacement characters in the rule-transcription dictionary that correspond to the characters in the transcribed text, so as to replace the characters in the transcribed text with the replacement characters.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a speech recognition processing apparatus including:
a detection module configured to, in response to detecting that voice information has been input, perform speech recognition transcription on the voice information to obtain a transcription result, and invoke an error detector to detect erroneous characters in the transcription result;
a prediction module configured to invoke a pre-trained language model, mask the positions of the erroneous characters in the transcription result, predict candidate characters at the masked positions, and calculate the probability of each candidate character appearing at the masked positions;
a calculation module configured to calculate the pronunciation-feature distance between each candidate character and the erroneous character, and combine it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character;
and a replacement module configured to select the target candidate character with the largest selection probability value and replace the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
Optionally, the apparatus further comprises a training module configured to:
obtain a sample speech information set, and invoke a speech recognition program to perform speech recognition transcription on each piece of sample speech information to obtain a plurality of sample transcription results;
receive labeling operations on the erroneous characters in each sample transcription result, and fine-tune the pre-trained language model with the sample speech information and the correspondingly labeled erroneous characters as input to obtain the error detector;
wherein fine-tuning the pre-trained language model comprises: adding a linear-layer neural network on top of the trained pre-trained language model, and adjusting only the parameters of the linear-layer neural network during training.
Optionally, the calculation module is configured to:
obtain the first initial, first final, and first tone of the erroneous character in Hanyu Pinyin, and the second initial, second final, and second tone of the candidate character in Hanyu Pinyin;
and calculate the Euclidean distance between the first initial and the second initial, the Euclidean distance between the first final and the second final, and the pronunciation tone distance between the first tone and the second tone, and accumulate them to obtain the pronunciation-feature distance between the candidate character and the erroneous character.
Optionally, the replacement module is further configured to:
invoke a rule-transcription program, and look up the characters of the transcribed text in a rule-transcription dictionary, the rule-transcription dictionary containing characters to be transcribed as defined by custom grammar rules;
and in response to the lookup result being a hit, determine the replacement characters in the rule-transcription dictionary that correspond to the characters in the transcribed text, so as to replace the characters in the transcribed text with the replacement characters.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic device for speech recognition processing.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the speech recognition processing methods described above.
To achieve the above object, according to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing any of the speech recognition processing methods described above.
Among the solutions provided by the present invention, one embodiment has the following advantages or beneficial effects: a layer of neural network is added on top of the pre-trained language model and trained to obtain an error detector, which performs error detection on the ASR transcription result to obtain erroneous characters; the pre-trained language model is invoked to take context-semantic features into account and predict candidate characters at the positions of the erroneous characters; the pronunciation features of each character are taken into account, and pronunciation features and context semantics are effectively fused through a balance function to reorder the candidate characters; finally, grammar-rule matching is performed in light of the application scenario to obtain the final speech recognition transcription result.
Further effects of the above optional features will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic flow chart of a speech recognition processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a specific speech recognition processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main blocks of a speech recognition processing apparatus according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 5 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention and the features of the embodiments may be combined with each other in the absence of conflict. In the technical solution of the present invention, the acquisition, storage, use, and processing of data comply with the relevant provisions of national laws and regulations.
The transcription results of existing ASR systems exhibit recognition errors involving similar pronunciation or semantics. Semantically similar errors include, for example, confusion among the Chinese function words "的", "地", and "得" (all pronounced "de"; "地" also means "ground"); likewise, a sentence that originally contains the word "中午" (noon) may be transcribed as a homophonous string, which post-processing must correct back to "中午". Pronunciation-based recognition errors include near-homophonous word pairs, rendered here in translation as "come again"/"come from" and "travel"/"go back"; such words are prone to recognition errors and require post-processing for error correction.
Referring to fig. 1, a main flowchart of a speech recognition processing method according to an embodiment of the present invention is shown, including the following steps:
S101: in response to detecting that voice information has been input, performing speech recognition transcription on the voice information to obtain a transcription result, and invoking an error detector to detect erroneous characters in the transcription result;
S102: invoking a pre-trained language model, masking the positions of the erroneous characters in the transcription result, predicting candidate characters at the masked positions, and calculating the probability of each candidate character appearing at the masked positions;
S103: calculating the pronunciation-feature distance between each candidate character and the erroneous character, and combining it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character;
S104: selecting the target candidate character with the largest selection probability value and replacing the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
In the above embodiment, as to step S101, the solution has a relatively wide range of application, covering terminals such as intelligent customer service, smart speakers, smart vehicles, and intelligent dialogue robots: wherever speech on the terminal is converted into text, i.e. wherever speech recognition technology is used. In addition, a key technique of the solution is obtaining the pronunciation features of Pinyin, so the solution is preferably applied to Chinese speech recognition scenarios.
A user inputs information at a terminal; the information may be text, pictures, video, or speech. When the terminal detects that the input information is speech, it invokes automatic speech recognition (ASR) to perform speech recognition on the speech information and obtain a transcription result. An error detector is then used to detect erroneous characters in the transcription result.
Before this, a binary-classification error detector needs to be constructed, where, for example, 1 indicates an erroneous character and 0 indicates a correct character. Training a model from scratch typically 1) requires large amounts of data, computation time, and computing resources, and 2) carries risks such as model non-convergence, insufficiently optimized parameters, low accuracy, poor generalization, and susceptibility to overfitting. This solution instead fine-tunes a pre-trained model (a model already trained on a data set): for the binary classification task, a linear-layer neural network is added on top of the existing trained BERT model, and during training only the parameters of the newly added linear layer are updated while the original BERT parameters are left unchanged. The network structure does not need to be modified, and the problems above are effectively avoided.
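The following is a minimal sketch of this fine-tuning setup, assuming the Hugging Face transformers library with PyTorch; the checkpoint name bert-base-chinese and the token-level binary head are illustrative assumptions, not details specified by the patent:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ErrorDetector(nn.Module):
    """Frozen BERT encoder plus a trainable linear layer for per-token
    binary classification (1 = erroneous character, 0 = correct)."""
    def __init__(self, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        for p in self.bert.parameters():
            p.requires_grad = False  # original BERT parameters stay fixed
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # per-token logits

# Only the new layer's parameters are handed to the optimizer:
detector = ErrorDetector()
optimizer = torch.optim.Adam(detector.classifier.parameters(), lr=1e-4)
```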
Specifically, a sample speech information set is obtained, and ASR is invoked to perform speech recognition on each piece of sample speech information, yielding a number of sample transcription results. The erroneous Chinese characters in the sample transcription results are labeled manually, and the BERT model is fine-tuned with the sample speech information set and the labeled erroneous characters in the transcription results as positive examples, yielding the error detector. The BERT model is a pre-trained language model applicable to many languages; it is adopted here because this solution mainly concerns Chinese speech recognition.
As to step S102, after the error detector is invoked to detect the erroneous characters in the current transcription result, the positions of the erroneous characters in the transcription result are determined and the input at those positions is represented by MASK. In computer science and digital logic, a mask is a string of binary digits that selects specified positions through bitwise operations; here, MASK denotes the masked-prediction mechanism used in BERT. The characters at the masked positions are then re-predicted by the BERT model, yielding a series of candidate characters as the candidate set for each position.
For example, given the detection result { 0 1 0 0 0 0 1 0 0 0 1 }, the first erroneous character yields candidate set 1 { candidate 1, candidate 2, candidate 3, ..., candidate 10 }, the second erroneous character yields candidate set 2 { candidate 11, candidate 12, candidate 13 }, and the third erroneous character yields candidate set 3 { candidate 14, candidate 15, candidate 16, candidate 17 }.
It should be noted that the input of the error detector is the transcription result of speech recognition and its output is a classification result, i.e. whether each character is erroneous, whereas the input of the BERT masked-language model is the sentence with the detected positions masked; although both build on BERT, they are two BERT models with entirely different purposes. The masked positions are predicted using the BERT model's context-modeling capability: the prediction at each position is a probability vector over all tokens in the vocabulary used during training. For example, if the vocabulary contains 10 tokens, the model outputs 10 numbers, each a probability between 0 and 1 indicating how likely the position is to be that character, e.g. 0.5 for A, 0.2 for B, and 0.3 for C. This yields candidate character 1 with probability 1, candidate character 2 with probability 2, ..., candidate character 10 with probability 10.
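A sketch of this candidate prediction step, assuming the transformers fill-mask pipeline; the bert-base-chinese checkpoint is an assumed stand-in for the patent's pre-trained language model:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# The detected erroneous character is replaced with [MASK]; the pipeline
# returns the top candidates and their probabilities at that position.
for cand in fill_mask("很高[MASK]为您服务", top_k=10):
    print(cand["token_str"], round(cand["score"], 3))
```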
As to step S103, the open-source method DimSim (an open-source Chinese phonetic-similarity algorithm based on learned high-dimensional encodings) is used to calculate the pronunciation-feature distance between each candidate character in the candidate set and the erroneous transcription character. The formula is:

S(c1, c2) = S_initial(c1, c2) + S_final(c1, c2) + S_tone(c1, c2)

where S_initial, S_final, and S_tone denote the distance components over the initial, the final, and the tone part of the characters' Hanyu Pinyin, respectively. For S_initial and S_final, this solution preferably uses the Euclidean distance, the most intuitively understood distance measure, namely the straight-line distance between two points in space; S_tone is the pronunciation tone distance.

The tone distance of homophones is 0, and S_initial, S_final, and S_tone are each greater than or equal to 0, so their sum S(character 1, character 2) is also greater than or equal to 0; the more clearly the pronunciations of two characters differ, the larger their pronunciation-feature distance.
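A self-contained sketch of this distance, following the S = S_initial + S_final + S_tone decomposition above; the coordinate tables are toy one-dimensional values (where the Euclidean distance reduces to an absolute difference), not the learned high-dimensional encodings that DimSim actually uses:

```python
# Toy coordinates for a few initials and finals; a real system would cover
# the full Pinyin inventory with learned encodings.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
INITIAL_COORD = {"x": 0.0, "sh": 1.5, "s": 1.2}
FINAL_COORD = {"ing": 0.0, "in": 0.4, "eng": 0.9}

def split_pinyin(syllable: str):
    """Split a tone3-style syllable such as 'xing2' into (initial, final, tone)."""
    tone = int(syllable[-1])
    body = syllable[:-1]
    for ini in INITIALS:  # two-letter initials are matched first
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable

def pron_distance(p1: str, p2: str) -> float:
    i1, f1, t1 = split_pinyin(p1)
    i2, f2, t2 = split_pinyin(p2)
    s_initial = abs(INITIAL_COORD.get(i1, 0.0) - INITIAL_COORD.get(i2, 0.0))
    s_final = abs(FINAL_COORD.get(f1, 0.0) - FINAL_COORD.get(f2, 0.0))
    s_tone = abs(t1 - t2)  # 0 for homophones
    return s_initial + s_final + s_tone

# 行 (xing2) vs 兴 (xing4): same initial and final, tones differ -> 2.0
print(pron_distance("xing2", "xing4"))
```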
For example, if the input speech is "很高兴为您服务" ("very happy to serve you") and the ASR system recognizes "兴" as "行", the error detector is invoked and detects the erroneous character "行". Assuming the candidate characters include "兴", the pronunciation-feature distance between the two characters is calculated as S(xing2, xing4), where "行" (xing2) is the erroneous character and "兴" (xing4) is the candidate character. The Euclidean distance of the initials and that of the finals are both 0, the pronunciation tone distance is 2, and the pronunciation-feature distance of the two characters is therefore 2.
The BERT model of the prior art provides feature representations of semantic information, while DimSim provides representations of Chinese Pinyin features and can therefore supply effective complementary features for acoustic tasks such as speech recognition. Existing error-correction methods essentially consider only semantic features; fusing pronunciation features into semantic error correction therefore yields a marked improvement in speech recognition error correction.
A selection probability value is calculated for each candidate character through a balance function, which balances the semantic features and the pronunciation features of the candidate and erroneous characters. For the selection of the final candidate character, the balance function is defined as:

Φ(P_candidate, S(c_error, c_candidate)) = P_candidate × exp(−α × S(c_error, c_candidate))

where P_candidate is the probability, predicted from context by the BERT model, that the candidate character appears at the masked position; S(c_error, c_candidate) is the pronunciation-feature distance between each candidate character and the erroneous character; and α is a hyperparameter, specified from empirical values, that serves as the balance parameter.
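A minimal sketch of the balance function; α and the example (probability, distance) pairs are assumed values for illustration, not empirical settings from the patent:

```python
import math

def balance(p_candidate: float, pron_distance: float, alpha: float = 0.5) -> float:
    """Selection probability value fusing the semantic probability with the
    pronunciation-feature distance: Phi(P, S) = P * exp(-alpha * S)."""
    return p_candidate * math.exp(-alpha * pron_distance)

# A candidate with BERT probability 0.25 at pronunciation distance 2.0
# versus one with probability 0.10 at distance 1.0:
print(round(balance(0.25, 2.0), 3))  # 0.092
print(round(balance(0.10, 1.0), 3))  # 0.061
```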
After the selection probability value of each candidate character is obtained, each candidate set can be reordered to determine the target candidate character with the largest selection probability value, and the erroneous character is then replaced with the target candidate character to obtain the pronunciation-corrected transcribed text. It should be noted that the transcribed text is the final result of speech recognition: no matter how many masked positions there are, one utterance ultimately yields exactly one transcribed text, i.e. the final recognition result of that speech.
Taking the same sentence "很高兴为您服务" as an example, the ASR result misrecognizes it, and the error detector flags the characters transcribed in place of "兴" and "服" as recognition errors. Taking "行" (the misrecognition of "兴") as an example, the candidate set predicted by BERT at the masked position is the four characters { 行 (xing2), 性 (xing4), 星 (xing1), 兴 (xing4) }, with occurrence probabilities { 0.4, 0.25, 0.1, 0.25 } respectively, 0.4 being the largest. After the feature calculation is performed on each candidate in the set, new selection probabilities such as { 0.2, 0.1, 0.15, 0.55 } may be obtained, achieving the goal of reordering the candidates; the candidate with the largest selection probability value, "兴", is selected, and "行" is replaced with "兴".
Through the above steps, transcribed text with a significantly reduced error rate is obtained, so the speech recognition result seen by the user is more accurate. In addition, in practical application scenarios the speech recognition result consists of ordinary characters, and some scenarios require further conversion. Therefore, words to be transcribed are customized for different scenarios through grammar rules, generating a rule-transcription dictionary, e.g. Chinese homophone-slang mappings rendered in translation as cups -> tragedy (杯具 -> 悲剧), mushroom -> cry (香菇 -> 想哭), girl paper -> girl (妹纸 -> 妹子), porridge -> like (稀饭 -> 喜欢), and even -> me (偶 -> 我).
After the transcribed text is obtained through the above steps, it is fed as input into the rule-transcription program, which uses the rule-transcription dictionary to determine the replacement characters corresponding to characters in the transcribed text and performs the replacement, yielding the final output. For example: 偶稀饭你 -> 我喜欢你 ("even porridge you" -> "I like you"); 杯具啊 -> 悲剧啊 ("cup ah" -> "tragedy ah"); 妹纸香菇 -> 妹子想哭 ("girl paper mushroom" -> "the girl wants to cry").
The above covers cases where a match exists, such as 杯具, 香菇, and 妹纸. When there is no match, e.g. for an input such as "I want to travel", no characters in the rule-transcription dictionary match, indicating that the transcribed text requires no further processing.
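A simple sketch of this rule-transcription step; the dictionary entries are the illustrative slang mappings above, and a production system would additionally handle overlapping entries (e.g. longest match first):

```python
RULE_DICT = {
    "杯具": "悲剧",  # 'cup' -> 'tragedy'
    "香菇": "想哭",  # 'mushroom' -> 'want to cry'
    "稀饭": "喜欢",  # 'porridge' -> 'like'
    "偶": "我",      # 'even' -> 'me'
}

def rule_transcribe(text: str) -> str:
    """Replace every dictionary hit; text with no match passes through unchanged."""
    for src, dst in RULE_DICT.items():
        text = text.replace(src, dst)
    return text

print(rule_transcribe("偶稀饭你"))    # -> 我喜欢你
print(rule_transcribe("我想去旅行"))  # no match -> unchanged
```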
In the method provided by this embodiment, the error detector is trained to perform error detection on the ASR transcription result; candidate characters are predicted at the positions of the erroneous characters using the BERT model's context-semantic features; and the candidates are reordered by incorporating the pronunciation features of each character. The transcription result is thereby corrected effectively and systematically, reducing the word error rate of the transcribed text.
Referring to fig. 2, a flow chart of a specific speech recognition processing method according to an embodiment of the present invention is shown, including the following steps:
S201: obtaining a sample speech information set, and invoking a speech recognition program to perform speech recognition transcription on each piece of sample speech information to obtain a plurality of sample transcription results;
S202: receiving labeling operations on the erroneous characters in each sample transcription result, and fine-tuning the pre-trained language model with the sample speech information and the correspondingly labeled erroneous characters as input to obtain an error detector; wherein fine-tuning the pre-trained language model comprises: adding a linear-layer neural network on top of the trained pre-trained language model, and adjusting only the parameters of the linear-layer neural network during training;
S203: in response to detecting that voice information has been input, performing speech recognition transcription on the voice information to obtain a transcription result, and invoking the error detector to detect erroneous characters in the transcription result;
S204: invoking the pre-trained language model, masking the positions of the erroneous characters in the transcription result, predicting candidate characters at the masked positions, and calculating the probability of each candidate character appearing at the masked positions;
S205: obtaining the first initial, first final, and first tone of the erroneous character in Hanyu Pinyin, and the second initial, second final, and second tone of the candidate character in Hanyu Pinyin;
S206: calculating the Euclidean distance between the first initial and the second initial, the Euclidean distance between the first final and the second final, and the pronunciation tone distance between the first tone and the second tone, and accumulating them to obtain the pronunciation-feature distance between the candidate character and the erroneous character;
S207: obtaining a selection probability value for each candidate character based on the pronunciation-feature distance and the probability of each candidate character appearing at the masked positions;
S208: selecting the target candidate character with the largest selection probability value and replacing the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
Referring to FIG. 3, which shows the main modules of a speech recognition processing apparatus 300 according to an embodiment of the present invention, including:
a detection module 301 configured to, in response to detecting that voice information has been input, perform speech recognition transcription on the voice information to obtain a transcription result, and invoke an error detector to detect erroneous characters in the transcription result;
a prediction module 302 configured to invoke a pre-trained language model, mask the positions of the erroneous characters in the transcription result, predict candidate characters at the masked positions, and calculate the probability of each candidate character appearing at the masked positions;
a calculation module 303 configured to calculate the pronunciation-feature distance between each candidate character and the erroneous character, and combine it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character;
and a replacement module 304 configured to select the target candidate character with the largest selection probability value and replace the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
The apparatus implementing the present invention further comprises a training module configured to:
obtain a sample speech information set, and invoke a speech recognition program to perform speech recognition transcription on each piece of sample speech information to obtain a plurality of sample transcription results;
receive labeling operations on the erroneous characters in each sample transcription result, and fine-tune the pre-trained language model with the sample speech information and the correspondingly labeled erroneous characters as input to obtain the error detector;
wherein fine-tuning the pre-trained language model comprises: adding a linear-layer neural network on top of the trained pre-trained language model, and adjusting only the parameters of the linear-layer neural network during training.
In the apparatus implementing the present invention, the calculation module 303 is configured to:
obtain the first initial, first final, and first tone of the erroneous character in Hanyu Pinyin, and the second initial, second final, and second tone of the candidate character in Hanyu Pinyin;
and calculate the Euclidean distance between the first initial and the second initial, the Euclidean distance between the first final and the second final, and the pronunciation tone distance between the first tone and the second tone, and accumulate them to obtain the pronunciation-feature distance between the candidate character and the erroneous character.
In the apparatus implementing the present invention, the replacement module 304 is further configured to:
invoke a rule-transcription program, and look up the characters of the transcribed text in a rule-transcription dictionary, the rule-transcription dictionary containing characters to be transcribed as defined by custom grammar rules;
and in response to the lookup result being a hit, determine the replacement characters in the rule-transcription dictionary that correspond to the characters in the transcribed text, so as to replace the characters in the transcribed text with the replacement characters.
In addition, since the specific implementation of the apparatus according to the embodiment of the present invention has been described in detail in the method above, the repeated description is omitted here.
Fig. 4 shows an exemplary system architecture 400 to which embodiments of the invention may be applied, including terminal devices 401, 402, 403, a network 404 and a server 405 (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having display screens and supporting web browsing, and are installed with various communication client applications, and users may interact with the server 405 through the network 404 using the terminal devices 401, 402, 403 to receive or transmit messages, and the like.
The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
The server 405 may be a server providing various services. It should be noted that the method provided by the embodiment of the present invention is generally executed by the server 405, and accordingly the apparatus is generally disposed in the server 405, specifically executing: in response to detecting that voice information has been input, performing speech recognition transcription on the voice information to obtain a transcription result, and invoking an error detector to detect erroneous characters in the transcription result; invoking a pre-trained language model, masking the positions of the erroneous characters in the transcription result, predicting candidate characters at the masked positions, and calculating the probability of each candidate character appearing at the masked positions; calculating the pronunciation-feature distance between each candidate character and the erroneous character, and combining it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character; and selecting the target candidate character with the largest selection probability value and replacing the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing a terminal device of an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a detection module, a prediction module, a calculation module, and a replacement module. Where the names of these modules do not in some cases constitute a limitation on the modules themselves, for example, a replacement module may also be described as a "character replacement module".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not assembled into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform any of the speech recognition processing methods described above.
According to the technical solution of the embodiment of the present invention, a layer of neural network is added on top of a pre-trained language model and trained to obtain an error detector, which performs error detection on the ASR transcription result to obtain erroneous characters; the pre-trained language model is invoked to take context-semantic features into account and predict candidate characters at the positions of the erroneous characters; the pronunciation features of each character are taken into account, and pronunciation features and context semantics are effectively fused through a balance function to reorder the candidate characters; finally, grammar-rule matching is performed in light of the application scenario to obtain the final speech recognition transcription result.
The above-described embodiments are not intended to limit the scope of the present invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech recognition processing method, comprising:
in response to detecting that voice information has been input, performing speech recognition transcription on the voice information to obtain a transcription result, and invoking an error detector to detect erroneous characters in the transcription result;
invoking a pre-trained language model, masking the positions of the erroneous characters in the transcription result, predicting candidate characters at the masked positions, and calculating the probability of each candidate character appearing at the masked positions;
calculating the pronunciation-feature distance between each candidate character and the erroneous character, and combining it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character;
and selecting the target candidate character with the largest selection probability value and replacing the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
2. The method of claim 1, wherein, before the invoking an error detector to detect erroneous characters in the transcription result, the method further comprises:
obtaining a sample speech information set, and invoking a speech recognition program to perform speech recognition transcription on each piece of sample speech information to obtain a plurality of sample transcription results;
receiving labeling operations on the erroneous characters in each sample transcription result, and fine-tuning the pre-trained language model with the sample speech information and the correspondingly labeled erroneous characters as input to obtain the error detector;
wherein fine-tuning the pre-trained language model comprises: adding a linear-layer neural network on top of the trained pre-trained language model, and adjusting only the parameters of the linear-layer neural network during training.
3. The method according to claim 1 or 2, wherein the calculating the pronunciation-feature distance between each candidate character and the erroneous character comprises:
obtaining the first initial, first final, and first tone of the erroneous character in Hanyu Pinyin, and the second initial, second final, and second tone of the candidate character in Hanyu Pinyin;
and calculating the Euclidean distance between the first initial and the second initial, the Euclidean distance between the first final and the second final, and the pronunciation tone distance between the first tone and the second tone, and accumulating them to obtain the pronunciation-feature distance between the candidate character and the erroneous character.
4. The method of claim 1, wherein the obtaining the corrected transcribed text and returning it for display further comprises:
invoking a rule-transcription program, and looking up the characters of the transcribed text in a rule-transcription dictionary, the rule-transcription dictionary containing characters to be transcribed as defined by custom grammar rules;
and in response to the lookup result being a hit, determining the replacement characters in the rule-transcription dictionary that correspond to the characters in the transcribed text, so as to replace the characters in the transcribed text with the replacement characters.
5. A speech recognition processing apparatus, comprising:
a detection module configured to, in response to detecting that voice information has been input, perform speech recognition transcription on the voice information to obtain a transcription result, and invoke an error detector to detect erroneous characters in the transcription result;
a prediction module configured to invoke a pre-trained language model, mask the positions of the erroneous characters in the transcription result, predict candidate characters at the masked positions, and calculate the probability of each candidate character appearing at the masked positions;
a calculation module configured to calculate the pronunciation-feature distance between each candidate character and the erroneous character, and combine it with the probability of each candidate character appearing at the masked positions to obtain a selection probability value for each candidate character;
and a replacement module configured to select the target candidate character with the largest selection probability value and replace the erroneous character with the target candidate character, so as to obtain the corrected transcribed text and return it for display.
6. The apparatus of claim 5, further comprising a training module configured to:
obtain a sample speech information set, and invoke a speech recognition program to perform speech recognition transcription on each piece of sample speech information to obtain a plurality of sample transcription results;
receive labeling operations on the erroneous characters in each sample transcription result, and fine-tune the pre-trained language model with the sample speech information and the correspondingly labeled erroneous characters as input to obtain the error detector;
wherein fine-tuning the pre-trained language model comprises: adding a linear-layer neural network on top of the trained pre-trained language model, and adjusting only the parameters of the linear-layer neural network during training.
7. The apparatus of claim 5 or 6, wherein the calculation module is configured to:
obtain the first initial, first final, and first tone of the erroneous character in Hanyu Pinyin, and the second initial, second final, and second tone of the candidate character in Hanyu Pinyin;
and calculate the Euclidean distance between the first initial and the second initial, the Euclidean distance between the first final and the second final, and the pronunciation tone distance between the first tone and the second tone, and accumulate them to obtain the pronunciation-feature distance between the candidate character and the erroneous character.
8. The apparatus of claim 5, wherein the replacement module is further configured to:
invoke a rule-transcription program, and look up the characters of the transcribed text in a rule-transcription dictionary, the rule-transcription dictionary containing characters to be transcribed as defined by custom grammar rules;
and in response to the lookup result being a hit, determine the replacement characters in the rule-transcription dictionary that correspond to the characters in the transcribed text, so as to replace the characters in the transcribed text with the replacement characters.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN202210633839.3A 2022-06-07 2022-06-07 Voice recognition processing method and device Pending CN115101072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210633839.3A CN115101072A (en) 2022-06-07 2022-06-07 Voice recognition processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210633839.3A CN115101072A (en) 2022-06-07 2022-06-07 Voice recognition processing method and device

Publications (1)

Publication Number Publication Date
CN115101072A true CN115101072A (en) 2022-09-23

Family

ID=83289097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210633839.3A Pending CN115101072A (en) 2022-06-07 2022-06-07 Voice recognition processing method and device

Country Status (1)

Country Link
CN (1) CN115101072A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713934A (en) * 2022-11-30 2023-02-24 中移互联网有限公司 Error correction method, device, equipment and medium for converting voice into text
CN115713934B (en) * 2022-11-30 2023-08-15 中移互联网有限公司 Error correction method, device, equipment and medium for converting voice into text

Similar Documents

Publication Publication Date Title
US10262062B2 (en) Natural language system question classifier, semantic representations, and logical form templates
JP5901001B1 (en) Method and device for acoustic language model training
US9123333B2 (en) Minimum bayesian risk methods for automatic speech recognition
US20210390260A1 (en) Method, apparatus, device and storage medium for matching semantics
CN112036162B (en) Text error correction adaptation method and device, electronic equipment and storage medium
CN110415679B (en) Voice error correction method, device, equipment and storage medium
CN110019742B (en) Method and device for processing information
JP7266683B2 (en) Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction
US20230185834A1 (en) Data manufacturing frameworks for synthesizing synthetic training data to facilitate training a natural language to logical form model
CN112016275A (en) Intelligent error correction method and system for voice recognition text and electronic equipment
CN111326144B (en) Voice data processing method, device, medium and computing equipment
WO2017210095A2 (en) No loss-optimization for weighted transducer
CN113470619A (en) Speech recognition method, apparatus, medium, and device
KR20220128397A (en) Alphanumeric Sequence Biasing for Automatic Speech Recognition
CN112346696A (en) Speech comparison of virtual assistants
KR102608867B1 (en) Method for industry text increment, apparatus thereof, and computer program stored in medium
CN115101072A (en) Voice recognition processing method and device
CN113051895A (en) Method, apparatus, electronic device, medium, and program product for speech recognition
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
JP7348447B2 (en) Speaker diarization correction method and system utilizing text-based speaker change detection
CN115620726A (en) Voice text generation method, and training method and device of voice text generation model
CN114898754B (en) Decoding image generation method, decoding image generation device, speech recognition method, speech recognition device, electronic device and storage medium
CN111639483B (en) Evaluation aspect determining method and device
US20240061834A1 (en) Detecting out-of-domain, out-of-scope, and confusion-span (oocs) input for a natural language to logical form model
US20240062021A1 (en) Calibrating confidence scores of a machine learning model trained as a natural language interface

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination