CN112542162A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and readable storage medium

Info

Publication number
CN112542162A
Authority
CN
China
Prior art keywords
probability
candidate
candidate sentence
sentence
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011402934.XA
Other languages
Chinese (zh)
Other versions
CN112542162B (en)
Inventor
赖勇铨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd filed Critical China Citic Bank Corp Ltd
Priority to CN202011402934.XA priority Critical patent/CN112542162B/en
Publication of CN112542162A publication Critical patent/CN112542162A/en
Application granted granted Critical
Publication of CN112542162B publication Critical patent/CN112542162B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 - Probabilistic grammars, e.g. word n-grams
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 2015/081 - Search algorithms, e.g. Baum-Welch or Viterbi
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech recognition method and apparatus, an electronic device and a readable storage medium, applied to the technical field of speech recognition. In the method, a pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence produced by beam search is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular to a speech recognition method and apparatus, an electronic device, and a readable storage medium.
Background
Beam search is a breadth-first heuristic search algorithm used in path search. Suppose there are three nodes, each of which can take one of the values a, b or c; the possible paths then include aaa, aab, aac and so on, 27 in total. For the sake of efficiency and storage space, the beam search algorithm expands breadth-first while maintaining a candidate list whose capacity is at most w; w is also called the beam width.
For the above problem, let w = 2, i.e., after each search step the list keeps only the two most probable paths. A complete beam search then proceeds as follows. The first step considers a, b and c, selects the two with the highest probability (say, b and c), ranks them from high to low, and writes them into the list. The second step considers the six extensions ba, bb, bc, ca, cb and cc, selects the two with the highest probability (say, bc and ca), ranks them from high to low, and updates the list. The third step considers the six extensions bca, bcb, bcc, caa, cab and cac, and again keeps the two most probable (say, caa and cac). The search then ends, and caa and cac are output as the final beam search result.
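The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation; the step_prob function, which supplies the probability of extending a prefix with a symbol, is a hypothetical stand-in for the model scores.

    def beam_search(symbols, steps, step_prob, beam_width=2):
        """Breadth-first beam search: keep only the beam_width most
        probable prefixes after each expansion step."""
        beams = [("", 1.0)]  # (prefix, probability) pairs, best first
        for _ in range(steps):
            # Expand every surviving prefix by every symbol.
            candidates = [(prefix + s, p * step_prob(prefix, s))
                          for prefix, p in beams
                          for s in symbols]
            # Keep the beam_width most probable, ordered high to low.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

With symbols "abc", steps=3 and beam_width=2, and step probabilities matching the example above, this returns the two paths caa and cac.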
The probabilities of the combinations involved in the above calculation can be obtained from an n-gram language model. Taking a 2-gram model as an example, the frequency of each character combination of order 2 or less, counted from a large corpus, is used to represent the probability of that combination. Suppose there are three characters in total, a, b and c; by counting a large text corpus, the 2-gram model then stores probability values for the following combinations: a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc. Probability calculation during the search is performed by table lookup; for example, the probability of the combination abc is decomposed into the values for ab and bc, which are looked up and multiplied.
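The table lookup can be sketched as follows; the probability values in the tables are made-up numbers for illustration only.

    # Hypothetical 2-gram tables counted from a corpus.
    unigram = {"a": 0.3, "b": 0.4, "c": 0.3}  # P(first character)
    bigram = {"ab": 0.4, "bc": 0.5}           # P(second character | first)

    def sentence_prob_2gram(sentence):
        """P(abc) decomposes into P(a) * P(b|a) * P(c|b); each factor
        is obtained by table lookup."""
        p = unigram[sentence[0]]
        for i in range(1, len(sentence)):
            p *= bigram[sentence[i - 1] + sentence[i]]
        return p

    print(sentence_prob_2gram("abc"))  # 0.3 * 0.4 * 0.5 = 0.06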
Beam search improves the speech recognition result by means of an n-gram language model, and the n-gram is implemented by table lookup. In practical applications, for an audio input the beam search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences in the list are sorted by probability from high to low, where each probability value is obtained by weighting the probabilities from the acoustic model and the n-gram language model. The n-gram language model is a local model; its advantage is high efficiency, but it cannot achieve long-context understanding. Its disadvantage is that it is difficult for an n-gram model to model a long sentence: it cannot use the context of the whole sentence, and its understanding of context is not accurate enough.
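The patent does not spell out the weighting formula; a common choice, assumed here purely for illustration, is a log-linear combination of the two model scores.

    import math

    def combined_score(p_acoustic, p_lm, lm_weight=0.5):
        """Weighted fusion of acoustic-model and language-model
        probabilities; the weight value is an assumption, not taken
        from the patent."""
        return math.log(p_acoustic) + lm_weight * math.log(p_lm)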
Disclosure of Invention
The application provides a speech recognition method and apparatus, an electronic device and a readable storage medium, which break through the limitation of the n-gram model and make use of the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence produced by beam search is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
The technical scheme adopted by the application is as follows:
In a first aspect, a speech recognition method is provided, which includes:
acquiring a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list, wherein determining the probability of each candidate sentence in the candidate sentence list comprises: determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in it;
and reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
Optionally, determining the probability of occurrence of each character in any candidate sentence based on the pre-trained mask-based neural network model includes:
determining character types and probabilities of all positions based on a pre-trained mask-based neural network model;
determining the probability of each character in any candidate sentence based on the character category and the probability of each position;
determining the probability of any candidate sentence based on the probability of each character in any candidate sentence, comprising:
and taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, determining the character category and probability for each position based on the pre-trained mask-based neural network model includes:
erasing the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased;
and inputting the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
Optionally, the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
Optionally, the method further comprises:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a second aspect, a speech recognition apparatus is provided, including:
the acquisition module is used for acquiring a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
the determining module is used for determining the probability of each candidate sentence in the candidate sentence list; the determining module is specifically used for determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in it;
and the reordering module is used for reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence, to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
the first determining unit is used for determining the character type and probability of each position based on a pre-trained mask-based neural network model;
the second determining unit is used for determining the probability of each character in any candidate sentence based on the character type and the probability of each position;
and a unit for taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased, and to input the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
Optionally, the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
Optionally, the apparatus further comprises a module for taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method shown in the first aspect.
Compared with the prior art, in which speech recognition via an n-gram language model in beam search has difficulty modeling long sentences, cannot use the context of the whole sentence, and is usually not accurate enough in understanding context, the present application acquires a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence from the probabilities of its characters; and reorders each candidate sentence in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an example of probability determination according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
An embodiment of the present application provides a speech recognition method, as shown in fig. 1, the method may include the following steps:
step S101, acquiring a candidate sentence list obtained by performing voice recognition on a target audio based on a cluster searching method, wherein the candidate sentence list comprises a plurality of candidate sentences;
in particular, in practical applications, for a target audio input, the bundle search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences of the list are sorted from high to low according to the probability, and the probability value is obtained by weighting the probability of each acoustic model and the probability of each n-gram language model. The n-gram language model belongs to a local model, and has the advantages of high efficiency and incapability of realizing long context understanding. The method has the disadvantages that the n-gram model is difficult to model a long sentence, cannot utilize context information of the whole sentence, and is not accurate enough for understanding the context.
For example, for the utterance "It suddenly rained today, but I forgot to take an umbrella when I went out", suppose the last characters were not recorded clearly enough because of noise or other reasons; the beam search algorithm may then produce the following output list, whose entries differ only in characters that sound alike in the original Chinese:
"It rained suddenly today, but I didn't bring three"
"It rained suddenly today, but I didn't stay away"
"It rained suddenly today, but I didn't have an umbrella"
The output sentences in the list differ only in their last characters. Because "take an umbrella" and "it rains" are six characters apart in the original Chinese, an n-gram language model that wanted to exploit this contextual relationship would need a context length of at least 10 (the six intervening characters plus the four characters of "it rains" and "take an umbrella"), i.e., a 10-order language model, which is clearly unrealistic (an ordinary n-gram language model reaches order 5 at most). How to break through this limitation of the n-gram model thus becomes the problem. It is precisely the pre-trained mask-based neural network language model that is used to reorder, and essentially correct, the beam search output list described above.
Step S102, determining the probability of each candidate sentence in the candidate sentence list, which comprises: determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in it.
Determining the probability of each character in a candidate sentence through the pre-trained mask-based neural network model breaks through the limitation of the n-gram model, which cannot use the context of the whole sentence and whose understanding of context is not accurate enough.
Specifically, the pre-trained mask-based neural network model of the present application may be a masked language model (Mask LM) or any other model that implements the functions of the present application. Training a Mask LM is similar to a fill-in-the-blank exercise on text: an input sentence I containing blanks is processed by a deep network, which outputs a sentence P, and the goal is to make P as close as possible to the true sentence T. For example:
I: If so, you only need to right-click each [ ] letter and select "[ ]"
P: If so, you only need to right-click each drive letter and select "format"
T: If so, you only need to right-click each drive letter and select "format"
Through training, the Mask LM can predict the character at any position. The network implementing the Mask LM is not limited to time-series models such as RNN/GRU/LSTM or their attention-based improved versions; Transformer network structures (BERT, GPT) may also be used.
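As a concrete illustration of mask prediction, the sketch below assumes the HuggingFace transformers library and a Chinese BERT checkpoint; neither is prescribed by the patent.

    from transformers import pipeline

    # Fill-mask pipeline with a masked language model; the checkpoint
    # name is an example choice, not specified by the patent.
    fill_mask = pipeline("fill-mask", model="bert-base-chinese")

    # Predict the erased character and its probability for each candidate.
    for prediction in fill_mask("我爱[MASK]国"):
        print(prediction["token_str"], prediction["score"])

A well-trained model should rank "中" near the top here, mirroring the Fig. 3 example in which the probability of "中" in "我爱中国" is high.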
Step S103, based on the determined probability of each candidate sentence in the candidate sentence list, reordering each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
Illustratively, the probabilities of the sentences "It rained suddenly today, but I didn't bring three", "It rained suddenly today, but I didn't stay away" and "It rained suddenly today, but I didn't have an umbrella" are obtained respectively; the sentences can then be reordered according to the magnitudes of these probability values, so that the resulting target candidate sentence list is more accurate.
The embodiment of the present application provides a possible implementation. Specifically, determining the probability of occurrence of each character in any candidate sentence based on the pre-trained mask-based neural network model includes:
determining the character category and probability for each position based on the pre-trained mask-based neural network model. Specifically, the character at any position is erased by means of a mask, to obtain the candidate sentence with the character at that position erased; the candidate sentence with the erased character is then input into the pre-trained mask-based neural network model, to obtain the character category and probability for that position. Specifically, the last layer of the pre-trained mask-based neural network model is a softmax activation function, which classifies the character corresponding to the masked (erased) position. Illustratively, for "我爱-国" ("I love - country") in Fig. 3, the character corresponding to the "-" position may be "中", "美", "大", and so on.
Determining the probability of each character in any candidate sentence based on the character category and the probability of each position;
determining the probability of any candidate sentence based on the probability of each character in any candidate sentence, comprising:
and taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
The embodiment of the application provides a possible implementation manner, and specifically, the pre-trained mask-based neural network model is a time sequence-based neural network model.
Specifically, each character in the sentence is erased in turn, i.e., replaced with a "-" (or another placeholder symbol) by way of a mask. The Mask LM then predicts all possible characters for that position (including the actual character) and their probability values, and the probability corresponding to the actual character at that position is taken as the score of that position in the sentence. The final total score is the product of the score values (probability values) over all positions.
For example, to calculate the probability of the sentence "it still looks very stylish when worn" (an eight-character sentence in the original Chinese), the following decomposition is performed, masking one position at a time and scoring the actual character at that position:
position 1 masked: 0.1
position 2 masked: 0.2
position 3 masked: 0.3
position 4 masked: 0.6
position 5 masked: 0.7
position 6 masked: 0.2
position 7 masked: 0.3
position 8 masked: 0.5
For each position of the sentence, the probability value of the actual character is calculated (the number after the colon), and the probability value of the sentence is finally obtained by multiplying all of these values together. P(s) is used to denote the probability value of the sentence:
P(s) = P_1 × P_2 × ... × P_n = ∏_{i=1}^{n} P_i
where n is the length of the sentence, i is a position, and P_i is the probability value of the character at position i; the value of P_i is predicted by the language model from the characters at the other positions.
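Plugging in the per-position values from the worked example above gives the sentence probability; this is a minimal sketch, with log-space used only for numerical stability.

    import math

    # Per-position probabilities P_i from the worked example above.
    position_probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.2, 0.3, 0.5]

    # P(s) is the product of P_i over all positions.
    log_p = sum(math.log(p) for p in position_probs)
    print(math.exp(log_p))  # approximately 7.56e-05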
Fig. 3 shows how P_i is calculated for the sentence "我爱中国" ("I love China"): the probability of the character "中" in this sentence is 0.99, i.e., the probability value output by the final softmax layer. If the sentence were "我爱你国" ("I love you country"), the probability value of "你" would be 0.0001.
To reorder the sentences, the probabilities P(S1), P(S2), ..., P(Sm) are calculated respectively for the sentence list S1, S2, S3, ..., Sm obtained by the beam search.
Finally, the sentences are sorted by probability value from largest to smallest to obtain the reordered sentence list, as sketched below.
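The reranking step then reduces to a sort by the masked-LM sentence probability; in this minimal sketch, sentence_prob stands for the P(s) computation described above.

    def rerank(candidates, sentence_prob):
        """Sort the beam-search candidate sentences by masked-LM
        probability, from largest to smallest."""
        return sorted(candidates, key=sentence_prob, reverse=True)

    # The first element of the reranked list is then taken as the final
    # speech recognition result for the target audio, for example:
    # best = rerank(candidate_list, sentence_prob)[0]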
The embodiment of the present application provides a possible implementation manner, and specifically, the method further includes:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
Compared with the prior art, in which speech recognition via an n-gram language model in beam search has difficulty modeling long sentences, cannot use the context of the whole sentence, and is not accurate enough in understanding context, the method of the present application acquires a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence from the probabilities of its characters; and reorders each candidate sentence in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
Example two
Fig. 2 is a speech recognition apparatus according to an embodiment of the present application, where the apparatus 20 includes: an acquisition module 201, a determination module 202, a reordering module 203, wherein,
an obtaining module 201, configured to obtain a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, where the candidate sentence list includes a plurality of candidate sentences;
a determining module 202, configured to determine a probability of each candidate sentence in the candidate sentence list; the determining module is specifically used for determining the probability of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each character in any candidate sentence;
and a reordering module 203, configured to reorder each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence, to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
the first determining unit is used for determining the character type and probability of each position based on a pre-trained mask-based neural network model;
the second determining unit is used for determining the probability of each character in any candidate sentence based on the character type and the probability of each position;
and a unit for taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased, and to input the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
Optionally, the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
Optionally, the apparatus 20 further includes a module configured to use the target candidate sentence with the highest probability value in the reordered target candidate sentence list as a speech recognition result of the target audio.
Compared with the prior art, in which speech recognition via an n-gram language model in beam search has difficulty modeling long sentences, cannot use the context of the whole sentence, and is not accurate enough in understanding context, the speech recognition apparatus of the present application acquires a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence from the probabilities of its characters; and reorders each candidate sentence in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately; the probability of each candidate sentence is then determined and the candidate sentences are reordered, making the speech recognition result more accurate.
The apparatus of the embodiment of the present application can execute the method shown in the first embodiment of the present application, and the implementation effect is similar, which is not described herein again.
EXAMPLE III
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 401 and a memory 403, where the processor 401 is coupled to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that the number of transceivers 404 is not limited to one in practical applications, and the structure of the electronic device 40 does not limit the embodiments of the present application. The processor 401 is used in the embodiment of the present application to implement the functions of the modules shown in fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 402 may include a path that transfers information between the above components. The bus 402 may be a PCI bus or an EISA bus, etc. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 401. The processor 401 is configured to execute application program code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
The embodiment of the present application provides an electronic device suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
Example four
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the speech recognition method shown in the above embodiment.
The embodiment of the present application provides a computer-readable storage medium suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list, wherein the determining the probability of each candidate sentence in the candidate sentence list comprises: determining the probability of occurrence of each character in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probability of occurrence of each character in the candidate sentence;
and reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list, to obtain a reordered target candidate sentence list.
2. The method of claim 1, wherein the determining the probability of occurrence of each character in any candidate sentence based on the pre-trained mask-based neural network model comprises:
determining the character category and probability for each position based on the pre-trained mask-based neural network model;
and determining the probability of occurrence of each character in the candidate sentence based on the character category and probability for each position;
and wherein the determining the probability of any candidate sentence based on the probability of occurrence of each character in the candidate sentence comprises:
taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
3. The method of claim 2, wherein the determining the character category and probability for each position based on the pre-trained mask-based neural network model comprises:
erasing the character at any position by means of a mask, to obtain the candidate sentence with the character at that position erased;
and inputting the candidate sentence with the erased character into the pre-trained mask-based neural network model, to obtain the character category and probability for that position.
4. The method of claim 3, wherein the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked (erased) position.
5. The method of any of claims 1-4, wherein the pre-trained mask-based neural network model is a time-series sequence-based neural network model.
6. The method of claim 5, further comprising:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
7. A speech recognition apparatus, comprising:
an acquisition module, configured to acquire a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
a determining module, configured to determine the probability of each candidate sentence in the candidate sentence list, wherein the determining module is specifically configured to determine, based on a pre-trained mask-based neural network model, the probability of occurrence of each character in any candidate sentence, and determine, based on the probability of occurrence of each character in the candidate sentence, the probability of the candidate sentence;
and a reordering module, configured to reorder each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence, to obtain a reordered target candidate sentence list.
8. The apparatus of claim 7, wherein the determining module comprises:
a first determining unit, configured to determine the character category and probability for each position based on a pre-trained mask-based neural network model;
a second determining unit, configured to determine, based on the character category and probability for each position, the probability of occurrence of each character in any candidate sentence;
and a unit configured to take the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method of any one of claims 1 to 6.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the speech recognition method of any of claims 1 to 6.
CN202011402934.XA 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium Active CN112542162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402934.XA CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402934.XA CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112542162A true CN112542162A (en) 2021-03-23
CN112542162B CN112542162B (en) 2023-07-21

Family

ID=75015789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402934.XA Active CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112542162B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011075602A (en) * 2009-09-29 2011-04-14 Brother Industries Ltd Device, method and program for speech recognition
US20190139540A1 (en) * 2016-06-09 2019-05-09 National Institute Of Information And Communications Technology Speech recognition device and computer program
CN110517693A (en) * 2019-08-01 2019-11-29 出门问问(苏州)信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112017645A (en) * 2020-08-31 2020-12-01 广州市百果园信息技术有限公司 Voice recognition method and device

Also Published As

Publication number Publication date
CN112542162B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN102640089B (en) The text input system of electronic equipment and text entry method
US7412093B2 (en) Hybrid apparatus for recognizing answer type
EP1619620A1 (en) Adaptation of Exponential Models
US20130041857A1 (en) System and method for inputting text into electronic devices
US20170061958A1 (en) Method and apparatus for improving a neural network language model, and speech recognition method and apparatus
EP1696422A2 (en) Method for converting phonemes to written text and corresponding computer system and computer program
CN103854643A (en) Method and apparatus for speech synthesis
US5553284A (en) Method for indexing and searching handwritten documents in a database
US8019593B2 (en) Method and apparatus for generating features through logical and functional operations
CN114780691B (en) Model pre-training and natural language processing method, device, equipment and storage medium
JP2002082689A (en) Recognition system using lexical tree
CN111444719A (en) Entity identification method and device and computing equipment
CN109284358B (en) Chinese address noun hierarchical method and device
CN112185361B (en) Voice recognition model training method and device, electronic equipment and storage medium
CN113238797A (en) Code feature extraction method and system based on hierarchical comparison learning
CN117709355B (en) Method, device and medium for improving training effect of large language model
CN111428487B (en) Model training method, lyric generation method, device, electronic equipment and medium
KR100542757B1 (en) Automatic expansion Method and Device for Foreign language transliteration
JP6605997B2 (en) Learning device, learning method and program
CN111026848B (en) Chinese word vector generation method based on similar context and reinforcement learning
CN117112916A (en) Financial information query method, device and storage medium based on Internet of vehicles
CN112632956A (en) Text matching method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant