CN112802461A - Speech recognition method and device, server, computer-readable storage medium


Info

Publication number: CN112802461A (application CN202011607655.7A); granted publication: CN112802461B
Authority: CN (China)
Prior art keywords: acoustic, decoding network, decoding, sub, speech recognition
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112802461B
Inventors: 周维聪, 袁丁, 赵金昊, 刘云峰
Original and current assignee: Shenzhen Zhuiyi Technology Co Ltd
Events: application filed by Shenzhen Zhuiyi Technology Co Ltd with priority to CN202011607655.7A; publication of CN112802461A; application granted; publication of CN112802461B

Classifications

    • G - Physics
    • G10L - Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 15/142 - Speech recognition; speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/02 - Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech-to-text systems
    • G10L 25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum


Abstract

The application relates to a speech recognition method and apparatus, a server, and a computer-readable storage medium. The method includes: extracting acoustic features from the speech data to be processed, inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score for the acoustic features; then decoding the acoustic features and their acoustic model scores with a main decoding network and a sub-decoding network to obtain a speech recognition result. In this speech recognition method, no decoding network is retrained for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are then used together for decoding. As a result, the target named entities in the scene to be recognized can be decoded accurately on the basis of the sub-decoding network, and because no decoding network is retrained for the scene, the training time is greatly shortened and the efficiency of speech recognition is improved.

Description

Speech recognition method and device, server, computer readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a speech recognition method and apparatus, a server, and a computer-readable storage medium.
Background
With the continuous development of artificial intelligence and natural language processing technology, speech recognition technology has also developed rapidly. Speech recognition can automatically convert an audio signal into corresponding text or commands, and conventional speech recognition performs well in common, everyday recognition scenarios.
However, when it is applied to a professional scenario containing a large number of specialized terms, the recognition quality drops. Retraining a dedicated decoding network for such a scenario would address this, but the workload of retraining the decoding graph is large and the training time is long, so it cannot be done quickly.
Disclosure of Invention
The embodiments of the application provide a speech recognition method and apparatus, a server, and a computer-readable storage medium that reduce the workload of retraining a decoding graph, greatly shorten the training time, and improve the efficiency of speech recognition in a specific application scenario.
A speech recognition method comprising:
extracting acoustic features of the voice data to be processed;
inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
A speech recognition apparatus, the apparatus comprising:
the acoustic feature extraction module is used for extracting acoustic features of the voice data to be processed;
an acoustic model score calculation module, configured to input the extracted acoustic features into an acoustic model, and calculate an acoustic model score of the acoustic features;
the decoding module is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
A server comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the above method.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as above.
The speech recognition method and apparatus, the server, and the computer-readable storage medium extract acoustic features from the speech data to be processed, input the extracted acoustic features into an acoustic model, and calculate an acoustic model score for the acoustic features. The acoustic features and their acoustic model scores are then decoded with a main decoding network and a sub-decoding network to obtain a speech recognition result, where the sub-decoding network is a decoding network obtained by training on the target named entities in the scene to be recognized. Because no decoding network is retrained for the scene to be recognized, and only the target named entities in that scene are trained to obtain the sub-decoding network, the target named entities can be decoded accurately on the basis of the sub-decoding network while the training time is greatly shortened and the efficiency of speech recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating an exemplary implementation of a speech recognition method;
FIG. 2 is a flow diagram of a method of speech recognition in one embodiment;
FIG. 3 is a flow diagram of a method for generating a primary decoding network in one embodiment;
FIG. 4 is a schematic diagram of a portion of a primary decoding network in one embodiment;
FIG. 5 is a flow diagram of a method for generating a sub-decoding network in one embodiment;
FIG. 6 is a diagram illustrating the structure of a speech recognition lattice according to an embodiment;
FIG. 7 is a flowchart of a method for decoding with a main decoding network and a sub-decoding network to obtain a speech recognition lattice in one embodiment;
FIG. 8 is a diagram illustrating an embodiment of decoding with a main decoding network and a sub-decoding network to obtain a speech recognition lattice;
FIG. 9 is a flowchart of a method for obtaining speech recognition results based on speech recognition lattice in one embodiment;
FIG. 10 is a block diagram showing the structure of a speech recognition apparatus according to an embodiment;
FIG. 11 is a block diagram showing the construction of a speech recognition apparatus according to another embodiment;
FIG. 12 is a schematic diagram of the internal structure of the server in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
Fig. 1 is a diagram illustrating an application scenario of the speech recognition method according to an embodiment. As shown in fig. 1, the application environment includes a terminal 120 and a server 140, which are connected through a network. The server 140 uses the speech recognition method of the application to extract acoustic features from the speech data to be processed, inputs the extracted acoustic features into an acoustic model, calculates the acoustic model score of the acoustic features, and decodes the acoustic features and their acoustic model scores with a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized. The terminal 120 may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an in-vehicle computer, a wearable device, or a smart home device.
Fig. 2 is a flowchart of a speech recognition method in an embodiment. As shown in fig. 2, a speech recognition method is provided and applied to a server, and includes steps 220 to 260.
step 220, extracting acoustic features of the voice data to be processed.
The speech data may be an acquired audio signal, for example an audio signal captured in a voice input scenario, an intelligent chat scenario, or a speech translation scenario. Acoustic features are extracted from the speech data to be processed: a feature extraction algorithm converts the acquired one-dimensional audio signal into a sequence of high-dimensional vectors, and these vectors are the acoustic features. Common acoustic features include MFCC, Fbank, and i-vector features, which the application does not limit. Fbank (filter bank) is a front-end processing algorithm that processes audio in a way similar to the human ear and can improve the performance of speech recognition. The usual steps for obtaining Fbank features from a speech signal are pre-emphasis, framing, windowing, short-time Fourier transform (STFT), Mel filtering, and mean normalization. MFCC features can then be obtained by applying a discrete cosine transform (DCT) to the Fbank features.
Mel-frequency cepstral coefficients (MFCCs) are extracted based on the auditory characteristics of the human ear, so the Mel frequency and the Hz frequency form a nonlinear correspondence, and MFCCs are the spectral features calculated from this nonlinear mapping. MFCCs are mainly used for feature extraction from speech data and for reducing the dimensionality of the computation; for example, for a frame with 512 sample points, the most informative 40 dimensions can be retained after MFCC extraction, achieving the purpose of dimensionality reduction. The i-vector, in turn, is a feature vector that characterizes each speaker.
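For illustration only (this code is not part of the patent), the sketch below shows one common way to compute Fbank and MFCC features in Python with the librosa library; the sample rate, window sizes, and filter counts are assumed values for a typical 16 kHz setup.

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000):
    # Load the one-dimensional audio signal
    y, sr = librosa.load(wav_path, sr=sr)
    # Pre-emphasis
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing, windowing, STFT and Mel filtering are handled inside
    # melspectrogram; 25 ms windows with a 10 ms hop at 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, win_length=400, hop_length=160, n_mels=40)
    fbank = librosa.power_to_db(mel)               # log-Mel (Fbank) features
    fbank -= fbank.mean(axis=1, keepdims=True)     # mean normalization
    # MFCC: a discrete cosine transform (DCT) applied to the log-Mel spectrum
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=13)
    return fbank.T, mfcc.T                         # (frames, dims)
```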
Step 240, inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features.
Specifically, the acoustic model may include a neural network model and a hidden Markov model. The neural network model provides acoustic modeling units to the hidden Markov model, and the granularity of these units may be words, syllables, phonemes, or states; a state is the mathematical characterization of a stage of a Markov process. The hidden Markov model then determines the phoneme sequence from the acoustic modeling units provided by the neural network model. The acoustic model is trained in advance on an audio training corpus.
The extracted acoustic features are input into the acoustic model, and an acoustic model score is calculated for them. The acoustic model score can be regarded as a score computed from the probability of each phoneme occurring given each acoustic feature.
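As a rough, hypothetical illustration of what "calculating an acoustic model score" can mean in practice (the patent does not prescribe an implementation), the following sketch uses a small feed-forward network to produce per-frame log-probabilities over acoustic units and sums them along a candidate alignment.

```python
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Maps frame-level acoustic features to log-probabilities over
    acoustic modeling units (e.g. phonemes or HMM states)."""
    def __init__(self, feat_dim: int = 40, num_units: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_units),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_frames, feat_dim) -> (num_frames, num_units)
        return torch.log_softmax(self.net(feats), dim=-1)

model = ToyAcousticModel()
feats = torch.randn(200, 40)                 # 200 frames of Fbank features
log_probs = model(feats)                     # per-frame unit log-probabilities
# The acoustic model score of a hypothesized frame-level unit sequence is the
# sum of the log-probabilities of those units over the frames.
unit_ids = torch.randint(0, 128, (200,))     # a candidate alignment
am_score = log_probs[torch.arange(200), unit_ids].sum()
```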
Step 260, decoding the acoustic features and the acoustic model scores of the acoustic features with a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
The decoding network is used to find the best decoding path given the phoneme sequence, from which the speech recognition result is obtained. In the embodiment of the application, the decoding networks used include a main decoding network and a sub-decoding network: the main decoding network decodes the phoneme sequences of the corpus other than the target named entities, while the sub-decoding network decodes the phoneme sequences of the target named entities. Decoding the acoustic features and their acoustic model scores with both networks therefore yields the speech recognition result.
The target named entities in the scene to be recognized include named entities whose speech recognition error rate exceeds a preset error rate threshold, as well as manually specified professional vocabulary of that scene.
In the embodiment of the application, to address the low accuracy of speech recognition in a specific application scenario, a speech recognition method is provided: acoustic features are extracted from the speech data to be processed, the extracted acoustic features are input into an acoustic model, and the acoustic model score of the acoustic features is calculated; the acoustic features and their acoustic model scores are then decoded with a main decoding network and a sub-decoding network to obtain a speech recognition result. In this method, no decoding network is retrained for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are used together for decoding. The target named entities can therefore be decoded accurately on the basis of the sub-decoding network, and because no decoding network is retrained for the scene, the training time is greatly shortened and the efficiency of speech recognition is improved.
In one embodiment, as shown in fig. 3, the generation process of the main decoding network includes:
and 320, hollowing out the named entities in the original text training corpus to obtain the target text training corpus.
The obtaining of the original text corpus may be obtaining the original text corpus from a corpus. The original text corpus approximately contains 700-. The language material base is used for storing the language material which is actually appeared in the practical use of the language, and the language material base is used for bearing the basic resource of language knowledge by taking an electronic computer as a carrier. Typically, the actual corpus needs to be processed (e.g., analyzed and processed) to become a useful resource. Named entity (named entity), as the name implies, named entity is the name of a person, organization, place, and all other entities identified by name, and the broader entities include numbers, dates, currencies, addresses, and the like.
Because named entities in different application scenarios are greatly different and may not be corpora contained in the original text corpus. Therefore, in order to improve the accuracy of speech recognition in a specific application scenario, a named entity in an original text corpus may be hollowed out, and the hollowed-out position is represented by a hollow node. Thus, the corpus not containing the named entities is left, and the corpus not containing the named entities constitutes the target text corpus.
Step 340, training the target text training corpus to obtain a language model.
After the target text training corpus is obtained, it is used to train a language model. The language model may be trained with a recurrent neural network, in which case it is also called an RNNLM (Recurrent Neural Network Based Language Model). Besides the currently input word, a recurrent-network language model also takes into account the words input before it, and can calculate the probability of the next word from the long history they form, so it has a comparatively good "memory effect". For example, after "my" and "mood", both "good" and "bad" may appear, and which word appears depends on the earlier occurrence of "my" and "mood"; this dependence is the "memory effect".
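A minimal PyTorch sketch of such a recurrent-network language model is shown below; it is an illustrative assumption about one possible RNNLM structure, not the patent's implementation, and the vocabulary size and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal recurrent-network language model: predicts the next word from
    the words seen so far (the "memory effect" described above)."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(self.embed(word_ids))    # (batch, seq, hidden)
        return self.out(hidden)                       # logits over the next word

# Training-step sketch: shift the corpus by one position so the model learns
# P(next word | history) on the hollowed-out target text training corpus.
vocab_size = 10000
model = RNNLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, vocab_size, (8, 32))        # stand-in for tokenized text
logits = model(tokens[:, :-1])
loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```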
Step 360, training the speech training corpus corresponding to the original text training corpus to obtain the acoustic model.
A speech training corpus corresponding to the original text training corpus is then acquired; each text corpus entry in the original text training corpus has a corresponding speech corpus entry in the speech training corpus.
The acoustic model uses a deep neural network to model the mapping between acoustic pronunciations and basic acoustic units (typically phonemes), where a phoneme is the smallest unit of speech divided according to the natural attributes of speech. The acoustic model receives the input acoustic features and outputs the phoneme sequence corresponding to them. Acoustic features are extracted from the speech corpora in the speech database, and the acoustic model is trained on these extracted features.
Step 380, combining the language model with the acoustic model to obtain the main decoding network, where the main decoding network contains empty nodes corresponding to the hollowed-out positions in the target text training corpus.
The named entities in the original text training corpus are hollowed out to obtain the target text training corpus, which is trained to obtain the language model, and the speech training corpus corresponding to the original text training corpus is trained to obtain the acoustic model. Combining the language model with the acoustic model then yields the main decoding network. The main decoding network contains nodes corresponding to the training corpus other than the named entities; the named entities themselves are represented by empty nodes, which correspond to the hollowed-out positions, i.e., the positions of the named entities, in the target text training corpus.
Fig. 4 is a schematic diagram of part of a main decoding network in one embodiment. A word sequence consists of nodes and jump edges, where the nodes include a start node, intermediate nodes, and a termination node. As shown in fig. 4, node 1 is the start node, nodes 2, 3, 4, and 5 are intermediate nodes, and node 6 is the termination node. Jump edges connect the nodes from the start node to the termination node, and each jump edge carries word information and phoneme information. The jump edge between node 1 and node 2 carries the word information "hello" and the phoneme information "n"; the jump edge between node 2 and node 3 carries blank word information and the phoneme information "i"; the jump edge between node 3 and node 4 carries blank word information and the phoneme information "h"; and the jump edge between node 4 and node 5 carries blank word information and the phoneme information "ao". Node 5 is the empty node obtained by hollowing out the named entity, so the jump edge between node 5 and node 6 carries neither word information nor phoneme information.
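Purely as an illustration of the structure just described (nodes, jump edges carrying word and phoneme information, and an empty node), the following sketch encodes the Fig. 4 fragment in simple Python data classes; the class and field names are assumptions for illustration, not terminology from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    """A jump edge carrying phoneme (input) and word (output) information;
    empty strings stand for the blank labels described for Fig. 4."""
    src: int
    dst: int
    phoneme: str = ""      # e.g. "n", "i", "h", "ao"
    word: str = ""         # e.g. "hello"; "" for blank word information
    weight: float = 0.0    # language model contribution

@dataclass
class DecodingGraph:
    arcs: list = field(default_factory=list)
    start: int = 1
    final: int = 6
    empty_nodes: set = field(default_factory=set)   # hollowed-out positions

# The fragment of the main decoding network shown in Fig. 4:
main_graph = DecodingGraph(
    arcs=[Arc(1, 2, "n", "hello"), Arc(2, 3, "i"), Arc(3, 4, "h"),
          Arc(4, 5, "ao"), Arc(5, 6)],   # the edge leaving node 5 carries no labels
    empty_nodes={5},
)
```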
In the embodiment of the application, the named entities in the original text training corpus are hollowed out to obtain the target text training corpus, which is trained to obtain the language model, while the speech training corpus corresponding to the original text training corpus is trained to obtain the acoustic model. Combining the language model with the acoustic model yields the main decoding network, which contains empty nodes corresponding to the hollowed-out positions in the target text training corpus. Because the named entities are hollowed out, this main decoding network can be combined with a sub-decoding network trained on the target named entities of any specific scene, so it is suitable for speech recognition in any specific scene and improves both the accuracy and the efficiency of recognition there.
In one embodiment, combining the language model with the acoustic model to obtain a main decoding network comprises:
and combining the language model and the acoustic model by adopting a compound algorithm to obtain a main decoding network.
In the embodiment of the present application, the output tag on a certain branch of the first WFST can be equal to the input tag on a certain branch of the second WFST through the composition algorithm, and then label and weight on the branches are respectively operated. The specific implementation code of the composition algorithm is not described in detail in this application.
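To make the label-matching idea concrete, here is a deliberately simplified sketch of composition on toy arc lists; it is an assumption-laden illustration and not the composition routine of any real WFST toolkit.

```python
def compose(fst_a, fst_b):
    """Toy composition sketch: a composed arc exists wherever the output label
    of an arc in fst_a equals the input label of an arc in fst_b, and the
    weights of the two arcs are combined (here simply added).
    Arcs are (src, dst, input_label, output_label, weight) tuples; epsilon
    handling and state bookkeeping of a real WFST toolkit are omitted."""
    composed = []
    for sa, da, ia, oa, wa in fst_a:
        for sb, db, ib, ob, wb in fst_b:
            if oa == ib:
                composed.append(((sa, sb), (da, db), ia, ob, wa + wb))
    return composed

# An arc mapping a phoneme sequence to the word "hello" (acoustic/lexicon side),
# composed with an arc consuming "hello" on the language model side:
A = [(0, 1, "n i h ao", "hello", 0.5)]
B = [(0, 1, "hello", "hello", 0.1)]
print(compose(A, B))   # [((0, 0), (1, 1), 'n i h ao', 'hello', 0.6)]
```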
In one embodiment, as shown in fig. 5, the generation process of the sub-decoding network includes:
and step 520, collecting the target named entities in the scene to be identified to form a target named entity text.
The target named entities in the scene to be recognized comprise named entities, wherein some voice recognition error rates specified in the scene to be recognized by a person exceed a preset error rate threshold. In addition, professional vocabularies in a scene to be recognized can be used as the target named entity text, for example, for a medical scene, professional vocabularies such as doctors, patients, blood pressure, heartbeat, and ct (computed tomogry) can be used as the target named entity text. And for the electric competition scene, professional vocabularies such as chicken eating and MVP are used as the target named entity text. Obviously, the target named entities in different application scenarios vary widely.
And respectively collecting target named entities in different application scenes to form target named entity texts in all application scenes.
Step 540, assigning a language model score to the target named entity text.
The language model may be trained with a recurrent neural network, in which case it is also called an RNNLM (Recurrent Neural Network Based Language Model). As described above, such a recurrent-network language model takes the previously input words into account in addition to the current word and can calculate the probability of the next word from that history, giving it a comparatively good "memory effect".
The target named entity text can be assigned a language model score manually; in general, giving the target named entity text a relatively high language model score improves the recognition accuracy of the target named entities. Here, a relatively high score means one that exceeds a preset score threshold, for example a threshold of 0.9.
Step 560, combining the target named entity text that has been assigned the language model score with the acoustic model to obtain the sub-decoding network.
A composition algorithm is used to combine the target named entity text, together with its assigned language model score, with the acoustic model, yielding the sub-decoding network.
In the embodiment of the application, the target named entities in the scene to be recognized are collected to form a target named entity text, the text is assigned a language model score, and the text with its language model score is combined with the acoustic model to obtain the sub-decoding network. The sub-decoding network can be inserted at the empty nodes of the main decoding network, so that the speech data can be recognized accurately and completely by the main decoding network and the sub-decoding network together.
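As a hedged illustration of what the resulting sub-decoding network might look like for a single target named entity, the sketch below builds a small linear graph using the Arc and DecodingGraph classes from the earlier sketch; the pronunciation lexicon, the score value 0.9, and the graph layout are all assumptions made for illustration.

```python
# Reuses the illustrative Arc and DecodingGraph classes from the earlier sketch.
LEXICON = {"shenzhen": ["sh", "en", "zh", "en"]}   # assumed pronunciation lexicon
LM_BOOST = 0.9                                     # manually assigned LM score (step 540)

def build_sub_graph(entity: str) -> DecodingGraph:
    """Builds a linear sub-decoding graph for one target named entity: one jump
    edge per phoneme, with the entity word and its boosted language model score
    attached to the first edge."""
    phonemes = LEXICON[entity]
    arcs = [Arc(i, i + 1,
                phoneme=ph,
                word=entity if i == 0 else "",
                weight=LM_BOOST if i == 0 else 0.0)
            for i, ph in enumerate(phonemes)]
    return DecodingGraph(arcs=arcs, start=0, final=len(phonemes))

sub_graph = build_sub_graph("shenzhen")
```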
In one embodiment, the decoding the acoustic features and the acoustic model scores of the acoustic features to obtain a speech recognition result by using a main decoding network and a sub-decoding network includes:
decoding the acoustic characteristics and the acoustic model scores of the acoustic characteristics by adopting a main decoding network and a sub decoding network to obtain a speech recognition grid lattice;
and obtaining a voice recognition result based on the voice recognition grid lattice.
Specifically, the main decoding network and the sub-decoding network decode the acoustic features and their acoustic model scores to obtain the speech recognition lattice: the main decoding network decodes the acoustic features other than those of the target named entities, and the sub-decoding network decodes the acoustic features corresponding to the target named entities.
The speech recognition lattice contains multiple candidate word sequences. A candidate word sequence comprises several words and several paths; the lattice is essentially a directed acyclic graph in which each node represents the ending time point of a word and each jump edge represents a possible word together with the acoustic model score and language model score of that word. When the speech recognition result is expressed, each node stores the recognition result at the current position, including information such as the acoustic probability and the language probability.
FIG. 6 is a diagram illustrating a structure of a speech recognition lattice according to an embodiment. Different word sequences can be obtained by walking from the leftmost initial node to the final node along different arcs, and the probabilities stored on the arcs are combined to represent the probability (score) that a certain segment of characters is obtained by input voice. For example, as shown in fig. 6, "hello shenzhen", "hello beijing", and "hello background" can be regarded as a path of the speech recognition result, i.e., "hello shenzhen", "hello beijing", and "hello background" are word sequences, and these word sequences constitute the speech recognition lattice. And each path in the graph corresponds to a probability, and the score of each path can be calculated according to the probability.
The resulting speech recognition lattice is usually rather large, so it can be pruned. One pruning method is to score the lattice in the forward and backward directions, compute the posterior probability of each jump edge, and delete the jump edges whose posterior probability is low. After this pruning, the lattice is simplified while its important information is still retained.
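The forward-backward pruning idea can be sketched as follows; this is an illustrative simplification that uses plain probabilities rather than the log-domain acoustic and language scores a real decoder would use.

```python
def prune_lattice(nodes, arcs, start, end, threshold=1e-4):
    """Forward-backward posterior pruning sketch. `arcs` is a topologically
    sorted list of (src, dst, prob) tuples."""
    forward = {n: 0.0 for n in nodes}
    forward[start] = 1.0
    for src, dst, p in arcs:                     # forward pass
        forward[dst] += forward[src] * p
    backward = {n: 0.0 for n in nodes}
    backward[end] = 1.0
    for src, dst, p in reversed(arcs):           # backward pass
        backward[src] += p * backward[dst]
    total = forward[end]
    # Keep only the jump edges whose posterior probability is high enough
    return [(s, d, p) for (s, d, p) in arcs
            if forward[s] * p * backward[d] / total >= threshold]
```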
From the pruned speech recognition lattice, a preset number of word sequences with the highest scores are extracted as candidate word sequences, and a target word sequence is then selected from the candidates as the speech recognition result. The number of target word sequences is generally one.
In the embodiment of the application, the main decoding network decodes the acoustic features other than those of the target named entities, and the sub-decoding network decodes the acoustic features corresponding to the target named entities, yielding the speech recognition lattice. Because the lattice contains multiple candidate word sequences, it is first pruned and then filtered, and finally a target word sequence is selected as the speech recognition result. The target named entities in the scene to be recognized can be decoded accurately on the basis of the sub-decoding network, and pruning the lattice produced by the main and sub decoding networks before selecting the target word sequence improves both the efficiency and the accuracy of speech recognition.
In one embodiment, the speech recognition lattice comprises a plurality of word sequences, wherein each word sequence comprises a node and a skip edge, and the skip edge carries word information of acoustic characteristics;
as shown in fig. 7, the method for decoding the acoustic features and the acoustic model scores of the acoustic features by using the main decoding network and the sub decoding network to obtain the speech recognition lattice includes:
and 720, sequentially acquiring the word sequences corresponding to the acoustic features in the voice data from the main decoding network.
Fig. 8 is a schematic diagram of a decoding process in an embodiment. And extracting acoustic features of the voice data to be processed. The extracted acoustic features are input into an acoustic model, and an acoustic model score of the acoustic features is calculated. The acoustic features include phonemes, but the present application is not limited thereto. For example, a segment of audio signal is received, and the acoustic features sequentially extracted from the segment of audio signal are eight phonemes n, i, h, ao, sh, en, zh, and en, and the word sequences corresponding to the eight phonemes are sequentially obtained from the main decoding network.
As shown in fig. 8, the word sequences corresponding to the eight phonemes are obtained from the main decoding network in the process of sequentially acquiring the word sequences, where node 1 is a start node, nodes 2, 3, 4, and 5 are intermediate nodes, and node 6 is a stop node. A jump edge is connected between the starting node and the terminating node, and word information and phoneme information are carried on the jump edge. Wherein, the skip edge between the node 1 and the node 2 carries word information: hello; the phoneme information is carried as follows: n is the same as the formula (I). The skip edge between node 2 and node 3 carries word information: blank; the phoneme information is carried as follows: i. the skip edge between node 3 and node 4 carries word information: blank; the phoneme information is carried as follows: h. the skip edge between node 4 and node 5 carries word information: blank; the phoneme information is carried as follows: ao (a). The node 5 is a hollow node obtained by hollowing out the named entity, and therefore, the skip edge between the node 5 and the node 6 does not carry word information nor phoneme information.
Step 740, if the word information on the jump edge of an intermediate node of the word sequence is empty, calling the sub-decoding network and acquiring from it the word sequence corresponding to the next acoustic feature in the audio signal.
While the word sequences corresponding to the acoustic features are being acquired in turn from the main decoding network, word information being empty on the jump edge of an intermediate node means that the named entity at that node has been hollowed out. Therefore, when this happens, the sub-decoding network is called, and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from it.
As shown in fig. 8, the jump edge leaving node 5 carries neither word information nor phoneme information, i.e., the word information on that jump edge is empty. At this point the sub-decoding network is called, and the next acoustic feature sh in the audio signal, followed by en, zh, and en, are matched there to obtain the corresponding word sequences.
Step 760, returning to the main decoding network once a termination node of the word sequence is reached in the sub-decoding network, and continuing to acquire from the main decoding network the word sequences corresponding to the subsequent acoustic features in the audio signal, until the word sequences of all acoustic features in the audio signal to be processed have been acquired in order.
Reaching a termination node of the word sequence in the sub-decoding network means that a word sequence has been completely recognized. For example, as can be seen from fig. 8, the word sequence completely recognized in the sub-decoding network from the phonemes sh, en, zh, and en is "shenzhen", at which point the termination node of that word sequence has been reached; the sub-decoding network may of course also produce several alternative word sequences for these phonemes, such as "magic needle" and "nystag". Decoding then returns to the termination node 6 of the main decoding network, and the word sequences of the following acoustic features are again acquired from the main decoding network until all acoustic features in the audio signal to be processed have been handled. For example, if the audio signal "hello shenzhen" is followed by the audio signal "I like you", the next acoustic feature wo is acquired from the main decoding network, followed by the word sequences for x, i, h, u, an, n, and i, until the word sequences of all acoustic features in the audio signal have been acquired in order.
Step 780, connecting the word sequences of all the acoustic features in order to form the speech recognition lattice.
The word sequences of all acoustic features in the speech data are connected in order to form the speech recognition lattice. For example, connecting the word sequences of the speech data in fig. 8 in order yields several candidate word sequences, such as "hello Shenzhen" and "hello magic needle", and these candidate word sequences make up the speech recognition lattice.
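The two-level traversal of steps 720 to 780 can be sketched, in a greatly simplified single-path form, as follows. The `match(node, phoneme)` helper and the `resume_node` attribute are hypothetical conveniences assumed for illustration (they are not defined in the earlier data-class sketch), and real decoding would keep multiple hypotheses to build the lattice rather than a single path.

```python
def decode(phonemes, main_graph, sub_graph):
    """Walk the main decoding network phoneme by phoneme; on reaching an empty
    node (a hollowed-out named entity position), switch into the sub-decoding
    network; once its termination node is reached, return to the main decoding
    network and continue."""
    words = []
    graph, node = main_graph, main_graph.start
    i = 0
    while i < len(phonemes):
        if graph is main_graph and node in main_graph.empty_nodes:
            # the outgoing jump edge carries no word/phoneme information:
            # call the sub-decoding network for the target named entity
            graph, node = sub_graph, sub_graph.start
        arc = graph.match(node, phonemes[i])   # hypothetical arc lookup by phoneme
        if arc.word:
            words.append(arc.word)
        node, i = arc.dst, i + 1
        if graph is sub_graph and node == sub_graph.final:
            # entity fully recognized: return to the main decoding network
            graph, node = main_graph, main_graph.resume_node
    return words

# For the Fig. 8 example, decode(["n", "i", "h", "ao", "sh", "en", "zh", "en"], ...)
# would yield ["hello", "shenzhen"].
```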
In the embodiment of the application, the word sequences corresponding to the acoustic features in the speech data are acquired in turn from the main decoding network. If the word information on the jump edge of an intermediate node of the word sequence is empty, the sub-decoding network is called and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from it; once a termination node of the word sequence is reached in the sub-decoding network, decoding returns to the main decoding network, which continues to supply the word sequences of the following acoustic features until the word sequences of all acoustic features in the audio signal to be processed have been acquired in order. Finally, the word sequences of all the acoustic features are connected in order to form the speech recognition lattice.
In other words, during decoding with the main decoding network, the sub-decoding network is called whenever the word information on the jump edge of an intermediate node is empty, and the speech recognition lattice is built from the candidate word sequences decoded by the main and sub decoding networks together. The target named entities are thus recognized accurately through the sub-decoding network.
In one embodiment, as shown in FIG. 9, the jump edge also carries the language model score of the acoustic feature; obtaining a voice recognition result based on the voice recognition grid lattice, comprising:
step 920, acquiring the total score of each word sequence in the speech recognition grid lattice based on the acoustic model score of the acoustic characteristics and the language model score of the acoustic characteristics;
step 940, the word sequence with the highest total score of the word sequences is obtained and used as the target word sequence;
step 960, obtaining word information of the target word sequence, and using the word information in the target word sequence as a voice recognition result.
The speech recognition lattice contains multiple candidate word sequences. A word sequence comprises several words and several paths; the lattice is essentially a directed acyclic graph in which each node represents the ending time point of a word and each jump edge represents a possible word together with the acoustic model score and language model score of that word.
For each word sequence, the acoustic model scores and language model scores on its jump edges are summed to obtain the total score of that word sequence in the speech recognition lattice. The word sequence with the highest total score is then taken as the target word sequence, and its word information is used as the speech recognition result.
In combination with fig. 8, if the word sequence "hello shenzhen" has the highest total score, its word information, namely "hello shenzhen", is taken as the speech recognition result.
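A minimal sketch of this selection step follows, under the assumption that the lattice has already been flattened into explicit candidate paths; the scores are made-up log-domain numbers used only for illustration.

```python
def best_word_sequence(lattice_paths):
    """Sketch of steps 920-960: each candidate word sequence is scored by
    summing the acoustic model and language model scores carried on its jump
    edges, and the word information of the highest-scoring sequence is returned
    as the recognition result. `lattice_paths` is an illustrative flattened view
    of the lattice: one list of (word, am_score, lm_score) tuples per path."""
    def total_score(path):
        return sum(am + lm for _, am, lm in path)
    best = max(lattice_paths, key=total_score)
    return " ".join(word for word, _, _ in best if word)

paths = [
    [("hello", -12.3, -1.1), ("shenzhen", -20.4, -0.2)],
    [("hello", -12.3, -1.1), ("magic needle", -25.0, -2.7)],
]
print(best_word_sequence(paths))   # -> "hello shenzhen"
```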
In the embodiment of the application, for each word sequence, the acoustic model score and the language model score on each jump edge are summed to obtain the total score of each word sequence in the speech recognition grid lattice. And then obtaining the word sequence with the highest total score of the word sequence as a target word sequence. And taking the word information of the target word sequence as a voice recognition result. Therefore, the target word sequence can be accurately screened out through the score of the acoustic model and the score of the language model.
In one embodiment, the scene to be recognized includes at least one of a medical scene, an image processing scene, and an e-sports scene.
In a medical scene, most of the target named entities are professional terms, for example doctor, patient, blood pressure, heartbeat, and CT (computed tomography). In an image processing scene, the target named entities include resolution, chromatic aberration, backlight, and so on; in an e-sports scene, the target named entities include "eating chicken", MVP, and the like. Obviously, the target named entities differ greatly between application scenarios.
In the embodiment of the application, for these different application scenarios, including but not limited to medical, image processing, and e-sports scenes, the target named entities of each scenario are collected separately to form that scenario's target named entity text. A language model score is then assigned to the target named entity text, and the text with its language model score is combined with the acoustic model to obtain the sub-decoding network. In this way, targeted speech recognition can be performed on the target named entities of different application scenarios, improving the accuracy of the final speech recognition result.
In one embodiment, as shown in fig. 10, a speech recognition apparatus 1000 includes:
an acoustic feature extraction module 1020, configured to perform acoustic feature extraction on the voice data to be processed;
an acoustic model score calculation module 1040, configured to input the extracted acoustic features into an acoustic model, and calculate an acoustic model score of the acoustic features;
the decoding module 1060 is configured to decode the acoustic features and the acoustic model scores of the acoustic features to obtain a speech recognition result by using a main decoding network and a sub-decoding network; the main decoding network is a decoding graph obtained by training an original text training corpus, and the sub-decoding graph is a decoding graph obtained by training a target named entity in a scene to be recognized.
In one embodiment, as shown in fig. 11, there is also provided a speech recognition apparatus 1000, further comprising: the main decoding network generation module 1070 is configured to perform hollowing processing on named entities in the original text corpus to obtain a target text corpus; training a target text training corpus to obtain a language model; training a voice training corpus corresponding to the original text training corpus to obtain an acoustic model; and combining the language model and the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed positions in the target text training corpus.
In one embodiment, the main decoding network generating module 1070 is further configured to combine the language model and the acoustic model by using a composition algorithm to obtain the main decoding network.
In one embodiment, as shown in fig. 11, there is also provided a speech recognition apparatus 1000, further comprising: the sub-decoding network generation module 1080 is used for acquiring a target named entity in a scene to be identified to form a target named entity text; assigning a language model score to the target named entity text; and combining the target named entity text endowed with the language model score with the acoustic model to obtain the sub-decoding network.
In one embodiment, the decoding module 1060 further includes:
the voice recognition grid generating unit is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a voice recognition grid;
and the voice recognition result determining unit is used for obtaining a voice recognition result based on the voice recognition grid lattice.
In one embodiment, the speech recognition lattice comprises a plurality of word sequences, where each word sequence comprises nodes and jump edges and the jump edges carry the word information of the acoustic features. The voice recognition grid generating unit is further configured to sequentially acquire from the main decoding network the word sequences corresponding to the acoustic features in the speech data; to call the sub-decoding network if the word information on the jump edge of an intermediate node of the word sequence is empty, and acquire from it the word sequence corresponding to the next acoustic feature in the audio signal; to return to the main decoding network once a termination node of the word sequence is reached in the sub-decoding network, and continue acquiring from the main decoding network the word sequences of the following acoustic features until the word sequences of all acoustic features in the audio signal to be processed have been acquired in order; and to connect the word sequences of all the acoustic features in order to form the speech recognition lattice.
In one embodiment, the jump edge also carries a language model score for the acoustic feature; the voice recognition result determining unit is also used for acquiring the total score of each word sequence in the voice recognition grid lattice based on the acoustic model score of the acoustic characteristics and the language model score of the acoustic characteristics; acquiring a word sequence with the highest total score of the word sequence as a target word sequence; and acquiring word information of the target word sequence, and taking the word information in the target word sequence as a voice recognition result.
In one embodiment, the scene to be recognized comprises at least one of a medical scene, an image processing scene, and an e-sports scene.
The division of the modules in the speech recognition apparatus is only for illustration, and in other embodiments, the speech recognition apparatus may be divided into different modules as needed to complete all or part of the functions of the speech recognition apparatus.
Fig. 12 is a schematic diagram of the internal structure of the server in one embodiment. As shown in fig. 12, the server includes a processor and a memory connected by a system bus. The processor provides computing and control capability and supports the operation of the whole server. The memory may include a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the computer program can be executed by the processor to implement the speech recognition method provided in the foregoing embodiments. The internal memory provides a cached execution environment for the operating system and the computer program in the non-volatile storage medium. The server may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.
The implementation of each module in the speech recognition apparatus provided in the embodiments of the present application may be in the form of a computer program. The computer program may be run on a terminal or a server. The program modules constituted by the computer program may be stored on the memory of the terminal or the server. Which when executed by a processor, performs the steps of the method described in the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the speech recognition method.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform a speech recognition method.
Any reference to memory, storage, database, or other medium used by embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and bus dynamic RAM (RDRAM).
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (12)

1. A speech recognition method, comprising:
extracting acoustic features of the voice data to be processed;
inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
2. The method of claim 1, wherein the generation process of the primary decoding network comprises:
hollowing out the named entities in the original text training corpus to obtain a target text training corpus;
training the target text training corpus to obtain a language model;
training the voice training corpus corresponding to the original text training corpus to obtain an acoustic model;
and combining the language model and the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed positions in the target text training corpus.
3. The method of claim 2, wherein combining the language model with the acoustic model to obtain a primary decoding network comprises:
and combining the language model and the acoustic model by adopting a compound algorithm to obtain a main decoding network.
4. The method of claim 2, wherein the generation process of the sub-decoding network comprises:
acquiring a target named entity in a scene to be identified to form a target named entity text;
assigning a language model score to the target named entity text;
and combining the target named entity text endowed with the language model score with the acoustic model to obtain a sub-decoding network.
5. The method of claim 1, wherein the decoding the acoustic features and the acoustic model scores of the acoustic features using a main decoding network and a sub decoding network to obtain a speech recognition result comprises:
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting the main decoding network and the sub decoding networks to obtain a speech recognition grid lattice;
and obtaining a voice recognition result based on the voice recognition grid lattice.
6. The method of claim 5, wherein the speech recognition lattice comprises a plurality of word sequences, the word sequences comprising nodes and jump edges, the jump edges carrying word information of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting the main decoding network and the sub decoding network to obtain the speech recognition lattice comprises:
sequentially acquiring, from the main decoding network, word sequences corresponding to the acoustic features in the speech data to be processed;
if the word information on a jump edge of an intermediate node of a word sequence is empty, calling the sub decoding network and acquiring, from the sub decoding network, the word sequence corresponding to the next acoustic feature;
returning to the main decoding network after a termination node of the word sequence is reached in the sub decoding network, and continuing to acquire, from the main decoding network, the word sequences corresponding to subsequent acoustic features until the word sequences of all the acoustic features in the speech data to be processed have been acquired in sequence;
and connecting the word sequences of all the acoustic features in sequence to form the speech recognition lattice.
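By way of illustration and not limitation, the following toy walk-through shows only the main/sub switching behaviour described in claim 6, under strong simplifying assumptions: the acoustic evidence is assumed to be already segmented into word-level hypotheses, the main decoding network is reduced to a single word sequence in which None marks a jump edge with empty word information, and the sub decoding network is the entity table from the previous sketch. A real decoder searches weighted graphs frame by frame; none of the names below come from the disclosure.

```python
def decode_with_switching(word_hypotheses, main_sequence, sub_network):
    """Follow the main sequence; on an edge with empty word information, consume words from the sub network."""
    result, i = [], 0
    for edge_word in main_sequence:
        if edge_word is not None:
            # Stay in the main decoding network: this edge carries real word information.
            result.append(word_hypotheses[i])
            i += 1
            continue
        # Empty word information: switch to the sub decoding network and match the
        # longest entity that starts at the current position in the hypotheses.
        match = None
        for entity in sub_network:                # each entity is a tuple of words
            if tuple(word_hypotheses[i:i + len(entity)]) == entity:
                if match is None or len(entity) > len(match):
                    match = entity
        if match is None:
            raise ValueError("no entity in the sub decoding network matches here")
        # Termination node of the entity reached: append its words and return to the main network.
        result.extend(match)
        i += len(match)
    return result

if __name__ == "__main__":
    sub_net = {("doctor", "zhang"): -1.1, ("doctor", "li"): -1.1, ("nurse", "wang"): -1.1}
    main_seq = ["please", "transfer", "to", None]            # None = hollowed-out slot / empty edge
    hyps = ["please", "transfer", "to", "doctor", "zhang"]
    print(decode_with_switching(hyps, main_seq, sub_net))    # ['please', 'transfer', 'to', 'doctor', 'zhang']
```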
7. The method of claim 6, wherein the jump edges also carry language model scores of the acoustic features; and obtaining the speech recognition result based on the speech recognition lattice comprises:
acquiring a total score of each word sequence in the speech recognition lattice based on the acoustic model scores of the acoustic features and the language model scores of the acoustic features;
acquiring the word sequence with the highest total score as a target word sequence;
and acquiring word information of the target word sequence, and taking the word information in the target word sequence as the speech recognition result.
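As a non-limiting illustration of claim 7, the sketch below treats the lattice as a plain list of alternative paths, each edge carrying a word, an acoustic log-score, and a language-model log-score; the total score of a path is their (optionally weighted) sum, and the highest-scoring path yields the recognition result. The LM weight and the scores themselves are invented for the example.

```python
def best_path(lattice, lm_weight=1.0):
    """Return the word sequence whose summed acoustic + weighted language-model score is highest."""
    def total(path):
        return sum(am + lm_weight * lm for _, am, lm in path)
    winner = max(lattice, key=total)
    return [word for word, _, _ in winner], total(winner)

if __name__ == "__main__":
    # Each edge: (word, acoustic log-score, language-model log-score)
    lattice = [
        [("transfer", -2.1, -1.0), ("to", -1.2, -0.5), ("doctor", -1.9, -0.8), ("zhang", -2.0, -1.1)],
        [("transfer", -2.1, -1.0), ("two", -1.5, -2.3), ("doctor", -1.9, -0.8), ("zhang", -2.0, -1.1)],
    ]
    words, score = best_path(lattice)
    print(words, round(score, 2))   # the "to" path wins on total score
```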
8. The method of claim 4, wherein the scene to be recognized comprises at least one of a medical scene, an image processing scene, and a telestration scene.
9. A speech recognition apparatus, characterized in that the apparatus comprises:
the acoustic feature extraction module is used for extracting acoustic features of the speech data to be processed;
an acoustic model score calculation module, configured to input the extracted acoustic features into an acoustic model, and calculate an acoustic model score of the acoustic features;
the decoding module is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub decoding network is a decoding graph obtained by training on target named entities in a scene to be recognized.
10. The apparatus of claim 9, further comprising: a main decoding network generation module, configured to hollow out named entities in the original text training corpus to obtain a target text training corpus; train the target text training corpus to obtain a language model; train a speech training corpus corresponding to the original text training corpus to obtain an acoustic model; and combine the language model and the acoustic model to obtain the main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed-out positions in the target text training corpus.
11. A server comprising a memory and a processor, the memory having stored thereon a computer program, wherein the computer program, when executed by the processor, causes the processor to perform the steps of the speech recognition method according to any of claims 1 to 8.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 8.
CN202011607655.7A 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium Active CN112802461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607655.7A CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607655.7A CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112802461A true CN112802461A (en) 2021-05-14
CN112802461B CN112802461B (en) 2023-10-24

Family

ID=75804372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607655.7A Active CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112802461B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1499484A (en) * 2002-11-06 2004-05-26 北京天朗语音科技有限公司 Recognition system of Chinese continuous speech
EP1981020A1 (en) * 2007-04-12 2008-10-15 France Télécom Method and system for automatic speech recognition adapted for detecting utterances out of context
US20140278390A1 (en) * 2013-03-12 2014-09-18 International Business Machines Corporation Classifier-based system combination for spoken term detection
CN106294460A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 A kind of Chinese speech keyword retrieval method based on word and word Hybrid language model
US20180061396A1 (en) * 2016-08-24 2018-03-01 Knowles Electronics, Llc Methods and systems for keyword detection using keyword repetitions
CN108615525A (en) * 2016-12-09 2018-10-02 ***通信有限公司研究院 A kind of audio recognition method and device
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN110942763A (en) * 2018-09-20 2020-03-31 阿里巴巴集团控股有限公司 Voice recognition method and device
CN110634474A (en) * 2019-09-24 2019-12-31 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence
CN111128183A (en) * 2019-12-19 2020-05-08 北京搜狗科技发展有限公司 Speech recognition method, apparatus and medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium
CN113327597A (en) * 2021-06-23 2021-08-31 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113327597B (en) * 2021-06-23 2023-08-22 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN114898754A (en) * 2022-07-07 2022-08-12 北京百度网讯科技有限公司 Decoding graph generation method, decoding graph generation device, speech recognition method, speech recognition device, electronic equipment and storage medium
CN114898754B (en) * 2022-07-07 2022-09-30 北京百度网讯科技有限公司 Decoding image generation method, decoding image generation device, speech recognition method, speech recognition device, electronic device and storage medium

Also Published As

Publication number Publication date
CN112802461B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
Pepino et al. Emotion recognition from speech using wav2vec 2.0 embeddings
CN110287283B (en) Intention model training method, intention recognition method, device, equipment and medium
CN108831439B (en) Voice recognition method, device, equipment and system
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN108899013B (en) Voice search method and device and voice recognition system
KR102413693B1 (en) Speech recognition apparatus and method, Model generation apparatus and method for Speech recognition apparatus
JP4195428B2 (en) Speech recognition using multiple speech features
CN112712813B (en) Voice processing method, device, equipment and storage medium
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN112102815A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
CN112614510B (en) Audio quality assessment method and device
CN112270184B (en) Natural language processing method, device and storage medium
CN113450757A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
TWI752406B (en) Speech recognition method, speech recognition device, electronic equipment, computer-readable storage medium and computer program product
CN112151020B (en) Speech recognition method, device, electronic equipment and storage medium
CN113506586B (en) Method and system for identifying emotion of user
WO2021229643A1 (en) Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
CN116469375A (en) End-to-end speech synthesis method, device, equipment and medium
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
CN115762574A (en) Voice-based action generation method and device, electronic equipment and storage medium
Deng et al. History utterance embedding transformer lm for speech recognition
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant