CN112802461A - Speech recognition method and device, server, computer-readable storage medium


Info

Publication number: CN112802461A (application CN202011607655.7A); granted publication: CN112802461B
Authority: CN (China)
Prior art keywords: acoustic, decoding network, decoding, sub, speech recognition
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112802461B
Inventors: 周维聪, 袁丁, 赵金昊, 刘云峰
Original and current assignee: Shenzhen Zhuiyi Technology Co Ltd
Events: application filed by Shenzhen Zhuiyi Technology Co Ltd with priority to CN202011607655.7A; publication of CN112802461A; application granted; publication of CN112802461B

Classifications

    • G - Physics
    • G10L - Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 15/142 - Speech recognition; speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/02 - Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech-to-text systems
    • G10L 25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum


Abstract

The application relates to a speech recognition method and apparatus, a server, and a computer-readable storage medium. The method includes: extracting acoustic features from the speech data to be processed, inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score for the acoustic features; then decoding the acoustic features and their acoustic model scores with a main decoding network and a sub-decoding network to obtain a speech recognition result. In this speech recognition method, no decoding network is retrained for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are then used together for decoding. As a result, the target named entities in the scene to be recognized can be decoded accurately on the basis of the sub-decoding network, and because no decoding network is retrained for the scene, the training time is greatly shortened and the efficiency of speech recognition is improved.

Description

Speech recognition method and device, server, computer readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a speech recognition method and apparatus, a server, and a computer-readable storage medium.
Background
With the continuous development of artificial intelligence and natural language processing technology, speech recognition technology has also developed rapidly. Speech recognition can automatically convert an audio signal into corresponding text or commands, and conventional speech recognition performs well in common, everyday recognition scenarios.
However, when it is applied to a professional scenario containing a large number of specialized terms, the recognition quality drops. Retraining a dedicated decoding network for such a scenario would address this, but the workload of retraining the decoding graph is large and the training time is long, so it cannot be done quickly.
Disclosure of Invention
The embodiments of the application provide a speech recognition method and apparatus, a server, and a computer-readable storage medium that reduce the workload of retraining a decoding graph, greatly shorten the training time, and improve the efficiency of speech recognition in a specific application scenario.
A speech recognition method comprising:
extracting acoustic features of the voice data to be processed;
inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
A speech recognition apparatus, the apparatus comprising:
the acoustic feature extraction module is used for extracting acoustic features of the voice data to be processed;
an acoustic model score calculation module, configured to input the extracted acoustic features into an acoustic model, and calculate an acoustic model score of the acoustic features;
the decoding module is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
A server comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the above method.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as above.
The speech recognition method and apparatus, the server, and the computer-readable storage medium extract acoustic features from the speech data to be processed, input the extracted acoustic features into an acoustic model, and calculate an acoustic model score for the acoustic features. The acoustic features and their acoustic model scores are then decoded with a main decoding network and a sub-decoding network to obtain a speech recognition result, where the sub-decoding network is a decoding network obtained by training on the target named entities in the scene to be recognized. Because no decoding network is retrained for the scene to be recognized, and only the target named entities in that scene are trained to obtain the sub-decoding network, the target named entities can be decoded accurately on the basis of the sub-decoding network while the training time is greatly shortened and the efficiency of speech recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating an exemplary implementation of a speech recognition method;
FIG. 2 is a flow diagram of a method of speech recognition in one embodiment;
FIG. 3 is a flow diagram of a method for generating a primary decoding network in one embodiment;
FIG. 4 is a schematic diagram of a portion of a primary decoding network in one embodiment;
FIG. 5 is a flow diagram of a method for generating a sub-decoding network in one embodiment;
FIG. 6 is a diagram illustrating the structure of a speech recognition lattice according to an embodiment;
FIG. 7 is a flowchart of a method for decoding with a main decoding network and a sub-decoding network to obtain a speech recognition lattice in one embodiment;
FIG. 8 is a diagram illustrating an embodiment of decoding with a main decoding network and a sub-decoding network to obtain a speech recognition lattice;
FIG. 9 is a flowchart of a method for obtaining speech recognition results based on speech recognition lattice in one embodiment;
FIG. 10 is a block diagram showing the structure of a speech recognition apparatus according to an embodiment;
FIG. 11 is a block diagram showing the construction of a speech recognition apparatus according to another embodiment;
FIG. 12 is a schematic diagram of the internal structure of the server in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
Fig. 1 is a diagram illustrating an application scenario of the speech recognition method according to an embodiment. As shown in fig. 1, the application environment includes a terminal 120 and a server 140, which are connected through a network. The server 140 uses the speech recognition method of the application to extract acoustic features from the speech data to be processed, inputs the extracted acoustic features into an acoustic model, calculates the acoustic model score of the acoustic features, and decodes the acoustic features and their acoustic model scores with a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized. The terminal 120 may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an in-vehicle computer, a wearable device, or a smart home device.
Fig. 2 is a flowchart of a speech recognition method in an embodiment. As shown in fig. 2, a speech recognition method is provided and applied to a server, and includes steps 220 to 260.
step 220, extracting acoustic features of the voice data to be processed.
The speech data may be an acquired audio signal, for example an audio signal captured in a voice input scenario, an intelligent chat scenario, or a speech translation scenario. Acoustic features are extracted from the speech data to be processed: a feature extraction algorithm converts the acquired one-dimensional audio signal into a sequence of high-dimensional vectors, and these vectors are the acoustic features. Common acoustic features include MFCC, Fbank, and i-vector features, which the application does not limit. Fbank (filter bank) is a front-end processing algorithm that processes audio in a way similar to the human ear and can improve the performance of speech recognition. The usual steps for obtaining Fbank features from a speech signal are pre-emphasis, framing, windowing, short-time Fourier transform (STFT), Mel filtering, and mean normalization. MFCC features can then be obtained by applying a discrete cosine transform (DCT) to the Fbank features.
Mel-frequency cepstral coefficients (MFCCs) are extracted based on the auditory characteristics of the human ear, so the Mel frequency and the Hz frequency form a nonlinear correspondence, and MFCCs are the spectral features calculated from this nonlinear mapping. MFCCs are mainly used for feature extraction from speech data and for reducing the dimensionality of the computation; for example, for a frame with 512 sample points, the most informative 40 dimensions can be retained after MFCC extraction, achieving the purpose of dimensionality reduction. The i-vector, in turn, is a feature vector that characterizes each speaker.
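For illustration only (this code is not part of the patent), the sketch below shows one common way to compute Fbank and MFCC features in Python with the librosa library; the sample rate, window sizes, and filter counts are assumed values for a typical 16 kHz setup.

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000):
    # Load the one-dimensional audio signal
    y, sr = librosa.load(wav_path, sr=sr)
    # Pre-emphasis
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing, windowing, STFT and Mel filtering are handled inside
    # melspectrogram; 25 ms windows with a 10 ms hop at 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, win_length=400, hop_length=160, n_mels=40)
    fbank = librosa.power_to_db(mel)               # log-Mel (Fbank) features
    fbank -= fbank.mean(axis=1, keepdims=True)     # mean normalization
    # MFCC: a discrete cosine transform (DCT) applied to the log-Mel spectrum
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=13)
    return fbank.T, mfcc.T                         # (frames, dims)
```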
Step 240, inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features.
Specifically, the acoustic model may include a neural network model and a hidden Markov model. The neural network model provides acoustic modeling units to the hidden Markov model, and the granularity of these units may be words, syllables, phonemes, or states; a state is the mathematical characterization of a stage of a Markov process. The hidden Markov model then determines the phoneme sequence from the acoustic modeling units provided by the neural network model. The acoustic model is trained in advance on an audio training corpus.
The extracted acoustic features are input into the acoustic model, and an acoustic model score is calculated for them. The acoustic model score can be regarded as a score computed from the probability of each phoneme occurring given each acoustic feature.
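As a rough, hypothetical illustration of what "calculating an acoustic model score" can mean in practice (the patent does not prescribe an implementation), the following sketch uses a small feed-forward network to produce per-frame log-probabilities over acoustic units and sums them along a candidate alignment.

```python
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Maps frame-level acoustic features to log-probabilities over
    acoustic modeling units (e.g. phonemes or HMM states)."""
    def __init__(self, feat_dim: int = 40, num_units: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_units),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_frames, feat_dim) -> (num_frames, num_units)
        return torch.log_softmax(self.net(feats), dim=-1)

model = ToyAcousticModel()
feats = torch.randn(200, 40)                 # 200 frames of Fbank features
log_probs = model(feats)                     # per-frame unit log-probabilities
# The acoustic model score of a hypothesized frame-level unit sequence is the
# sum of the log-probabilities of those units over the frames.
unit_ids = torch.randint(0, 128, (200,))     # a candidate alignment
am_score = log_probs[torch.arange(200), unit_ids].sum()
```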
Step 260, decoding the acoustic features and the acoustic model scores of the acoustic features with a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
The decoding network is used to find the best decoding path given the phoneme sequence, from which the speech recognition result is obtained. In the embodiment of the application, the decoding networks used include a main decoding network and a sub-decoding network: the main decoding network decodes the phoneme sequences of the corpus other than the target named entities, while the sub-decoding network decodes the phoneme sequences of the target named entities. Decoding the acoustic features and their acoustic model scores with both networks therefore yields the speech recognition result.
The target named entities in the scene to be recognized include named entities whose speech recognition error rate exceeds a preset error rate threshold, as well as manually specified professional vocabulary of that scene.
In the embodiment of the application, to address the low accuracy of speech recognition in a specific application scenario, a speech recognition method is provided: acoustic features are extracted from the speech data to be processed, the extracted acoustic features are input into an acoustic model, and the acoustic model score of the acoustic features is calculated; the acoustic features and their acoustic model scores are then decoded with a main decoding network and a sub-decoding network to obtain a speech recognition result. In this method, no decoding network is retrained for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are used together for decoding. The target named entities can therefore be decoded accurately on the basis of the sub-decoding network, and because no decoding network is retrained for the scene, the training time is greatly shortened and the efficiency of speech recognition is improved.
In one embodiment, as shown in fig. 3, the generation process of the main decoding network includes:
and 320, hollowing out the named entities in the original text training corpus to obtain the target text training corpus.
The obtaining of the original text corpus may be obtaining the original text corpus from a corpus. The original text corpus approximately contains 700-. The language material base is used for storing the language material which is actually appeared in the practical use of the language, and the language material base is used for bearing the basic resource of language knowledge by taking an electronic computer as a carrier. Typically, the actual corpus needs to be processed (e.g., analyzed and processed) to become a useful resource. Named entity (named entity), as the name implies, named entity is the name of a person, organization, place, and all other entities identified by name, and the broader entities include numbers, dates, currencies, addresses, and the like.
Because named entities in different application scenarios are greatly different and may not be corpora contained in the original text corpus. Therefore, in order to improve the accuracy of speech recognition in a specific application scenario, a named entity in an original text corpus may be hollowed out, and the hollowed-out position is represented by a hollow node. Thus, the corpus not containing the named entities is left, and the corpus not containing the named entities constitutes the target text corpus.
Step 340, training the target text training corpus to obtain a language model.
After the target text training corpus is obtained, it is used to train a language model. The language model may be trained with a recurrent neural network, in which case it is also called an RNNLM (Recurrent Neural Network Based Language Model). Besides the currently input word, a recurrent-network language model also takes into account the words input before it, and can calculate the probability of the next word from the long history they form, so it has a comparatively good "memory effect". For example, after "my" and "mood", both "good" and "bad" may appear, and which word appears depends on the earlier occurrence of "my" and "mood"; this dependence is the "memory effect".
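A minimal PyTorch sketch of such a recurrent-network language model is shown below; it is an illustrative assumption about one possible RNNLM structure, not the patent's implementation, and the vocabulary size and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal recurrent-network language model: predicts the next word from
    the words seen so far (the "memory effect" described above)."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(self.embed(word_ids))    # (batch, seq, hidden)
        return self.out(hidden)                       # logits over the next word

# Training-step sketch: shift the corpus by one position so the model learns
# P(next word | history) on the hollowed-out target text training corpus.
vocab_size = 10000
model = RNNLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, vocab_size, (8, 32))        # stand-in for tokenized text
logits = model(tokens[:, :-1])
loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```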
Step 360, training the speech training corpus corresponding to the original text training corpus to obtain the acoustic model.
A speech training corpus corresponding to the original text training corpus is then acquired; each text corpus entry in the original text training corpus has a corresponding speech corpus entry in the speech training corpus.
The acoustic model uses a deep neural network to model the mapping between acoustic pronunciations and basic acoustic units (typically phonemes), where a phoneme is the smallest unit of speech divided according to the natural attributes of speech. The acoustic model receives the input acoustic features and outputs the phoneme sequence corresponding to them. Acoustic features are extracted from the speech corpora in the speech database, and the acoustic model is trained on these extracted features.
Step 380, combining the language model with the acoustic model to obtain the main decoding network, where the main decoding network contains empty nodes corresponding to the hollowed-out positions in the target text training corpus.
The named entities in the original text training corpus are hollowed out to obtain the target text training corpus, which is trained to obtain the language model, and the speech training corpus corresponding to the original text training corpus is trained to obtain the acoustic model. Combining the language model with the acoustic model then yields the main decoding network. The main decoding network contains nodes corresponding to the training corpus other than the named entities; the named entities themselves are represented by empty nodes, which correspond to the hollowed-out positions, i.e., the positions of the named entities, in the target text training corpus.
Fig. 4 is a schematic diagram of part of a main decoding network in one embodiment. A word sequence consists of nodes and jump edges, where the nodes include a start node, intermediate nodes, and a termination node. As shown in fig. 4, node 1 is the start node, nodes 2, 3, 4, and 5 are intermediate nodes, and node 6 is the termination node. Jump edges connect the nodes from the start node to the termination node, and each jump edge carries word information and phoneme information. The jump edge between node 1 and node 2 carries the word information "hello" and the phoneme information "n"; the jump edge between node 2 and node 3 carries blank word information and the phoneme information "i"; the jump edge between node 3 and node 4 carries blank word information and the phoneme information "h"; and the jump edge between node 4 and node 5 carries blank word information and the phoneme information "ao". Node 5 is the empty node obtained by hollowing out the named entity, so the jump edge between node 5 and node 6 carries neither word information nor phoneme information.
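Purely as an illustration of the structure just described (nodes, jump edges carrying word and phoneme information, and an empty node), the following sketch encodes the Fig. 4 fragment in simple Python data classes; the class and field names are assumptions for illustration, not terminology from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    """A jump edge carrying phoneme (input) and word (output) information;
    empty strings stand for the blank labels described for Fig. 4."""
    src: int
    dst: int
    phoneme: str = ""      # e.g. "n", "i", "h", "ao"
    word: str = ""         # e.g. "hello"; "" for blank word information
    weight: float = 0.0    # language model contribution

@dataclass
class DecodingGraph:
    arcs: list = field(default_factory=list)
    start: int = 1
    final: int = 6
    empty_nodes: set = field(default_factory=set)   # hollowed-out positions

# The fragment of the main decoding network shown in Fig. 4:
main_graph = DecodingGraph(
    arcs=[Arc(1, 2, "n", "hello"), Arc(2, 3, "i"), Arc(3, 4, "h"),
          Arc(4, 5, "ao"), Arc(5, 6)],   # the edge leaving node 5 carries no labels
    empty_nodes={5},
)
```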
In the embodiment of the application, the named entities in the original text training corpus are hollowed out to obtain the target text training corpus, which is trained to obtain the language model, while the speech training corpus corresponding to the original text training corpus is trained to obtain the acoustic model. Combining the language model with the acoustic model yields the main decoding network, which contains empty nodes corresponding to the hollowed-out positions in the target text training corpus. Because the named entities are hollowed out, this main decoding network can be combined with a sub-decoding network trained on the target named entities of any specific scene, so it is suitable for speech recognition in any specific scene and improves both the accuracy and the efficiency of recognition there.
In one embodiment, combining the language model with the acoustic model to obtain a main decoding network comprises:
and combining the language model and the acoustic model by adopting a compound algorithm to obtain a main decoding network.
In the embodiment of the present application, the output tag on a certain branch of the first WFST can be equal to the input tag on a certain branch of the second WFST through the composition algorithm, and then label and weight on the branches are respectively operated. The specific implementation code of the composition algorithm is not described in detail in this application.
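To make the label-matching idea concrete, here is a deliberately simplified sketch of composition on toy arc lists; it is an assumption-laden illustration and not the composition routine of any real WFST toolkit.

```python
def compose(fst_a, fst_b):
    """Toy composition sketch: a composed arc exists wherever the output label
    of an arc in fst_a equals the input label of an arc in fst_b, and the
    weights of the two arcs are combined (here simply added).
    Arcs are (src, dst, input_label, output_label, weight) tuples; epsilon
    handling and state bookkeeping of a real WFST toolkit are omitted."""
    composed = []
    for sa, da, ia, oa, wa in fst_a:
        for sb, db, ib, ob, wb in fst_b:
            if oa == ib:
                composed.append(((sa, sb), (da, db), ia, ob, wa + wb))
    return composed

# An arc mapping a phoneme sequence to the word "hello" (acoustic/lexicon side),
# composed with an arc consuming "hello" on the language model side:
A = [(0, 1, "n i h ao", "hello", 0.5)]
B = [(0, 1, "hello", "hello", 0.1)]
print(compose(A, B))   # [((0, 0), (1, 1), 'n i h ao', 'hello', 0.6)]
```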
In one embodiment, as shown in fig. 5, the generation process of the sub-decoding network includes:
and step 520, collecting the target named entities in the scene to be identified to form a target named entity text.
The target named entities in the scene to be recognized comprise named entities, wherein some voice recognition error rates specified in the scene to be recognized by a person exceed a preset error rate threshold. In addition, professional vocabularies in a scene to be recognized can be used as the target named entity text, for example, for a medical scene, professional vocabularies such as doctors, patients, blood pressure, heartbeat, and ct (computed tomogry) can be used as the target named entity text. And for the electric competition scene, professional vocabularies such as chicken eating and MVP are used as the target named entity text. Obviously, the target named entities in different application scenarios vary widely.
And respectively collecting target named entities in different application scenes to form target named entity texts in all application scenes.
Step 540, assigning a language model score to the target named entity text.
The language model may be trained with a recurrent neural network, in which case it is also called an RNNLM (Recurrent Neural Network Based Language Model). As described above, such a recurrent-network language model takes the previously input words into account in addition to the current word and can calculate the probability of the next word from that history, giving it a comparatively good "memory effect".
The target named entity text can be assigned a language model score manually; in general, giving the target named entity text a relatively high language model score improves the recognition accuracy of the target named entities. Here, a relatively high score means one that exceeds a preset score threshold, for example a threshold of 0.9.
Step 560, combining the target named entity text that has been assigned the language model score with the acoustic model to obtain the sub-decoding network.
A composition algorithm is used to combine the target named entity text, together with its assigned language model score, with the acoustic model, yielding the sub-decoding network.
In the embodiment of the application, the target named entities in the scene to be recognized are collected to form a target named entity text, the text is assigned a language model score, and the text with its language model score is combined with the acoustic model to obtain the sub-decoding network. The sub-decoding network can be inserted at the empty nodes of the main decoding network, so that the speech data can be recognized accurately and completely by the main decoding network and the sub-decoding network together.
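As a hedged illustration of what the resulting sub-decoding network might look like for a single target named entity, the sketch below builds a small linear graph using the Arc and DecodingGraph classes from the earlier sketch; the pronunciation lexicon, the score value 0.9, and the graph layout are all assumptions made for illustration.

```python
# Reuses the illustrative Arc and DecodingGraph classes from the earlier sketch.
LEXICON = {"shenzhen": ["sh", "en", "zh", "en"]}   # assumed pronunciation lexicon
LM_BOOST = 0.9                                     # manually assigned LM score (step 540)

def build_sub_graph(entity: str) -> DecodingGraph:
    """Builds a linear sub-decoding graph for one target named entity: one jump
    edge per phoneme, with the entity word and its boosted language model score
    attached to the first edge."""
    phonemes = LEXICON[entity]
    arcs = [Arc(i, i + 1,
                phoneme=ph,
                word=entity if i == 0 else "",
                weight=LM_BOOST if i == 0 else 0.0)
            for i, ph in enumerate(phonemes)]
    return DecodingGraph(arcs=arcs, start=0, final=len(phonemes))

sub_graph = build_sub_graph("shenzhen")
```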
In one embodiment, the decoding the acoustic features and the acoustic model scores of the acoustic features to obtain a speech recognition result by using a main decoding network and a sub-decoding network includes:
decoding the acoustic characteristics and the acoustic model scores of the acoustic characteristics by adopting a main decoding network and a sub decoding network to obtain a speech recognition grid lattice;
and obtaining a voice recognition result based on the voice recognition grid lattice.
Specifically, the main decoding network and the sub-decoding network decode the acoustic features and their acoustic model scores to obtain the speech recognition lattice: the main decoding network decodes the acoustic features other than those of the target named entities, and the sub-decoding network decodes the acoustic features corresponding to the target named entities.
The speech recognition lattice contains multiple candidate word sequences. A candidate word sequence comprises several words and several paths; the lattice is essentially a directed acyclic graph in which each node represents the ending time point of a word and each jump edge represents a possible word together with the acoustic model score and language model score of that word. When the speech recognition result is expressed, each node stores the recognition result at the current position, including information such as the acoustic probability and the language probability.
FIG. 6 is a diagram illustrating a structure of a speech recognition lattice according to an embodiment. Different word sequences can be obtained by walking from the leftmost initial node to the final node along different arcs, and the probabilities stored on the arcs are combined to represent the probability (score) that a certain segment of characters is obtained by input voice. For example, as shown in fig. 6, "hello shenzhen", "hello beijing", and "hello background" can be regarded as a path of the speech recognition result, i.e., "hello shenzhen", "hello beijing", and "hello background" are word sequences, and these word sequences constitute the speech recognition lattice. And each path in the graph corresponds to a probability, and the score of each path can be calculated according to the probability.
The resulting speech recognition lattice is usually rather large, so it can be pruned. One pruning method is to score the lattice in the forward and backward directions, compute the posterior probability of each jump edge, and delete the jump edges whose posterior probability is low. After this pruning, the lattice is simplified while its important information is still retained.
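The forward-backward pruning idea can be sketched as follows; this is an illustrative simplification that uses plain probabilities rather than the log-domain acoustic and language scores a real decoder would use.

```python
def prune_lattice(nodes, arcs, start, end, threshold=1e-4):
    """Forward-backward posterior pruning sketch. `arcs` is a topologically
    sorted list of (src, dst, prob) tuples."""
    forward = {n: 0.0 for n in nodes}
    forward[start] = 1.0
    for src, dst, p in arcs:                     # forward pass
        forward[dst] += forward[src] * p
    backward = {n: 0.0 for n in nodes}
    backward[end] = 1.0
    for src, dst, p in reversed(arcs):           # backward pass
        backward[src] += p * backward[dst]
    total = forward[end]
    # Keep only the jump edges whose posterior probability is high enough
    return [(s, d, p) for (s, d, p) in arcs
            if forward[s] * p * backward[d] / total >= threshold]
```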
From the pruned speech recognition lattice, a preset number of word sequences with the highest scores are extracted as candidate word sequences, and a target word sequence is then selected from the candidates as the speech recognition result. The number of target word sequences is generally one.
In the embodiment of the application, the main decoding network decodes the acoustic features other than those of the target named entities, and the sub-decoding network decodes the acoustic features corresponding to the target named entities, yielding the speech recognition lattice. Because the lattice contains multiple candidate word sequences, it is first pruned and then filtered, and finally a target word sequence is selected as the speech recognition result. The target named entities in the scene to be recognized can be decoded accurately on the basis of the sub-decoding network, and pruning the lattice produced by the main and sub decoding networks before selecting the target word sequence improves both the efficiency and the accuracy of speech recognition.
In one embodiment, the speech recognition lattice comprises a plurality of word sequences, wherein each word sequence comprises a node and a skip edge, and the skip edge carries word information of acoustic characteristics;
as shown in fig. 7, the method for decoding the acoustic features and the acoustic model scores of the acoustic features by using the main decoding network and the sub decoding network to obtain the speech recognition lattice includes:
and 720, sequentially acquiring the word sequences corresponding to the acoustic features in the voice data from the main decoding network.
Fig. 8 is a schematic diagram of a decoding process in an embodiment. And extracting acoustic features of the voice data to be processed. The extracted acoustic features are input into an acoustic model, and an acoustic model score of the acoustic features is calculated. The acoustic features include phonemes, but the present application is not limited thereto. For example, a segment of audio signal is received, and the acoustic features sequentially extracted from the segment of audio signal are eight phonemes n, i, h, ao, sh, en, zh, and en, and the word sequences corresponding to the eight phonemes are sequentially obtained from the main decoding network.
As shown in fig. 8, the word sequences corresponding to the eight phonemes are obtained from the main decoding network in the process of sequentially acquiring the word sequences, where node 1 is a start node, nodes 2, 3, 4, and 5 are intermediate nodes, and node 6 is a stop node. A jump edge is connected between the starting node and the terminating node, and word information and phoneme information are carried on the jump edge. Wherein, the skip edge between the node 1 and the node 2 carries word information: hello; the phoneme information is carried as follows: n is the same as the formula (I). The skip edge between node 2 and node 3 carries word information: blank; the phoneme information is carried as follows: i. the skip edge between node 3 and node 4 carries word information: blank; the phoneme information is carried as follows: h. the skip edge between node 4 and node 5 carries word information: blank; the phoneme information is carried as follows: ao (a). The node 5 is a hollow node obtained by hollowing out the named entity, and therefore, the skip edge between the node 5 and the node 6 does not carry word information nor phoneme information.
Step 740, if the word information on the jump edge of an intermediate node of the word sequence is empty, calling the sub-decoding network and acquiring from it the word sequence corresponding to the next acoustic feature in the audio signal.
While the word sequences corresponding to the acoustic features are being acquired in turn from the main decoding network, word information being empty on the jump edge of an intermediate node means that the named entity at that node has been hollowed out. Therefore, when this happens, the sub-decoding network is called, and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from it.
As shown in fig. 8, the jump edge leaving node 5 carries neither word information nor phoneme information, i.e., the word information on that jump edge is empty. At this point the sub-decoding network is called, and the next acoustic feature sh in the audio signal, followed by en, zh, and en, are matched there to obtain the corresponding word sequences.
Step 760, returning to the main decoding network once a termination node of the word sequence is reached in the sub-decoding network, and continuing to acquire from the main decoding network the word sequences corresponding to the subsequent acoustic features in the audio signal, until the word sequences of all acoustic features in the audio signal to be processed have been acquired in order.
Reaching a termination node of the word sequence in the sub-decoding network means that a word sequence has been completely recognized. For example, as can be seen from fig. 8, the word sequence completely recognized in the sub-decoding network from the phonemes sh, en, zh, and en is "shenzhen", at which point the termination node of that word sequence has been reached; the sub-decoding network may of course also produce several alternative word sequences for these phonemes, such as "magic needle" and "nystag". Decoding then returns to the termination node 6 of the main decoding network, and the word sequences of the following acoustic features are again acquired from the main decoding network until all acoustic features in the audio signal to be processed have been handled. For example, if the audio signal "hello shenzhen" is followed by the audio signal "I like you", the next acoustic feature wo is acquired from the main decoding network, followed by the word sequences for x, i, h, u, an, n, and i, until the word sequences of all acoustic features in the audio signal have been acquired in order.
Step 780, connecting the word sequences of all the acoustic features in order to form the speech recognition lattice.
The word sequences of all acoustic features in the speech data are connected in order to form the speech recognition lattice. For example, connecting the word sequences of the speech data in fig. 8 in order yields several candidate word sequences, such as "hello Shenzhen" and "hello magic needle", and these candidate word sequences make up the speech recognition lattice.
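The two-level traversal of steps 720 to 780 can be sketched, in a greatly simplified single-path form, as follows. The `match(node, phoneme)` helper and the `resume_node` attribute are hypothetical conveniences assumed for illustration (they are not defined in the earlier data-class sketch), and real decoding would keep multiple hypotheses to build the lattice rather than a single path.

```python
def decode(phonemes, main_graph, sub_graph):
    """Walk the main decoding network phoneme by phoneme; on reaching an empty
    node (a hollowed-out named entity position), switch into the sub-decoding
    network; once its termination node is reached, return to the main decoding
    network and continue."""
    words = []
    graph, node = main_graph, main_graph.start
    i = 0
    while i < len(phonemes):
        if graph is main_graph and node in main_graph.empty_nodes:
            # the outgoing jump edge carries no word/phoneme information:
            # call the sub-decoding network for the target named entity
            graph, node = sub_graph, sub_graph.start
        arc = graph.match(node, phonemes[i])   # hypothetical arc lookup by phoneme
        if arc.word:
            words.append(arc.word)
        node, i = arc.dst, i + 1
        if graph is sub_graph and node == sub_graph.final:
            # entity fully recognized: return to the main decoding network
            graph, node = main_graph, main_graph.resume_node
    return words

# For the Fig. 8 example, decode(["n", "i", "h", "ao", "sh", "en", "zh", "en"], ...)
# would yield ["hello", "shenzhen"].
```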
In the embodiment of the application, the word sequences corresponding to the acoustic features in the speech data are acquired in turn from the main decoding network. If the word information on the jump edge of an intermediate node of the word sequence is empty, the sub-decoding network is called and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from it; once a termination node of the word sequence is reached in the sub-decoding network, decoding returns to the main decoding network, which continues to supply the word sequences of the following acoustic features until the word sequences of all acoustic features in the audio signal to be processed have been acquired in order. Finally, the word sequences of all the acoustic features are connected in order to form the speech recognition lattice.
In other words, during decoding with the main decoding network, the sub-decoding network is called whenever the word information on the jump edge of an intermediate node is empty, and the speech recognition lattice is built from the candidate word sequences decoded by the main and sub decoding networks together. The target named entities are thus recognized accurately through the sub-decoding network.
In one embodiment, as shown in FIG. 9, the jump edge also carries the language model score of the acoustic feature; obtaining a voice recognition result based on the voice recognition grid lattice, comprising:
step 920, acquiring the total score of each word sequence in the speech recognition grid lattice based on the acoustic model score of the acoustic characteristics and the language model score of the acoustic characteristics;
step 940, the word sequence with the highest total score of the word sequences is obtained and used as the target word sequence;
step 960, obtaining word information of the target word sequence, and using the word information in the target word sequence as a voice recognition result.
The speech recognition lattice contains multiple candidate word sequences. A word sequence comprises several words and several paths; the lattice is essentially a directed acyclic graph in which each node represents the ending time point of a word and each jump edge represents a possible word together with the acoustic model score and language model score of that word.
For each word sequence, the acoustic model scores and language model scores on its jump edges are summed to obtain the total score of that word sequence in the speech recognition lattice. The word sequence with the highest total score is then taken as the target word sequence, and its word information is used as the speech recognition result.
In combination with fig. 8, if the word sequence "hello shenzhen" has the highest total score, its word information, namely "hello shenzhen", is taken as the speech recognition result.
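A minimal sketch of this selection step follows, under the assumption that the lattice has already been flattened into explicit candidate paths; the scores are made-up log-domain numbers used only for illustration.

```python
def best_word_sequence(lattice_paths):
    """Sketch of steps 920-960: each candidate word sequence is scored by
    summing the acoustic model and language model scores carried on its jump
    edges, and the word information of the highest-scoring sequence is returned
    as the recognition result. `lattice_paths` is an illustrative flattened view
    of the lattice: one list of (word, am_score, lm_score) tuples per path."""
    def total_score(path):
        return sum(am + lm for _, am, lm in path)
    best = max(lattice_paths, key=total_score)
    return " ".join(word for word, _, _ in best if word)

paths = [
    [("hello", -12.3, -1.1), ("shenzhen", -20.4, -0.2)],
    [("hello", -12.3, -1.1), ("magic needle", -25.0, -2.7)],
]
print(best_word_sequence(paths))   # -> "hello shenzhen"
```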
In the embodiment of the application, for each word sequence, the acoustic model score and the language model score on each jump edge are summed to obtain the total score of each word sequence in the speech recognition grid lattice. And then obtaining the word sequence with the highest total score of the word sequence as a target word sequence. And taking the word information of the target word sequence as a voice recognition result. Therefore, the target word sequence can be accurately screened out through the score of the acoustic model and the score of the language model.
In one embodiment, the scene to be recognized includes at least one of a medical scene, an image processing scene, and an e-sports scene.
In a medical scene, most of the target named entities are professional terms, for example doctor, patient, blood pressure, heartbeat, and CT (computed tomography). In an image processing scene, the target named entities include resolution, chromatic aberration, backlight, and so on; in an e-sports scene, the target named entities include "eating chicken", MVP, and the like. Obviously, the target named entities differ greatly between application scenarios.
In the embodiment of the application, for these different application scenarios, including but not limited to medical, image processing, and e-sports scenes, the target named entities of each scenario are collected separately to form that scenario's target named entity text. A language model score is then assigned to the target named entity text, and the text with its language model score is combined with the acoustic model to obtain the sub-decoding network. In this way, targeted speech recognition can be performed on the target named entities of different application scenarios, improving the accuracy of the final speech recognition result.
In one embodiment, as shown in fig. 10, a speech recognition apparatus 1000 includes:
an acoustic feature extraction module 1020, configured to perform acoustic feature extraction on the voice data to be processed;
an acoustic model score calculation module 1040, configured to input the extracted acoustic features into an acoustic model, and calculate an acoustic model score of the acoustic features;
the decoding module 1060 is configured to decode the acoustic features and the acoustic model scores of the acoustic features to obtain a speech recognition result by using a main decoding network and a sub-decoding network; the main decoding network is a decoding graph obtained by training an original text training corpus, and the sub-decoding graph is a decoding graph obtained by training a target named entity in a scene to be recognized.
In one embodiment, as shown in fig. 11, there is also provided a speech recognition apparatus 1000, further comprising: the main decoding network generation module 1070 is configured to perform hollowing processing on named entities in the original text corpus to obtain a target text corpus; training a target text training corpus to obtain a language model; training a voice training corpus corresponding to the original text training corpus to obtain an acoustic model; and combining the language model and the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed positions in the target text training corpus.
In one embodiment, the main decoding network generating module 1070 is further configured to combine the language model and the acoustic model by using a composition algorithm to obtain the main decoding network.
In one embodiment, as shown in fig. 11, there is also provided a speech recognition apparatus 1000, further comprising: the sub-decoding network generation module 1080 is used for acquiring a target named entity in a scene to be identified to form a target named entity text; assigning a language model score to the target named entity text; and combining the target named entity text endowed with the language model score with the acoustic model to obtain the sub-decoding network.
In one embodiment, the decoding module 1060 further includes:
the voice recognition grid generating unit is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a voice recognition grid;
and the voice recognition result determining unit is used for obtaining a voice recognition result based on the voice recognition grid lattice.
In one embodiment, the speech recognition lattice comprises a plurality of word sequences, where each word sequence comprises nodes and jump edges and the jump edges carry the word information of the acoustic features. The voice recognition grid generating unit is further configured to sequentially acquire from the main decoding network the word sequences corresponding to the acoustic features in the speech data; to call the sub-decoding network if the word information on the jump edge of an intermediate node of the word sequence is empty, and acquire from it the word sequence corresponding to the next acoustic feature in the audio signal; to return to the main decoding network once a termination node of the word sequence is reached in the sub-decoding network, and continue acquiring from the main decoding network the word sequences of the following acoustic features until the word sequences of all acoustic features in the audio signal to be processed have been acquired in order; and to connect the word sequences of all the acoustic features in order to form the speech recognition lattice.
In one embodiment, the jump edge also carries a language model score for the acoustic feature; the voice recognition result determining unit is also used for acquiring the total score of each word sequence in the voice recognition grid lattice based on the acoustic model score of the acoustic characteristics and the language model score of the acoustic characteristics; acquiring a word sequence with the highest total score of the word sequence as a target word sequence; and acquiring word information of the target word sequence, and taking the word information in the target word sequence as a voice recognition result.
In one embodiment, the scene to be recognized comprises at least one of a medical scene, an image processing scene, and an e-sports scene.
The division of the modules in the speech recognition apparatus is only for illustration, and in other embodiments, the speech recognition apparatus may be divided into different modules as needed to complete all or part of the functions of the speech recognition apparatus.
Fig. 12 is a schematic diagram of the internal structure of the server in one embodiment. As shown in fig. 12, the server includes a processor and a memory connected by a system bus. The processor provides computing and control capability and supports the operation of the whole server. The memory may include a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the computer program can be executed by the processor to implement the speech recognition method provided in the foregoing embodiments. The internal memory provides a cached execution environment for the operating system and the computer program in the non-volatile storage medium. The server may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.
The implementation of each module in the speech recognition apparatus provided in the embodiments of the present application may be in the form of a computer program. The computer program may be run on a terminal or a server. The program modules constituted by the computer program may be stored on the memory of the terminal or the server. Which when executed by a processor, performs the steps of the method described in the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the speech recognition method.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform a speech recognition method.
Any reference to memory, storage, database, or other medium used by embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and bus dynamic RAM (RDRAM).
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (12)

1. A speech recognition method, comprising:
extracting acoustic features of the voice data to be processed;
inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub-decoding network is a decoding graph obtained by training on the target named entities in the scene to be recognized.
2. The method of claim 1, wherein the generation process of the primary decoding network comprises:
hollowing out the named entities in the original text training corpus to obtain a target text training corpus;
training the target text training corpus to obtain a language model;
training the voice training corpus corresponding to the original text training corpus to obtain an acoustic model;
and combining the language model and the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed positions in the target text training corpus.
3. The method of claim 2, wherein combining the language model with the acoustic model to obtain a primary decoding network comprises:
and combining the language model and the acoustic model by adopting a compound algorithm to obtain a main decoding network.
4. The method of claim 2, wherein the generation process of the sub-decoding network comprises:
acquiring a target named entity in a scene to be identified to form a target named entity text;
assigning a language model score to the target named entity text;
and combining the target named entity text endowed with the language model score with the acoustic model to obtain a sub-decoding network.
5. The method of claim 1, wherein the decoding the acoustic features and the acoustic model scores of the acoustic features using a main decoding network and a sub decoding network to obtain a speech recognition result comprises:
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting the main decoding network and the sub decoding networks to obtain a speech recognition grid lattice;
and obtaining a voice recognition result based on the voice recognition grid lattice.
6. The method of claim 5, wherein the speech recognition lattice comprises a plurality of word sequences, the word sequences comprising nodes and jump edges, the jump edges carrying word information of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting the main decoding network and the sub decoding network to obtain the speech recognition lattice comprises:
sequentially acquiring, from the main decoding network, word sequences corresponding to the acoustic features in the speech data to be processed;
if the word information on a jump edge of an intermediate node of a word sequence is empty, calling the sub decoding network and acquiring, from the sub decoding network, the word sequence corresponding to the next acoustic feature;
returning to the main decoding network after a termination node of the word sequence is reached in the sub decoding network, and continuing to acquire, from the main decoding network, the word sequences corresponding to subsequent acoustic features until the word sequences of all the acoustic features in the speech data to be processed have been acquired in sequence;
and connecting the word sequences of all the acoustic features in sequence to form the speech recognition lattice.
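By way of illustration and not limitation, the following toy walk-through shows only the main/sub switching behaviour described in claim 6, under strong simplifying assumptions: the acoustic evidence is assumed to be already segmented into word-level hypotheses, the main decoding network is reduced to a single word sequence in which None marks a jump edge with empty word information, and the sub decoding network is the entity table from the previous sketch. A real decoder searches weighted graphs frame by frame; none of the names below come from the disclosure.

```python
def decode_with_switching(word_hypotheses, main_sequence, sub_network):
    """Follow the main sequence; on an edge with empty word information, consume words from the sub network."""
    result, i = [], 0
    for edge_word in main_sequence:
        if edge_word is not None:
            # Stay in the main decoding network: this edge carries real word information.
            result.append(word_hypotheses[i])
            i += 1
            continue
        # Empty word information: switch to the sub decoding network and match the
        # longest entity that starts at the current position in the hypotheses.
        match = None
        for entity in sub_network:                # each entity is a tuple of words
            if tuple(word_hypotheses[i:i + len(entity)]) == entity:
                if match is None or len(entity) > len(match):
                    match = entity
        if match is None:
            raise ValueError("no entity in the sub decoding network matches here")
        # Termination node of the entity reached: append its words and return to the main network.
        result.extend(match)
        i += len(match)
    return result

if __name__ == "__main__":
    sub_net = {("doctor", "zhang"): -1.1, ("doctor", "li"): -1.1, ("nurse", "wang"): -1.1}
    main_seq = ["please", "transfer", "to", None]            # None = hollowed-out slot / empty edge
    hyps = ["please", "transfer", "to", "doctor", "zhang"]
    print(decode_with_switching(hyps, main_seq, sub_net))    # ['please', 'transfer', 'to', 'doctor', 'zhang']
```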
7. The method of claim 6, wherein the jump edges also carry language model scores of the acoustic features; and obtaining the speech recognition result based on the speech recognition lattice comprises:
acquiring a total score of each word sequence in the speech recognition lattice based on the acoustic model scores of the acoustic features and the language model scores of the acoustic features;
acquiring the word sequence with the highest total score as a target word sequence;
and acquiring word information of the target word sequence, and taking the word information in the target word sequence as the speech recognition result.
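As a non-limiting illustration of claim 7, the sketch below treats the lattice as a plain list of alternative paths, each edge carrying a word, an acoustic log-score, and a language-model log-score; the total score of a path is their (optionally weighted) sum, and the highest-scoring path yields the recognition result. The LM weight and the scores themselves are invented for the example.

```python
def best_path(lattice, lm_weight=1.0):
    """Return the word sequence whose summed acoustic + weighted language-model score is highest."""
    def total(path):
        return sum(am + lm_weight * lm for _, am, lm in path)
    winner = max(lattice, key=total)
    return [word for word, _, _ in winner], total(winner)

if __name__ == "__main__":
    # Each edge: (word, acoustic log-score, language-model log-score)
    lattice = [
        [("transfer", -2.1, -1.0), ("to", -1.2, -0.5), ("doctor", -1.9, -0.8), ("zhang", -2.0, -1.1)],
        [("transfer", -2.1, -1.0), ("two", -1.5, -2.3), ("doctor", -1.9, -0.8), ("zhang", -2.0, -1.1)],
    ]
    words, score = best_path(lattice)
    print(words, round(score, 2))   # the "to" path wins on total score
```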
8. The method of claim 4, wherein the scene to be recognized comprises at least one of a medical scene, an image processing scene, and a telestration scene.
9. A speech recognition apparatus, characterized in that the apparatus comprises:
the acoustic feature extraction module is used for extracting acoustic features of the speech data to be processed;
an acoustic model score calculation module, configured to input the extracted acoustic features into an acoustic model, and calculate an acoustic model score of the acoustic features;
the decoding module is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub decoding network is a decoding graph obtained by training on target named entities in a scene to be recognized.
10. The apparatus of claim 9, further comprising: a main decoding network generation module, configured to hollow out named entities in the original text training corpus to obtain a target text training corpus; train the target text training corpus to obtain a language model; train a speech training corpus corresponding to the original text training corpus to obtain an acoustic model; and combine the language model and the acoustic model to obtain the main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed-out positions in the target text training corpus.
11. A server comprising a memory and a processor, the memory having stored thereon a computer program, wherein the computer program, when executed by the processor, causes the processor to perform the steps of the speech recognition method according to any of claims 1 to 8.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 8.
CN202011607655.7A 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium Active CN112802461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607655.7A CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607655.7A CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112802461A true CN112802461A (en) 2021-05-14
CN112802461B CN112802461B (en) 2023-10-24

Family

ID=75804372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607655.7A Active CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112802461B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1499484A (en) * 2002-11-06 2004-05-26 北京天朗语音科技有限公司 Recognition system of Chinese continuous speech
EP1981020A1 (en) * 2007-04-12 2008-10-15 France Télécom Method and system for automatic speech recognition adapted for detecting utterances out of context
US20140278390A1 (en) * 2013-03-12 2014-09-18 International Business Machines Corporation Classifier-based system combination for spoken term detection
CN106294460A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 A kind of Chinese speech keyword retrieval method based on word and word Hybrid language model
US20180061396A1 (en) * 2016-08-24 2018-03-01 Knowles Electronics, Llc Methods and systems for keyword detection using keyword repetitions
CN108615525A (en) * 2016-12-09 2018-10-02 ***通信有限公司研究院 A kind of audio recognition method and device
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN110942763A (en) * 2018-09-20 2020-03-31 阿里巴巴集团控股有限公司 Voice recognition method and device
CN110634474A (en) * 2019-09-24 2019-12-31 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence
CN111128183A (en) * 2019-12-19 2020-05-08 北京搜狗科技发展有限公司 Speech recognition method, apparatus and medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium
CN113327597A (en) * 2021-06-23 2021-08-31 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113327597B (en) * 2021-06-23 2023-08-22 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN114898754A (en) * 2022-07-07 2022-08-12 北京百度网讯科技有限公司 Decoding graph generation method, decoding graph generation device, speech recognition method, speech recognition device, electronic equipment and storage medium
CN114898754B (en) * 2022-07-07 2022-09-30 北京百度网讯科技有限公司 Decoding image generation method, decoding image generation device, speech recognition method, speech recognition device, electronic device and storage medium

Also Published As

Publication number Publication date
CN112802461B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
Pepino et al. Emotion recognition from speech using wav2vec 2.0 embeddings
CN110287283B (en) Intention model training method, intention recognition method, device, equipment and medium
CN108831439B (en) Voice recognition method, device, equipment and system
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN108899013B (en) Voice search method and device and voice recognition system
KR102413693B1 (en) Speech recognition apparatus and method, Model generation apparatus and method for Speech recognition apparatus
JP4195428B2 (en) Speech recognition using multiple speech features
CN112712813B (en) Voice processing method, device, equipment and storage medium
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN112102815A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
CN112614510B (en) Audio quality assessment method and device
CN112270184B (en) Natural language processing method, device and storage medium
CN113450757A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
TWI752406B (en) Speech recognition method, speech recognition device, electronic equipment, computer-readable storage medium and computer program product
CN112151020B (en) Speech recognition method, device, electronic equipment and storage medium
CN113506586B (en) Method and system for identifying emotion of user
WO2021229643A1 (en) Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
CN116469375A (en) End-to-end speech synthesis method, device, equipment and medium
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
CN115762574A (en) Voice-based action generation method and device, electronic equipment and storage medium
Deng et al. History utterance embedding transformer lm for speech recognition
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant