CN113327585A - Automatic voice recognition method based on deep neural network - Google Patents

Automatic voice recognition method based on deep neural network

Info

Publication number
CN113327585A
Authority
CN
China
Prior art keywords
layer
output
gru
pinyin
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110599305.9A
Other languages
Chinese (zh)
Other versions
CN113327585B (en)
Inventor
王蒙
付志勇
胡奎
姜黎
潘艾婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ccvui Intelligent Technology Co ltd
Original Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Ccvui Intelligent Technology Co ltd filed Critical Hangzhou Ccvui Intelligent Technology Co ltd
Priority to CN202110599305.9A priority Critical patent/CN113327585B/en
Publication of CN113327585A publication Critical patent/CN113327585A/en
Application granted granted Critical
Publication of CN113327585B publication Critical patent/CN113327585B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/005 Language recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides an automatic speech recognition method based on a deep neural network and relates to the field of automatic speech recognition. The method uses Log Fbank as the acoustic feature: a 40-dimensional feature is sufficient to express the characteristics of the acoustic signal, which greatly simplifies the signal processing pipeline and the size of the neural network model. A neural network model combining VGG with a bidirectional GRU is adopted, and the GRU can fully exploit the information of the preceding and following speech frames, yielding the best recognition effect. CTC decoding merges consecutive identical results and removes redundant results, and the recognized pinyin sequence is obtained through a predefined pinyin-list mapping without alignment, so manual alignment is avoided. A hidden Markov language model takes the pinyin sequence as input and obtains the corresponding character recognition result; it can give the best recognition result among the different characters corresponding to the same pinyin, which greatly improves the accuracy of automatic speech recognition.

Description

Automatic voice recognition method based on deep neural network
Technical Field
The invention relates to the field of automatic voice recognition, in particular to an automatic voice recognition method based on a deep neural network.
Background
With the continuous development of human-computer interaction technology, interaction modes have become diverse: where interaction once relied on text input, it can now be carried out conveniently and rapidly through speech. Automatic speech recognition is an extremely important part of this continually improving interaction technology.
Automatic speech recognition (ASR) technology converts the natural-language content of collected human speech into computer-readable input, and the accuracy and speed of this conversion directly determine the effectiveness and practicality of human-computer interaction. How to improve the accuracy and speed of automatic speech recognition has therefore become a widely discussed problem in the field of human-computer interaction.
To this end, the invention application CN201811112506.6 proposes a speech recognition method based on a convolutional neural network, which comprises: preprocessing the input original speech signal; extracting key feature parameters that reflect the characteristics of the speech signal to form a feature-vector sequence; constructing an end-to-end acoustic model based on a DCNN network with connectionist temporal classification (CTC) as the loss function; training the acoustic model to obtain a trained acoustic model; and inputting the feature-vector sequence to be recognized into the trained acoustic model to obtain a recognition result, which is then passed through a language model to obtain the finally recognized text.
That method has a simple modeling process and is easy to train, but the adopted acoustic features have too many dimensions and contain a great deal of redundant information, so the constructed neural network model is too large. Moreover, the DCNN model is dated: its capacity to learn acoustic features is insufficient, and it cannot fully exploit the correlation between preceding and following speech frames.
Another invention application, CN202010019733.5, provides an automatic speech recognition method and system based on artificial intelligence. It uses a speech training and recognition module to learn speech features and the character codes corresponding to the speech: a feature learning layer first performs convolutional learning on the spectral features, a semantic learning layer then learns the semantic information among those features, and an output layer finally decodes the jointly learned information to output the corresponding text. Labels are encoded and decoded directly with a Chinese-character mapping table, so the text does not need to be encoded into phonemes and decoded back into text, which simplifies the training process.
However, MFCC features carry redundant information such as voiceprint characteristics that does not help the recognition task. The acoustic model adopts a CRNN structure whose convolution kernels and strides are too large, so the extracted features are coarse; the deep and wide recurrent network that follows easily leads to gradient explosion or overfitting during training.
A further invention application, CN201811538408.9, provides a speech recognition training system and method that preprocesses the input speech, extracts speech-signal features with a CNN, recognizes the features with an RNN, and fits the model with a homophone loss function and an approximate loss function to achieve speech recognition.
That application improves the accuracy and speed of the system by providing multiple loss functions to cope with different situations arising from commonly recognized errors. However, with a CRNN as the acoustic model, the CNN's ability to learn features is weaker than VGG's and the RNN is difficult to train; fitting with the homophone loss function and the approximate loss function also requires aligning the speech data, which entails a huge workload.
Therefore, a new method and system that provide better speech recognition are needed to solve the above technical problems.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an automatic speech recognition method based on a deep neural network, which comprises the following steps:
sampling an original voice signal through audio acquisition equipment, and obtaining original voice data;
extracting Log Fbank acoustic characteristics of original voice data;
constructing an acoustic model;
inputting the Log Fbank acoustic characteristics into an acoustic model to obtain acoustic model output data;
performing CTC decoding on the acoustic model output data to obtain decoded data;
mapping the decoded data through a preset pinyin list to obtain a pinyin sequence;
and inputting the pinyin sequence into a language model for language recognition, and obtaining the language recognition result.
In particular:
As a further solution, the audio acquisition device samples the original speech signal at a sampling rate of 16000 Hz, the original speech data are stored as 16-bit integers, and the duration of each piece of original speech data is not more than 4 seconds.
As a further solution, extracting the Log Fbank acoustic features of the original speech data requires the following steps:
pre-emphasizing the original speech data with a high-pass filter;
framing the pre-emphasized data with a framing function;
windowing each frame by substituting it into a window function;
performing a fast Fourier transform on each windowed frame signal to obtain the energy spectrum of each frame;
taking the dot product of the energy spectrum with a Mel filter bank to obtain a Mel spectrogram;
applying a logarithmic transform to the Mel spectrogram;
and performing a discrete cosine transform on the log-transformed Mel spectrogram.
As a further solution, the acoustic model is a neural network acoustic model combining VGG and Bi-GRU, and it comprises VGG layers, Dense layers and Bi-GRU layers; the acoustic model obtains the original prediction data from the Log Fbank acoustic features through the following steps:
the Log Fbank acoustic features are input into a VGG layer and the output is sent to the next layer for processing; the acoustic model contains 8 groups of VGG layers in total, connected in series end to end, so the VGG computation is performed 8 times to obtain the final VGG layer output data;
inputting the VGG layer output data into a Dense layer for feature smoothing to obtain the feature-smoothed output;
feeding the feature-smoothed output into a Bi-GRU layer for calculation to obtain the first Bi-GRU layer output;
feeding the first Bi-GRU layer output into a Bi-GRU layer again to obtain the second Bi-GRU layer output;
inputting the second Bi-GRU layer output into a Dense layer for feature smoothing to obtain the second feature-smoothed output;
and inputting the second feature-smoothed output into a Dense layer again to obtain the acoustic model output data.
As a further solution, each VGG layer is formed by a first CNN layer, a second CNN layer and a Max_pooling layer connected in series, where the first and second CNN layers perform data convolution and the Max_pooling layer performs data pooling; the convolution kernel of the first CNN layer is 5 × 5 and that of the second CNN layer is 3 × 3.
As a further solution, CTC decoding processes the acoustic model output data so that identical results occurring consecutively are merged and redundant results are removed.
As a further solution, the language model is a hidden Markov language model, which takes the pinyin sequence as its input and obtains the corresponding character recognition result; the language model performs pinyin-to-text conversion through the following steps:
s1, taking the pinyin sequence as input and, through an initial-and-final segmentation method, obtaining a pinyin sequence whose basic segmentation unit is the pinyin group;
s2, mapping each pinyin group through a pinyin-character dictionary to obtain the corresponding character sequence, where the character sequence stores the different Chinese characters corresponding to the same pinyin group;
s3, setting the initial probability value of every Chinese character appearing in the character sequence of each pinyin group to 1;
s4, arranging and combining all Chinese characters in the character sequences of adjacent pinyin groups into two-character phrases, and storing them as a screening sequence;
s5, constructing a two-character frequency dictionary, in which the occurrence frequency values of commonly used two-character phrases, commonly used domain-specific two-character phrases and other two-character phrases are stored;
s6, looking up each combined two-character phrase of the screening sequence in the two-character frequency dictionary; if the phrase exists it is kept, otherwise it is deleted; this yields the final state transition sequence;
s7, constructing a single-character frequency dictionary, in which the frequency values of commonly used characters, domain-specific characters and other characters are stored;
s8, calculating the state transition probability of each two-character phrase in the state transition sequence, where the transition probability formula is:
P = P0 · P2(A·B) / P1(A)
where A and B respectively denote the first and last character of the two-character phrase, P0 denotes the initial probability value, P2(A·B) denotes the occurrence frequency value of the two-character phrase, and P1(A) denotes the frequency value of the first character appearing as a single character;
s9, comparing the state transition probability of each two-character phrase with a transition threshold; if it is higher than the threshold, the current two-character phrase is taken as the updated output result and the current state transition probability value is stored;
and s10, repeating steps S1 to S9 until all transition probability values and corresponding output results are obtained, and arranging the output results in order as the final language recognition result.
As a further solution, the Bi-GRU layer, i.e. the bidirectional GRU neural network model, comprises a forward GRU unit and a backward GRU unit; the input data enter the forward GRU unit and the backward GRU unit respectively for calculation, and the outputs of the two are concatenated or summed as the output of the Bi-GRU layer.
As a further solution, the automatic speech recognition method can be used for automatic speech recognition of the national language and/or foreign languages, and the pinyin-character dictionary is then a dictionary that maps the pronunciations of the recognized language to its characters.
As a further solution, the Hamming window function is:
w(n) = a0 - (1 - a0) · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n denotes the sample index of the truncated signal, a0 denotes the Hamming window constant with a value of 25/46, and N - 1 denotes the truncation window length of the Hamming window;
the Mel filter function of the Mel filter is:
Mel(f) = 2595 · log10(1 + f / 700)
where f denotes the frequency of the filtered signal.
Compared with the related art, the automatic speech recognition method based on a deep neural network provided by the invention has the following beneficial effects:
1. The invention uses Log Fbank as the acoustic feature; a 40-dimensional feature is sufficient to express the characteristics of the acoustic signal, which greatly simplifies the signal processing pipeline and the size of the neural network model and helps considerably in reducing computation and storage. A neural network model combining VGG with a bidirectional GRU is adopted, and the GRU can fully exploit the information of the preceding and following speech frames, yielding the best recognition effect.
2. The invention merges consecutive identical results through CTC decoding and removes redundant results; the recognized pinyin sequence is obtained through a predefined pinyin-list mapping without alignment, so manual alignment is avoided.
3. The invention adopts a hidden Markov language model that takes the pinyin sequence as input and obtains the corresponding character recognition result; it can give the best recognition result among the different characters corresponding to the same pinyin, which greatly improves the accuracy of automatic speech recognition.
Drawings
FIG. 1 is a system flow diagram illustrating an automatic speech recognition method based on deep neural network according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of an acoustic model according to an embodiment of the present invention;
FIG. 3 is a diagram of a Bi-GRU layer according to a preferred embodiment of the method for automatic speech recognition based on deep neural network of the present invention.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
As shown in fig. 1 to 3, the automatic speech recognition method based on the deep neural network of the present invention performs automatic speech recognition by the following steps:
sampling an original voice signal through audio acquisition equipment, and obtaining original voice data;
extracting Log Fbank acoustic characteristics of original voice data;
constructing an acoustic model;
inputting the Log Fbank acoustic characteristics into an acoustic model to obtain acoustic model output data;
performing CTC decoding on the acoustic model output data to obtain decoded data;
mapping the decoded data through a preset pinyin list to obtain a pinyin sequence;
and inputting the pinyin sequence into a language model for language recognition, and obtaining the language recognition result.
As a further solution, the audio acquisition device samples the original speech signal at a sampling rate of 16000 Hz, the original speech data are stored as 16-bit integers, and the duration of each piece of original speech data is not more than 4 seconds.
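As an illustration of this data format, the following is a minimal sketch that loads one such piece of speech data and checks the stated constraints; it assumes the acquisition device has already written a mono 16 kHz, 16-bit WAV file, and the file name in the usage line is hypothetical.

```python
import wave

import numpy as np

def load_raw_speech(path, expected_rate=16000, max_seconds=4.0):
    """Load one piece of raw speech data and check the format described above.

    Assumes a mono WAV file sampled at 16000 Hz with 16-bit integer samples.
    """
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1, "expected mono audio"
        assert wav.getframerate() == expected_rate, "expected a 16000 Hz sampling rate"
        assert wav.getsampwidth() == 2, "expected 16-bit integer samples"
        n_frames = wav.getnframes()
        assert n_frames / expected_rate <= max_seconds, "each piece should be at most 4 seconds"
        pcm = wav.readframes(n_frames)
    return np.frombuffer(pcm, dtype=np.int16)

# samples = load_raw_speech("example_utterance.wav")  # hypothetical file name
```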
As a further solution, extracting the Log Fbank acoustic features of the original speech data requires the following steps:
pre-emphasizing the original speech data with a high-pass filter;
framing the pre-emphasized data with a framing function;
windowing each frame by substituting it into a window function;
performing a fast Fourier transform on each windowed frame signal to obtain the energy spectrum of each frame;
taking the dot product of the energy spectrum with a Mel filter bank to obtain a Mel spectrogram;
applying a logarithmic transform to the Mel spectrogram;
and performing a discrete cosine transform on the log-transformed Mel spectrogram.
Specifically, Log Fbank is adopted as the acoustic feature, and a 40-dimensional feature is sufficient to express the characteristics of the acoustic signal, which greatly simplifies the signal processing pipeline and the size of the neural network model and helps considerably in reducing computation and storage. A neural network model combining VGG with a bidirectional GRU is adopted: VGG is among the convolutional neural network structures with the strongest feature-learning capability at present, and the GRU can fully exploit the information of the preceding and following speech frames, yielding the best recognition effect.
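The feature-extraction steps above can be sketched as follows. This is only an illustrative implementation under stated assumptions: the frame length (25 ms), frame step (10 ms), FFT size and pre-emphasis coefficient are not given in the patent, and the final DCT step listed above is kept only as a comment, since the 40-dimensional Log Fbank feature itself is the log Mel spectrogram.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters spanning 0 Hz to the Nyquist frequency."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_fbank(samples, sample_rate=16000, n_filters=40,
              frame_len=400, frame_step=160, n_fft=512, pre_emph=0.97):
    # 1. pre-emphasis (a simple first-order high-pass filter)
    x = np.append(samples[0], samples[1:] - pre_emph * samples[:-1]).astype(np.float64)
    # 2. framing (25 ms frames with a 10 ms step at 16 kHz; sizes are illustrative)
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // frame_step
    frames = np.stack([x[i * frame_step: i * frame_step + frame_len] for i in range(n_frames)])
    # 3. windowing each frame with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 4. fast Fourier transform -> per-frame energy (power) spectrum
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # 5. dot product with the Mel filter bank -> Mel spectrogram
    mel_spec = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 6. logarithmic transform -> the 40-dimensional Log Fbank feature per frame
    log_mel = np.log(mel_spec + 1e-10)
    # 7. the description also lists a DCT over the log Mel spectrogram, e.g.
    #    from scipy.fftpack import dct; cepstra = dct(log_mel, type=2, axis=1, norm="ortho")
    return log_mel
```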
As a further solution, the acoustic model is a neural network acoustic model combining VGG and Bi-GRU, and it comprises VGG layers, Dense layers and Bi-GRU layers; the acoustic model obtains the original prediction data from the Log Fbank acoustic features through the following steps:
the Log Fbank acoustic features are input into a VGG layer and the output is sent to the next layer for processing; the acoustic model contains 8 groups of VGG layers in total, connected in series end to end, so the VGG computation is performed 8 times to obtain the final VGG layer output data;
inputting the VGG layer output data into a Dense layer for feature smoothing to obtain the feature-smoothed output;
feeding the feature-smoothed output into a Bi-GRU layer for calculation to obtain the first Bi-GRU layer output;
feeding the first Bi-GRU layer output into a Bi-GRU layer again to obtain the second Bi-GRU layer output;
inputting the second Bi-GRU layer output into a Dense layer for feature smoothing to obtain the second feature-smoothed output;
and inputting the second feature-smoothed output into a Dense layer again to obtain the acoustic model output data.
As a further solution, each VGG layer is formed by a first CNN layer, a second CNN layer and a Max_pooling layer connected in series, where the first and second CNN layers perform data convolution and the Max_pooling layer performs data pooling; the convolution kernel of the first CNN layer is 5 × 5 and that of the second CNN layer is 3 × 3.
Specifically, the VGG model, with its deeper layers and wider feature maps, is a preferred structure for extracting acoustic features, and adopting Log Fbank as the acoustic feature largely removes the differences between speakers. A neural network model combining VGG with a bidirectional GRU is adopted: VGG is among the convolutional networks with the strongest feature-learning capability at present, and the GRU, a variant of the recurrent neural network, is far easier to train than an ordinary recurrent network.
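One possible realization of this layer pattern in Keras is sketched below. The filter counts, GRU width, pinyin vocabulary size and pooling schedule are assumptions: the description fixes only the pattern of 8 VGG groups (a 5 × 5 CNN, a 3 × 3 CNN and a Max_pooling layer), a Dense layer, two Bi-GRU layers and two further Dense layers, so pooling is applied here only in the first three groups to keep the time axis long enough for CTC decoding.

```python
from tensorflow.keras import layers, models

def build_acoustic_model(n_mels=40, n_pinyin_classes=1423, gru_units=256):
    """Sketch of the VGG + Bi-GRU acoustic model described above.

    Filter counts, GRU width, the pinyin vocabulary size and the pooling
    schedule are assumptions; only the layer pattern follows the description.
    """
    inputs = layers.Input(shape=(None, n_mels, 1), name="log_fbank")  # (time, mel, channel)
    x = inputs
    for i in range(8):                                    # 8 VGG groups connected in series
        filters = 32 * min(2 ** i, 4)                     # assumed filter counts
        x = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)  # first CNN, 5x5
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)  # second CNN, 3x3
        if i < 3:                                         # assumption: pool only in the first three
            x = layers.MaxPooling2D(pool_size=(2, 2))(x)  # groups so the time axis stays long
    x = layers.TimeDistributed(layers.Flatten())(x)       # keep time, flatten mel and channel axes
    x = layers.Dense(gru_units, activation="relu")(x)     # first Dense layer (feature smoothing)
    x = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(x)  # first Bi-GRU layer
    x = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(x)  # second Bi-GRU layer
    x = layers.Dense(gru_units, activation="relu")(x)     # second Dense layer (feature smoothing)
    outputs = layers.Dense(n_pinyin_classes + 1, activation="softmax")(x)  # +1 for the CTC blank
    return models.Model(inputs, outputs)

model = build_acoustic_model()
model.summary()
```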
As a further solution, CTC decoding processes the acoustic model output data so that identical results occurring consecutively are merged and redundant results are removed.
Specifically, CTC decoding merges consecutive identical results, removes the redundant results, and obtains the recognized pinyin sequence through the mapping of a predefined pinyin list. For example, if the recognized result is "ABBBB" and the target output length is 4, then according to the CTC coding requirement A is preserved and the repeated B's are merged, giving "ABBB". With the CTC loss function no alignment is needed, so manual alignment is avoided.
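A sketch of the standard greedy (best-path) CTC collapse is shown below: consecutive identical labels are merged and the blank label is dropped, after which the remaining indices are mapped through the pinyin list. The pinyin list and per-frame outputs here are hypothetical.

```python
def ctc_greedy_decode(frame_label_ids, blank_id=0):
    """Merge consecutive identical outputs and drop blanks (redundant results).

    `frame_label_ids` is the per-frame argmax of the acoustic model output.
    """
    decoded = []
    previous = None
    for label in frame_label_ids:
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical pinyin list used for the mapping (index 0 is reserved for the CTC blank).
PINYIN_LIST = ["<blank>", "ni3", "hao3", "ma5"]

frame_outputs = [1, 1, 0, 2, 2, 2, 0, 0, 3]   # e.g. the argmax over each output frame
pinyin_sequence = [PINYIN_LIST[i] for i in ctc_greedy_decode(frame_outputs)]
print(pinyin_sequence)                         # ['ni3', 'hao3', 'ma5']
```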
As a further solution, the language model is a hidden Markov language model, which takes the pinyin sequence as its input and obtains the corresponding character recognition result; the language model performs pinyin-to-text conversion through the following steps:
s1, taking the pinyin sequence as input and, through an initial-and-final segmentation method, obtaining a pinyin sequence whose basic segmentation unit is the pinyin group;
s2, mapping each pinyin group through a pinyin-character dictionary to obtain the corresponding character sequence, where the character sequence stores the different Chinese characters corresponding to the same pinyin group;
s3, setting the initial probability value of every Chinese character appearing in the character sequence of each pinyin group to 1;
s4, arranging and combining all Chinese characters in the character sequences of adjacent pinyin groups into two-character phrases, and storing them as a screening sequence;
s5, constructing a two-character frequency dictionary, in which the occurrence frequency values of commonly used two-character phrases, commonly used domain-specific two-character phrases and other two-character phrases are stored;
s6, looking up each combined two-character phrase of the screening sequence in the two-character frequency dictionary; if the phrase exists it is kept, otherwise it is deleted; this yields the final state transition sequence;
s7, constructing a single-character frequency dictionary, in which the frequency values of commonly used characters, domain-specific characters and other characters are stored;
s8, calculating the state transition probability of each two-character phrase in the state transition sequence, where the transition probability formula is:
P = P0 · P2(A·B) / P1(A)
where A and B respectively denote the first and last character of the two-character phrase, P0 denotes the initial probability value, P2(A·B) denotes the occurrence frequency value of the two-character phrase, and P1(A) denotes the frequency value of the first character appearing as a single character;
s9, comparing the state transition probability of each two-character phrase with a transition threshold; if it is higher than the threshold, the current two-character phrase is taken as the updated output result and the current state transition probability value is stored;
and s10, repeating steps S1 to S9 until all transition probability values and corresponding output results are obtained, and arranging the output results in order as the final language recognition result.
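A much simplified sketch of the two-character-phrase scoring in steps S2 to S9 is given below. The dictionaries are tiny hypothetical stand-ins for the frequency dictionaries of steps S5 and S7, and only the best character pair for one pair of adjacent pinyin groups is returned, rather than the full sequence decoding of step S10.

```python
# Hypothetical dictionaries; in practice they would be built from general-purpose
# and domain-specific corpora as described in steps S5 and S7.
PINYIN_TO_CHARS = {"zhong1": ["中", "钟", "忠"], "guo2": ["国", "果"]}
TWO_CHAR_FREQ = {("中", "国"): 5000.0, ("中", "果"): 1.0}    # P2(A·B)
SINGLE_CHAR_FREQ = {"中": 8000.0, "钟": 300.0, "忠": 200.0}  # P1(A)

def best_transition(prev_pinyin, next_pinyin, p0=1.0, threshold=0.0):
    """Score every two-character combination of adjacent pinyin groups with
    P = P0 * P2(A·B) / P1(A) and keep the best pair above the threshold."""
    best_pair, best_score = None, threshold
    for a in PINYIN_TO_CHARS.get(prev_pinyin, []):
        for b in PINYIN_TO_CHARS.get(next_pinyin, []):
            freq_ab = TWO_CHAR_FREQ.get((a, b))
            if freq_ab is None:                 # step S6: unseen two-character phrases are deleted
                continue
            score = p0 * freq_ab / SINGLE_CHAR_FREQ.get(a, 1.0)
            if score > best_score:              # step S9: keep the candidate above the threshold
                best_pair, best_score = (a, b), score
    return best_pair, best_score

print(best_transition("zhong1", "guo2"))        # (('中', '国'), 0.625)
```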
As a further solution, the Bi-GRU layer, i.e. the bidirectional GRU neural network model, comprises a forward GRU unit and a backward GRU unit; the input data enter the forward GRU unit and the backward GRU unit respectively for calculation, and the outputs of the two are concatenated or summed as the output of the Bi-GRU layer.
Specifically, the Bi-GRU is a bidirectional GRU neural network model: the input is processed once in the forward direction by a GRU and once in the reverse direction (the input sequence is reversed and passed through a GRU), and the outputs of the two passes are concatenated (or summed); the model is shown in FIG. 3.
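The bidirectional arrangement can be made explicit with a small NumPy sketch: one GRU pass runs over the input in the forward direction, a second pass runs over the reversed input and is re-reversed to align its time steps, and the two outputs are spliced or summed. The gate equations follow the standard GRU formulation; the weights and sizes here are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_sequence(x, p):
    """Run one GRU direction over x of shape (time, input_dim); returns (time, units)."""
    h = np.zeros(p["U_z"].shape[0])
    outputs = []
    for x_t in x:
        z = sigmoid(x_t @ p["W_z"] + h @ p["U_z"])              # update gate
        r = sigmoid(x_t @ p["W_r"] + h @ p["U_r"])              # reset gate
        h_tilde = np.tanh(x_t @ p["W_h"] + (r * h) @ p["U_h"])  # candidate state
        h = z * h + (1.0 - z) * h_tilde                         # new hidden state
        outputs.append(h)
    return np.stack(outputs)

def bi_gru(x, forward_params, backward_params, merge="concat"):
    """Forward pass over x, backward pass over the reversed x (re-reversed to align
    time steps), then splice (concatenate) or sum the two outputs."""
    forward = gru_sequence(x, forward_params)
    backward = gru_sequence(x[::-1], backward_params)[::-1]
    if merge == "concat":
        return np.concatenate([forward, backward], axis=-1)
    return forward + backward

def random_params(input_dim, units, rng):
    return {k: rng.standard_normal((input_dim if k.startswith("W") else units, units)) * 0.1
            for k in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))                 # 10 time steps, 8 input features
y = bi_gru(x, random_params(8, 16, rng), random_params(8, 16, rng))
print(y.shape)                                   # (10, 32): forward and backward outputs spliced
```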
As a further solution, the automatic speech recognition method can be used for automatic speech recognition of the national language and/or foreign languages, and the pinyin-character dictionary is then a dictionary that maps the pronunciations of the recognized language to its characters.
As a further solution, the Hamming window function is:
w(n) = a0 - (1 - a0) · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n denotes the sample index of the truncated signal, a0 denotes the Hamming window constant with a value of 25/46, and N - 1 denotes the truncation window length of the Hamming window.
It should be noted that directly applying a rectangular window to the signal causes spectral leakage due to truncation. To mitigate this, this embodiment windows the signal with a Hamming window function: owing to the amplitude-frequency characteristic of the Hamming window, its side-lobe attenuation is large (the first side-lobe peak is about 43 dB below the main-lobe peak), so the spectral leakage is improved.
The Mel filter function of the Mel filter is:
Mel(f) = 2595 · log10(1 + f / 700)
where f denotes the frequency of the filtered signal.
It should be noted that the loudness perceived by the human ear is not linearly proportional to the frequency of the sound; the Mel frequency scale better matches the auditory characteristics of the human ear, and the Mel filter bank is set up to better match human hearing.
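The nonlinearity of the scale can be illustrated with a short computation: equal steps on the Mel axis correspond to narrow frequency steps at low frequencies and wide steps at high frequencies, so a 40-filter bank gives finer resolution where hearing is more sensitive. The helper names follow the earlier sketch and the 16 kHz sampling rate used above.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# At a 16000 Hz sampling rate the Nyquist frequency is 8000 Hz, i.e. about 2840 mel,
# so the centers of 40 triangular filters spaced uniformly on the Mel axis are:
centers_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 40 + 2)[1:-1]
centers_hz = mel_to_hz(centers_mel)
print(np.round(centers_hz[:3]))    # low filters lie close together (fine low-frequency resolution)
print(np.round(centers_hz[-3:]))   # high filters lie far apart (coarse high-frequency resolution)
```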
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An automatic speech recognition method based on a deep neural network is characterized in that automatic speech recognition is carried out through the following steps:
sampling an original voice signal through audio acquisition equipment, and obtaining original voice data;
extracting Log Fbank acoustic characteristics of original voice data;
constructing an acoustic model;
inputting the Log Fbank acoustic characteristics into an acoustic model to obtain acoustic model output data;
performing CTC decoding on the acoustic model output data to obtain decoded data;
mapping the decoded data through a preset pinyin list to obtain a pinyin sequence;
and inputting the pinyin sequence into a language model for language recognition, and obtaining the language recognition result.
2. The automatic speech recognition method based on a deep neural network of claim 1, wherein the audio acquisition device samples the original speech signal at a sampling rate of 16000 Hz, the original speech data are stored as 16-bit integers, and the duration of each piece of original speech data is not more than 4 seconds.
3. The automatic speech recognition method based on a deep neural network of claim 1, wherein extracting the Log Fbank acoustic features of the original speech data requires the following steps:
pre-emphasizing the original speech data with a high-pass filter;
framing the pre-emphasized data with a framing function;
windowing each frame by substituting it into a window function;
performing a fast Fourier transform on each windowed frame signal to obtain the energy spectrum of each frame;
taking the dot product of the energy spectrum with a Mel filter bank to obtain a Mel spectrogram;
applying a logarithmic transform to the Mel spectrogram;
and performing a discrete cosine transform on the log-transformed Mel spectrogram.
4. The automatic speech recognition method based on a deep neural network of claim 1, wherein the acoustic model is a neural network acoustic model combining VGG and Bi-GRU and comprises VGG layers, Dense layers and Bi-GRU layers; the acoustic model obtains the original prediction data from the Log Fbank acoustic features through the following steps:
the Log Fbank acoustic features are input into a VGG layer and the output is sent to the next layer for processing; the acoustic model contains 8 groups of VGG layers in total, connected in series end to end, so the VGG computation is performed 8 times to obtain the final VGG layer output data;
inputting the VGG layer output data into a Dense layer for feature smoothing to obtain the feature-smoothed output;
feeding the feature-smoothed output into a Bi-GRU layer for calculation to obtain the first Bi-GRU layer output;
feeding the first Bi-GRU layer output into a Bi-GRU layer again to obtain the second Bi-GRU layer output;
inputting the second Bi-GRU layer output into a Dense layer for feature smoothing to obtain the second feature-smoothed output;
and inputting the second feature-smoothed output into a Dense layer again to obtain the acoustic model output data.
5. The automatic speech recognition method based on a deep neural network of claim 4, wherein each VGG layer is formed by a first CNN layer, a second CNN layer and a Max_pooling layer connected in series, the first and second CNN layers perform data convolution, the Max_pooling layer performs data pooling, the convolution kernel of the first CNN layer is 5 × 5, and the convolution kernel of the second CNN layer is 3 × 3.
6. The method of claim 3, wherein the CTC decoding is used for CTC processing of the acoustic model output data to combine the same results that appear consecutively and remove redundant results.
7. The automatic speech recognition method based on a deep neural network of claim 1, wherein the language model is a hidden Markov language model, which takes the pinyin sequence as its input and obtains the corresponding character recognition result; the language model performs pinyin-to-text conversion through the following steps:
s1, taking the pinyin sequence as input and, through an initial-and-final segmentation method, obtaining a pinyin sequence whose basic segmentation unit is the pinyin group;
s2, mapping each pinyin group through a pinyin-character dictionary to obtain the corresponding character sequence, where the character sequence stores the different Chinese characters corresponding to the same pinyin group;
s3, setting the initial probability value of every Chinese character appearing in the character sequence of each pinyin group to 1;
s4, arranging and combining all Chinese characters in the character sequences of adjacent pinyin groups into two-character phrases, and storing them as a screening sequence;
s5, constructing a two-character frequency dictionary, in which the occurrence frequency values of commonly used two-character phrases, commonly used domain-specific two-character phrases and other two-character phrases are stored;
s6, looking up each combined two-character phrase of the screening sequence in the two-character frequency dictionary; if the phrase exists it is kept, otherwise it is deleted; this yields the final state transition sequence;
s7, constructing a single-character frequency dictionary, in which the frequency values of commonly used characters, domain-specific characters and other characters are stored;
s8, calculating the state transition probability of each two-character phrase in the state transition sequence, where the transition probability formula is:
P = P0 · P2(A·B) / P1(A)
where A and B respectively denote the first and last character of the two-character phrase, P0 denotes the initial probability value, P2(A·B) denotes the occurrence frequency value of the two-character phrase, and P1(A) denotes the frequency value of the first character appearing as a single character;
s9, comparing the state transition probability of each two-character phrase with a transition threshold; if it is higher than the threshold, the current two-character phrase is taken as the updated output result and the current state transition probability value is stored;
and s10, repeating steps S1 to S9 until all transition probability values and corresponding output results are obtained, and arranging the output results in order as the final language recognition result.
8. The automatic speech recognition method based on a deep neural network of claim 4, wherein the Bi-GRU layer, i.e. the bidirectional GRU neural network model, comprises a forward GRU unit and a backward GRU unit; the input data enter the forward GRU unit and the backward GRU unit respectively for calculation, and the outputs of the two are concatenated or summed as the output of the Bi-GRU layer.
9. The automatic speech recognition method based on a deep neural network of claim 1, wherein the automatic speech recognition method is used for automatic speech recognition of the national language and/or foreign languages, and the pinyin-character dictionary is a dictionary that maps the pronunciations of the corresponding recognized language to its characters.
10. The automatic speech recognition method based on a deep neural network of claim 1, wherein the Hamming window function is:
w(n) = a0 - (1 - a0) · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n denotes the sample index of the truncated signal, a0 denotes the Hamming window constant with a value of 25/46, and N - 1 denotes the truncation window length of the Hamming window;
the Mel filter function of the Mel filter is:
Mel(f) = 2595 · log10(1 + f / 700)
where f denotes the frequency of the filtered signal.
CN202110599305.9A 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network Active CN113327585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110599305.9A CN113327585B (en) 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110599305.9A CN113327585B (en) 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network

Publications (2)

Publication Number Publication Date
CN113327585A true CN113327585A (en) 2021-08-31
CN113327585B CN113327585B (en) 2023-05-12

Family

ID=77422581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110599305.9A Active CN113327585B (en) 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network

Country Status (1)

Country Link
CN (1) CN113327585B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744722A (en) * 2021-09-13 2021-12-03 上海交通大学宁波人工智能研究院 Off-line speech recognition matching device and method for limited sentence library
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
US20190057683A1 (en) * 2017-08-18 2019-02-21 Google Llc Encoder-decoder models for sequence to sequence mapping
US10229672B1 (en) * 2015-12-31 2019-03-12 Google Llc Training acoustic models using connectionist temporal classification
US20200027444A1 (en) * 2018-07-20 2020-01-23 Google Llc Speech recognition with sequence-to-sequence models
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
US20200335082A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Code-switching speech recognition with end-to-end connectionist temporal classification model
CN111899727A (en) * 2020-07-15 2020-11-06 苏州思必驰信息科技有限公司 Training method and system for voice recognition model of multiple speakers

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229672B1 (en) * 2015-12-31 2019-03-12 Google Llc Training acoustic models using connectionist temporal classification
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
US20190057683A1 (en) * 2017-08-18 2019-02-21 Google Llc Encoder-decoder models for sequence to sequence mapping
US20200027444A1 (en) * 2018-07-20 2020-01-23 Google Llc Speech recognition with sequence-to-sequence models
US20200335082A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Code-switching speech recognition with end-to-end connectionist temporal classification model
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111899727A (en) * 2020-07-15 2020-11-06 苏州思必驰信息科技有限公司 Training method and system for voice recognition model of multiple speakers

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
WEIZHE WANG等: "End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder", 《2020 IEEE 3RD INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND SIGNAL PROCESSING (ICICSP)》 *
ZHIHAO DU等: "Investigation of Monaural Front-End Processing for Robust Speech Recognition Without Retraining or Joint-Training", 《2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)》 *
刘柏基: "Research on the application of end-to-end speech recognition based on the attention mechanism", China Master's Theses Full-text Database, Information Science and Technology
卢云聪: "Construction of and experiments on a CNN-based acoustic model", China Master's Theses Full-text Database, Information Science and Technology
杜志浩 et al.: "Single-channel speech enhancement method based on auditory-masking generative adversarial networks", Intelligent Computer and Applications
潘粤成 et al.: "An end-to-end Mandarin speech recognition method based on CNN/CTC", Modern Information Technology

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744722A (en) * 2021-09-13 2021-12-03 上海交通大学宁波人工智能研究院 Off-line speech recognition matching device and method for limited sentence library
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence
CN116580706B (en) * 2023-07-14 2023-09-22 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence

Also Published As

Publication number Publication date
CN113327585B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN110534089B (en) Chinese speech synthesis method based on phoneme and prosodic structure
WO2022083083A1 (en) Sound conversion system and training method for same
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN111798840B (en) Voice keyword recognition method and device
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113327585B (en) Automatic voice recognition method based on deep neural network
CN111063336A (en) End-to-end voice recognition system based on deep learning
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN113160798A (en) Chinese civil aviation air traffic control voice recognition method and system
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN112489651A (en) Voice recognition method, electronic device and storage device
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
Diwan et al. Reduce and reconstruct: ASR for low-resource phonetic languages
Alrumiah et al. A Deep Diacritics-Based Recognition Model for Arabic Speech: Quranic Verses as Case Study
CN111128191B (en) Online end-to-end voice transcription method and system
CN113903349A (en) Training method of noise reduction model, noise reduction method, device and storage medium
Iswarya et al. Speech query recognition for Tamil language using wavelet and wavelet packets
Youa et al. Research on dialect speech recognition based on DenseNet-CTC
CN114743545B (en) Dialect type prediction model training method and device and storage medium
CN112151008B (en) Voice synthesis method, system and computer equipment
Yue et al. An Improved Speech Recognition System Based on Transformer Language Model
Vijaya et al. An Efficient System for Audio-Based Sign Language Translator Through MFCC Feature Extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant