CN113327585A - Automatic voice recognition method based on deep neural network - Google Patents

Automatic voice recognition method based on deep neural network

Info

Publication number
CN113327585A
Authority
CN
China
Prior art keywords
layer
output
gru
pinyin
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110599305.9A
Other languages
Chinese (zh)
Other versions
CN113327585B (en)
Inventor
王蒙
付志勇
胡奎
姜黎
潘艾婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ccvui Intelligent Technology Co ltd
Original Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Ccvui Intelligent Technology Co ltd filed Critical Hangzhou Ccvui Intelligent Technology Co ltd
Priority to CN202110599305.9A priority Critical patent/CN113327585B/en
Publication of CN113327585A publication Critical patent/CN113327585A/en
Application granted granted Critical
Publication of CN113327585B publication Critical patent/CN113327585B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/005 Language recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides an automatic speech recognition method based on a deep neural network and relates to the field of automatic speech recognition. The method uses Log Fbank as the acoustic feature: a 40-dimensional feature is sufficient to express the characteristics of the acoustic signal, which greatly simplifies the signal processing pipeline and the size of the neural network model. A neural network model combining VGG with a bidirectional GRU is adopted, and the GRU can fully exploit the information of the preceding and following speech frames, yielding the best recognition effect. CTC decoding merges consecutive identical results and removes redundant results, and the recognized pinyin sequence is obtained through a predefined pinyin-list mapping without alignment, so manual alignment is avoided. A hidden Markov language model takes the pinyin sequence as input and obtains the corresponding character recognition result; it can give the best recognition result among the different characters corresponding to the same pinyin, which greatly improves the accuracy of automatic speech recognition.

Description

Automatic voice recognition method based on deep neural network
Technical Field
The invention relates to the field of automatic voice recognition, in particular to an automatic voice recognition method based on a deep neural network.
Background
With the continuous development of human-computer interaction technology, interaction modes have become diverse: where interaction once relied on text input, it can now be carried out conveniently and rapidly through speech. Automatic speech recognition is an extremely important part of this continually improving interaction technology.
Automatic speech recognition (ASR) technology converts the natural-language content of collected human speech into computer-readable input, and the accuracy and speed of this conversion directly determine the effectiveness and practicality of human-computer interaction. How to improve the accuracy and speed of automatic speech recognition has therefore become a widely discussed problem in the field of human-computer interaction.
To this end, the invention application CN201811112506.6 proposes a speech recognition method based on a convolutional neural network, which comprises: preprocessing the input original speech signal; extracting key feature parameters that reflect the characteristics of the speech signal to form a feature-vector sequence; constructing an end-to-end acoustic model based on a DCNN network with connectionist temporal classification (CTC) as the loss function; training the acoustic model to obtain a trained acoustic model; and inputting the feature-vector sequence to be recognized into the trained acoustic model to obtain a recognition result, which is then passed through a language model to obtain the finally recognized text.
That method has a simple modeling process and is easy to train, but the adopted acoustic features have too many dimensions and contain a great deal of redundant information, so the constructed neural network model is too large. Moreover, the DCNN model is dated: its capacity to learn acoustic features is insufficient, and it cannot fully exploit the correlation between preceding and following speech frames.
Another invention application, CN202010019733.5, provides an automatic speech recognition method and system based on artificial intelligence. It uses a speech training and recognition module to learn speech features and the character codes corresponding to the speech: a feature learning layer first performs convolutional learning on the spectral features, a semantic learning layer then learns the semantic information among those features, and an output layer finally decodes the jointly learned information to output the corresponding text. Labels are encoded and decoded directly with a Chinese-character mapping table, so the text does not need to be encoded into phonemes and decoded back into text, which simplifies the training process.
However, MFCC features carry redundant information such as voiceprint characteristics that does not help the recognition task. The acoustic model adopts a CRNN structure whose convolution kernels and strides are too large, so the extracted features are coarse; the deep and wide recurrent network that follows easily leads to gradient explosion or overfitting during training.
A further invention application, CN201811538408.9, provides a speech recognition training system and method that preprocesses the input speech, extracts speech-signal features with a CNN, recognizes the features with an RNN, and fits the model with a homophone loss function and an approximate loss function to achieve speech recognition.
That application improves the accuracy and speed of the system by providing multiple loss functions to cope with different situations arising from commonly recognized errors. However, with a CRNN as the acoustic model, the CNN's ability to learn features is weaker than VGG's and the RNN is difficult to train; fitting with the homophone loss function and the approximate loss function also requires aligning the speech data, which entails a huge workload.
Therefore, a new method and system that provide better speech recognition are needed to solve the above technical problems.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an automatic speech recognition method based on a deep neural network, which comprises the following steps:
sampling an original voice signal through audio acquisition equipment, and obtaining original voice data;
extracting Log Fbank acoustic characteristics of original voice data;
constructing an acoustic model;
inputting the Log Fbank acoustic characteristics into an acoustic model to obtain acoustic model output data;
performing CTC decoding on the acoustic model output data to obtain decoded data;
mapping the decoded data through a preset pinyin list to obtain a pinyin sequence;
and inputting the pinyin sequence into a language model for language recognition, and obtaining the language recognition result.
In particular:
As a further solution, the audio acquisition device samples the original speech signal at a sampling rate of 16000 Hz, the original speech data are stored as 16-bit integers, and the duration of each piece of original speech data is not more than 4 seconds.
As a further solution, extracting the Log Fbank acoustic features of the original speech data requires the following steps:
pre-emphasizing the original speech data with a high-pass filter;
framing the pre-emphasized data with a framing function;
windowing each frame by substituting it into a window function;
performing a fast Fourier transform on each windowed frame signal to obtain the energy spectrum of each frame;
taking the dot product of the energy spectrum with a Mel filter bank to obtain a Mel spectrogram;
applying a logarithmic transform to the Mel spectrogram;
and performing a discrete cosine transform on the log-transformed Mel spectrogram.
As a further solution, the acoustic model is a neural network acoustic model combining VGG and Bi-GRU, and it comprises VGG layers, Dense layers and Bi-GRU layers; the acoustic model obtains the original prediction data from the Log Fbank acoustic features through the following steps:
the Log Fbank acoustic features are input into a VGG layer and the output is sent to the next layer for processing; the acoustic model contains 8 groups of VGG layers in total, connected in series end to end, so the VGG computation is performed 8 times to obtain the final VGG layer output data;
inputting the VGG layer output data into a Dense layer for feature smoothing to obtain the feature-smoothed output;
feeding the feature-smoothed output into a Bi-GRU layer for calculation to obtain the first Bi-GRU layer output;
feeding the first Bi-GRU layer output into a Bi-GRU layer again to obtain the second Bi-GRU layer output;
inputting the second Bi-GRU layer output into a Dense layer for feature smoothing to obtain the second feature-smoothed output;
and inputting the second feature-smoothed output into a Dense layer again to obtain the acoustic model output data.
As a further solution, each VGG layer is formed by a first CNN layer, a second CNN layer and a Max_pooling layer connected in series, where the first and second CNN layers perform data convolution and the Max_pooling layer performs data pooling; the convolution kernel of the first CNN layer is 5 × 5 and that of the second CNN layer is 3 × 3.
As a further solution, CTC decoding processes the acoustic model output data so that identical results occurring consecutively are merged and redundant results are removed.
As a further solution, the language model is a hidden Markov language model, which takes the pinyin sequence as its input and obtains the corresponding character recognition result; the language model performs pinyin-to-text conversion through the following steps:
s1, taking the pinyin sequence as input and, through an initial-and-final segmentation method, obtaining a pinyin sequence whose basic segmentation unit is the pinyin group;
s2, mapping each pinyin group through a pinyin-character dictionary to obtain the corresponding character sequence, where the character sequence stores the different Chinese characters corresponding to the same pinyin group;
s3, setting the initial probability value of every Chinese character appearing in the character sequence of each pinyin group to 1;
s4, arranging and combining all Chinese characters in the character sequences of adjacent pinyin groups into two-character phrases, and storing them as a screening sequence;
s5, constructing a two-character frequency dictionary, in which the occurrence frequency values of commonly used two-character phrases, commonly used domain-specific two-character phrases and other two-character phrases are stored;
s6, looking up each combined two-character phrase of the screening sequence in the two-character frequency dictionary; if the phrase exists it is kept, otherwise it is deleted; this yields the final state transition sequence;
s7, constructing a single-character frequency dictionary, in which the frequency values of commonly used characters, domain-specific characters and other characters are stored;
s8, calculating the state transition probability of each two-character phrase in the state transition sequence, where the transition probability formula is:
P = P0 · P2(A·B) / P1(A)
where A and B respectively denote the first and last character of the two-character phrase, P0 denotes the initial probability value, P2(A·B) denotes the occurrence frequency value of the two-character phrase, and P1(A) denotes the frequency value of the first character appearing as a single character;
s9, comparing the state transition probability of each two-character phrase with a transition threshold; if it is higher than the threshold, the current two-character phrase is taken as the updated output result and the current state transition probability value is stored;
and s10, repeating steps S1 to S9 until all transition probability values and corresponding output results are obtained, and arranging the output results in order as the final language recognition result.
As a further solution, the Bi-GRU layer, i.e. the bidirectional GRU neural network model, comprises a forward GRU unit and a backward GRU unit; the input data enter the forward GRU unit and the backward GRU unit respectively for calculation, and the outputs of the two are concatenated or summed as the output of the Bi-GRU layer.
As a further solution, the automatic speech recognition method can be used for automatic speech recognition of the national language and/or foreign languages, and the pinyin-character dictionary is then a dictionary that maps the pronunciations of the recognized language to its characters.
As a further solution, the Hamming window function is:
w(n) = a0 - (1 - a0) · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n denotes the sample index of the truncated signal, a0 denotes the Hamming window constant with a value of 25/46, and N - 1 denotes the truncation window length of the Hamming window;
the Mel filter function of the Mel filter is:
Mel(f) = 2595 · log10(1 + f / 700)
where f denotes the frequency of the filtered signal.
Compared with the related art, the automatic speech recognition method based on a deep neural network provided by the invention has the following beneficial effects:
1. The invention uses Log Fbank as the acoustic feature; a 40-dimensional feature is sufficient to express the characteristics of the acoustic signal, which greatly simplifies the signal processing pipeline and the size of the neural network model and helps considerably in reducing computation and storage. A neural network model combining VGG with a bidirectional GRU is adopted, and the GRU can fully exploit the information of the preceding and following speech frames, yielding the best recognition effect.
2. The invention merges consecutive identical results through CTC decoding and removes redundant results; the recognized pinyin sequence is obtained through a predefined pinyin-list mapping without alignment, so manual alignment is avoided.
3. The invention adopts a hidden Markov language model that takes the pinyin sequence as input and obtains the corresponding character recognition result; it can give the best recognition result among the different characters corresponding to the same pinyin, which greatly improves the accuracy of automatic speech recognition.
Drawings
FIG. 1 is a system flow diagram illustrating an automatic speech recognition method based on deep neural network according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of an acoustic model according to an embodiment of the present invention;
FIG. 3 is a diagram of a Bi-GRU layer according to a preferred embodiment of the method for automatic speech recognition based on deep neural network of the present invention.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
As shown in fig. 1 to 3, the automatic speech recognition method based on the deep neural network of the present invention performs automatic speech recognition by the following steps:
sampling an original voice signal through audio acquisition equipment, and obtaining original voice data;
extracting Log Fbank acoustic characteristics of original voice data;
constructing an acoustic model;
inputting the Log Fbank acoustic characteristics into an acoustic model to obtain acoustic model output data;
performing CTC decoding on the acoustic model output data to obtain decoded data;
mapping the decoded data through a preset pinyin list to obtain a pinyin sequence;
and inputting the pinyin sequence into a language model for language recognition, and obtaining the language recognition result.
As a further solution, the audio acquisition device samples the original speech signal at a sampling rate of 16000 Hz, the original speech data are stored as 16-bit integers, and the duration of each piece of original speech data is not more than 4 seconds.
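As an illustration of this data format, the following is a minimal sketch that loads one such piece of speech data and checks the stated constraints; it assumes the acquisition device has already written a mono 16 kHz, 16-bit WAV file, and the file name in the usage line is hypothetical.

```python
import wave

import numpy as np

def load_raw_speech(path, expected_rate=16000, max_seconds=4.0):
    """Load one piece of raw speech data and check the format described above.

    Assumes a mono WAV file sampled at 16000 Hz with 16-bit integer samples.
    """
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1, "expected mono audio"
        assert wav.getframerate() == expected_rate, "expected a 16000 Hz sampling rate"
        assert wav.getsampwidth() == 2, "expected 16-bit integer samples"
        n_frames = wav.getnframes()
        assert n_frames / expected_rate <= max_seconds, "each piece should be at most 4 seconds"
        pcm = wav.readframes(n_frames)
    return np.frombuffer(pcm, dtype=np.int16)

# samples = load_raw_speech("example_utterance.wav")  # hypothetical file name
```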
As a further solution, extracting the Log Fbank acoustic features of the original speech data requires the following steps:
pre-emphasizing the original speech data with a high-pass filter;
framing the pre-emphasized data with a framing function;
windowing each frame by substituting it into a window function;
performing a fast Fourier transform on each windowed frame signal to obtain the energy spectrum of each frame;
taking the dot product of the energy spectrum with a Mel filter bank to obtain a Mel spectrogram;
applying a logarithmic transform to the Mel spectrogram;
and performing a discrete cosine transform on the log-transformed Mel spectrogram.
Specifically, Log Fbank is adopted as the acoustic feature, and a 40-dimensional feature is sufficient to express the characteristics of the acoustic signal, which greatly simplifies the signal processing pipeline and the size of the neural network model and helps considerably in reducing computation and storage. A neural network model combining VGG with a bidirectional GRU is adopted: VGG is among the convolutional neural network structures with the strongest feature-learning capability at present, and the GRU can fully exploit the information of the preceding and following speech frames, yielding the best recognition effect.
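The feature-extraction steps above can be sketched as follows. This is only an illustrative implementation under stated assumptions: the frame length (25 ms), frame step (10 ms), FFT size and pre-emphasis coefficient are not given in the patent, and the final DCT step listed above is kept only as a comment, since the 40-dimensional Log Fbank feature itself is the log Mel spectrogram.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters spanning 0 Hz to the Nyquist frequency."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_fbank(samples, sample_rate=16000, n_filters=40,
              frame_len=400, frame_step=160, n_fft=512, pre_emph=0.97):
    # 1. pre-emphasis (a simple first-order high-pass filter)
    x = np.append(samples[0], samples[1:] - pre_emph * samples[:-1]).astype(np.float64)
    # 2. framing (25 ms frames with a 10 ms step at 16 kHz; sizes are illustrative)
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // frame_step
    frames = np.stack([x[i * frame_step: i * frame_step + frame_len] for i in range(n_frames)])
    # 3. windowing each frame with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 4. fast Fourier transform -> per-frame energy (power) spectrum
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # 5. dot product with the Mel filter bank -> Mel spectrogram
    mel_spec = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 6. logarithmic transform -> the 40-dimensional Log Fbank feature per frame
    log_mel = np.log(mel_spec + 1e-10)
    # 7. the description also lists a DCT over the log Mel spectrogram, e.g.
    #    from scipy.fftpack import dct; cepstra = dct(log_mel, type=2, axis=1, norm="ortho")
    return log_mel
```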
As a further solution, the acoustic model is a neural network acoustic model combining VGG and Bi-GRU, and it comprises VGG layers, Dense layers and Bi-GRU layers; the acoustic model obtains the original prediction data from the Log Fbank acoustic features through the following steps:
the Log Fbank acoustic features are input into a VGG layer and the output is sent to the next layer for processing; the acoustic model contains 8 groups of VGG layers in total, connected in series end to end, so the VGG computation is performed 8 times to obtain the final VGG layer output data;
inputting the VGG layer output data into a Dense layer for feature smoothing to obtain the feature-smoothed output;
feeding the feature-smoothed output into a Bi-GRU layer for calculation to obtain the first Bi-GRU layer output;
feeding the first Bi-GRU layer output into a Bi-GRU layer again to obtain the second Bi-GRU layer output;
inputting the second Bi-GRU layer output into a Dense layer for feature smoothing to obtain the second feature-smoothed output;
and inputting the second feature-smoothed output into a Dense layer again to obtain the acoustic model output data.
As a further solution, each VGG layer is formed by a first CNN layer, a second CNN layer and a Max_pooling layer connected in series, where the first and second CNN layers perform data convolution and the Max_pooling layer performs data pooling; the convolution kernel of the first CNN layer is 5 × 5 and that of the second CNN layer is 3 × 3.
Specifically, the VGG model, with its deeper layers and wider feature maps, is a preferred structure for extracting acoustic features, and adopting Log Fbank as the acoustic feature largely removes the differences between speakers. A neural network model combining VGG with a bidirectional GRU is adopted: VGG is among the convolutional networks with the strongest feature-learning capability at present, and the GRU, a variant of the recurrent neural network, is far easier to train than an ordinary recurrent network.
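One possible realization of this layer pattern in Keras is sketched below. The filter counts, GRU width, pinyin vocabulary size and pooling schedule are assumptions: the description fixes only the pattern of 8 VGG groups (a 5 × 5 CNN, a 3 × 3 CNN and a Max_pooling layer), a Dense layer, two Bi-GRU layers and two further Dense layers, so pooling is applied here only in the first three groups to keep the time axis long enough for CTC decoding.

```python
from tensorflow.keras import layers, models

def build_acoustic_model(n_mels=40, n_pinyin_classes=1423, gru_units=256):
    """Sketch of the VGG + Bi-GRU acoustic model described above.

    Filter counts, GRU width, the pinyin vocabulary size and the pooling
    schedule are assumptions; only the layer pattern follows the description.
    """
    inputs = layers.Input(shape=(None, n_mels, 1), name="log_fbank")  # (time, mel, channel)
    x = inputs
    for i in range(8):                                    # 8 VGG groups connected in series
        filters = 32 * min(2 ** i, 4)                     # assumed filter counts
        x = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)  # first CNN, 5x5
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)  # second CNN, 3x3
        if i < 3:                                         # assumption: pool only in the first three
            x = layers.MaxPooling2D(pool_size=(2, 2))(x)  # groups so the time axis stays long
    x = layers.TimeDistributed(layers.Flatten())(x)       # keep time, flatten mel and channel axes
    x = layers.Dense(gru_units, activation="relu")(x)     # first Dense layer (feature smoothing)
    x = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(x)  # first Bi-GRU layer
    x = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(x)  # second Bi-GRU layer
    x = layers.Dense(gru_units, activation="relu")(x)     # second Dense layer (feature smoothing)
    outputs = layers.Dense(n_pinyin_classes + 1, activation="softmax")(x)  # +1 for the CTC blank
    return models.Model(inputs, outputs)

model = build_acoustic_model()
model.summary()
```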
As a further solution, CTC decoding processes the acoustic model output data so that identical results occurring consecutively are merged and redundant results are removed.
Specifically, CTC decoding merges consecutive identical results, removes the redundant results, and obtains the recognized pinyin sequence through the mapping of a predefined pinyin list. For example, if the recognized result is "ABBBB" and the target output length is 4, then according to the CTC coding requirement A is preserved and the repeated B's are merged, giving "ABBB". With the CTC loss function no alignment is needed, so manual alignment is avoided.
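A sketch of the standard greedy (best-path) CTC collapse is shown below: consecutive identical labels are merged and the blank label is dropped, after which the remaining indices are mapped through the pinyin list. The pinyin list and per-frame outputs here are hypothetical.

```python
def ctc_greedy_decode(frame_label_ids, blank_id=0):
    """Merge consecutive identical outputs and drop blanks (redundant results).

    `frame_label_ids` is the per-frame argmax of the acoustic model output.
    """
    decoded = []
    previous = None
    for label in frame_label_ids:
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical pinyin list used for the mapping (index 0 is reserved for the CTC blank).
PINYIN_LIST = ["<blank>", "ni3", "hao3", "ma5"]

frame_outputs = [1, 1, 0, 2, 2, 2, 0, 0, 3]   # e.g. the argmax over each output frame
pinyin_sequence = [PINYIN_LIST[i] for i in ctc_greedy_decode(frame_outputs)]
print(pinyin_sequence)                         # ['ni3', 'hao3', 'ma5']
```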
As a further solution, the language model is a hidden Markov language model, which takes the pinyin sequence as its input and obtains the corresponding character recognition result; the language model performs pinyin-to-text conversion through the following steps:
s1, taking the pinyin sequence as input and, through an initial-and-final segmentation method, obtaining a pinyin sequence whose basic segmentation unit is the pinyin group;
s2, mapping each pinyin group through a pinyin-character dictionary to obtain the corresponding character sequence, where the character sequence stores the different Chinese characters corresponding to the same pinyin group;
s3, setting the initial probability value of every Chinese character appearing in the character sequence of each pinyin group to 1;
s4, arranging and combining all Chinese characters in the character sequences of adjacent pinyin groups into two-character phrases, and storing them as a screening sequence;
s5, constructing a two-character frequency dictionary, in which the occurrence frequency values of commonly used two-character phrases, commonly used domain-specific two-character phrases and other two-character phrases are stored;
s6, looking up each combined two-character phrase of the screening sequence in the two-character frequency dictionary; if the phrase exists it is kept, otherwise it is deleted; this yields the final state transition sequence;
s7, constructing a single-character frequency dictionary, in which the frequency values of commonly used characters, domain-specific characters and other characters are stored;
s8, calculating the state transition probability of each two-character phrase in the state transition sequence, where the transition probability formula is:
P = P0 · P2(A·B) / P1(A)
where A and B respectively denote the first and last character of the two-character phrase, P0 denotes the initial probability value, P2(A·B) denotes the occurrence frequency value of the two-character phrase, and P1(A) denotes the frequency value of the first character appearing as a single character;
s9, comparing the state transition probability of each two-character phrase with a transition threshold; if it is higher than the threshold, the current two-character phrase is taken as the updated output result and the current state transition probability value is stored;
and s10, repeating steps S1 to S9 until all transition probability values and corresponding output results are obtained, and arranging the output results in order as the final language recognition result.
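A much simplified sketch of the two-character-phrase scoring in steps S2 to S9 is given below. The dictionaries are tiny hypothetical stand-ins for the frequency dictionaries of steps S5 and S7, and only the best character pair for one pair of adjacent pinyin groups is returned, rather than the full sequence decoding of step S10.

```python
# Hypothetical dictionaries; in practice they would be built from general-purpose
# and domain-specific corpora as described in steps S5 and S7.
PINYIN_TO_CHARS = {"zhong1": ["中", "钟", "忠"], "guo2": ["国", "果"]}
TWO_CHAR_FREQ = {("中", "国"): 5000.0, ("中", "果"): 1.0}    # P2(A·B)
SINGLE_CHAR_FREQ = {"中": 8000.0, "钟": 300.0, "忠": 200.0}  # P1(A)

def best_transition(prev_pinyin, next_pinyin, p0=1.0, threshold=0.0):
    """Score every two-character combination of adjacent pinyin groups with
    P = P0 * P2(A·B) / P1(A) and keep the best pair above the threshold."""
    best_pair, best_score = None, threshold
    for a in PINYIN_TO_CHARS.get(prev_pinyin, []):
        for b in PINYIN_TO_CHARS.get(next_pinyin, []):
            freq_ab = TWO_CHAR_FREQ.get((a, b))
            if freq_ab is None:                 # step S6: unseen two-character phrases are deleted
                continue
            score = p0 * freq_ab / SINGLE_CHAR_FREQ.get(a, 1.0)
            if score > best_score:              # step S9: keep the candidate above the threshold
                best_pair, best_score = (a, b), score
    return best_pair, best_score

print(best_transition("zhong1", "guo2"))        # (('中', '国'), 0.625)
```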
As a further solution, the Bi-GRU layer, i.e. the bidirectional GRU neural network model, comprises a forward GRU unit and a backward GRU unit; the input data enter the forward GRU unit and the backward GRU unit respectively for calculation, and the outputs of the two are concatenated or summed as the output of the Bi-GRU layer.
Specifically, the Bi-GRU is a bidirectional GRU neural network model: the input is processed once in the forward direction by a GRU and once in the reverse direction (the input sequence is reversed and passed through a GRU), and the outputs of the two passes are concatenated (or summed); the model is shown in FIG. 3.
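The bidirectional arrangement can be made explicit with a small NumPy sketch: one GRU pass runs over the input in the forward direction, a second pass runs over the reversed input and is re-reversed to align its time steps, and the two outputs are spliced or summed. The gate equations follow the standard GRU formulation; the weights and sizes here are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_sequence(x, p):
    """Run one GRU direction over x of shape (time, input_dim); returns (time, units)."""
    h = np.zeros(p["U_z"].shape[0])
    outputs = []
    for x_t in x:
        z = sigmoid(x_t @ p["W_z"] + h @ p["U_z"])              # update gate
        r = sigmoid(x_t @ p["W_r"] + h @ p["U_r"])              # reset gate
        h_tilde = np.tanh(x_t @ p["W_h"] + (r * h) @ p["U_h"])  # candidate state
        h = z * h + (1.0 - z) * h_tilde                         # new hidden state
        outputs.append(h)
    return np.stack(outputs)

def bi_gru(x, forward_params, backward_params, merge="concat"):
    """Forward pass over x, backward pass over the reversed x (re-reversed to align
    time steps), then splice (concatenate) or sum the two outputs."""
    forward = gru_sequence(x, forward_params)
    backward = gru_sequence(x[::-1], backward_params)[::-1]
    if merge == "concat":
        return np.concatenate([forward, backward], axis=-1)
    return forward + backward

def random_params(input_dim, units, rng):
    return {k: rng.standard_normal((input_dim if k.startswith("W") else units, units)) * 0.1
            for k in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))                 # 10 time steps, 8 input features
y = bi_gru(x, random_params(8, 16, rng), random_params(8, 16, rng))
print(y.shape)                                   # (10, 32): forward and backward outputs spliced
```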
As a further solution, the automatic speech recognition method can be used for automatic speech recognition of the national language and/or foreign languages, and the pinyin-character dictionary is then a dictionary that maps the pronunciations of the recognized language to its characters.
As a further solution, the Hamming window function is:
w(n) = a0 - (1 - a0) · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n denotes the sample index of the truncated signal, a0 denotes the Hamming window constant with a value of 25/46, and N - 1 denotes the truncation window length of the Hamming window.
It should be noted that directly applying a rectangular window to the signal causes spectral leakage due to truncation. To mitigate this, this embodiment windows the signal with a Hamming window function: owing to the amplitude-frequency characteristic of the Hamming window, its side-lobe attenuation is large (the first side-lobe peak is about 43 dB below the main-lobe peak), so the spectral leakage is improved.
The Mel filter function of the Mel filter is:
Mel(f) = 2595 · log10(1 + f / 700)
where f denotes the frequency of the filtered signal.
It should be noted that the loudness perceived by the human ear is not linearly proportional to the frequency of the sound; the Mel frequency scale better matches the auditory characteristics of the human ear, and the Mel filter bank is set up to better match human hearing.
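The nonlinearity of the scale can be illustrated with a short computation: equal steps on the Mel axis correspond to narrow frequency steps at low frequencies and wide steps at high frequencies, so a 40-filter bank gives finer resolution where hearing is more sensitive. The helper names follow the earlier sketch and the 16 kHz sampling rate used above.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# At a 16000 Hz sampling rate the Nyquist frequency is 8000 Hz, i.e. about 2840 mel,
# so the centers of 40 triangular filters spaced uniformly on the Mel axis are:
centers_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 40 + 2)[1:-1]
centers_hz = mel_to_hz(centers_mel)
print(np.round(centers_hz[:3]))    # low filters lie close together (fine low-frequency resolution)
print(np.round(centers_hz[-3:]))   # high filters lie far apart (coarse high-frequency resolution)
```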
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An automatic speech recognition method based on a deep neural network is characterized in that automatic speech recognition is carried out through the following steps:
sampling an original voice signal through audio acquisition equipment, and obtaining original voice data;
extracting Log Fbank acoustic characteristics of original voice data;
constructing an acoustic model;
inputting the Log Fbank acoustic characteristics into an acoustic model to obtain acoustic model output data;
performing CTC decoding on the acoustic model output data to obtain decoded data;
mapping the decoded data through a preset pinyin list to obtain a pinyin sequence;
and inputting the pinyin sequence into a language model for language recognition, and obtaining the language recognition result.
2. The automatic speech recognition method based on a deep neural network of claim 1, wherein the audio acquisition device samples the original speech signal at a sampling rate of 16000 Hz, the original speech data are stored as 16-bit integers, and the duration of each piece of original speech data is not more than 4 seconds.
3. The automatic speech recognition method based on a deep neural network of claim 1, wherein extracting the Log Fbank acoustic features of the original speech data requires the following steps:
pre-emphasizing the original speech data with a high-pass filter;
framing the pre-emphasized data with a framing function;
windowing each frame by substituting it into a window function;
performing a fast Fourier transform on each windowed frame signal to obtain the energy spectrum of each frame;
taking the dot product of the energy spectrum with a Mel filter bank to obtain a Mel spectrogram;
applying a logarithmic transform to the Mel spectrogram;
and performing a discrete cosine transform on the log-transformed Mel spectrogram.
4. The automatic speech recognition method based on a deep neural network of claim 1, wherein the acoustic model is a neural network acoustic model combining VGG and Bi-GRU and comprises VGG layers, Dense layers and Bi-GRU layers; the acoustic model obtains the original prediction data from the Log Fbank acoustic features through the following steps:
the Log Fbank acoustic features are input into a VGG layer and the output is sent to the next layer for processing; the acoustic model contains 8 groups of VGG layers in total, connected in series end to end, so the VGG computation is performed 8 times to obtain the final VGG layer output data;
inputting the VGG layer output data into a Dense layer for feature smoothing to obtain the feature-smoothed output;
feeding the feature-smoothed output into a Bi-GRU layer for calculation to obtain the first Bi-GRU layer output;
feeding the first Bi-GRU layer output into a Bi-GRU layer again to obtain the second Bi-GRU layer output;
inputting the second Bi-GRU layer output into a Dense layer for feature smoothing to obtain the second feature-smoothed output;
and inputting the second feature-smoothed output into a Dense layer again to obtain the acoustic model output data.
5. The automatic speech recognition method based on a deep neural network of claim 4, wherein each VGG layer is formed by a first CNN layer, a second CNN layer and a Max_pooling layer connected in series, the first and second CNN layers perform data convolution, the Max_pooling layer performs data pooling, the convolution kernel of the first CNN layer is 5 × 5, and the convolution kernel of the second CNN layer is 3 × 3.
6. The method of claim 3, wherein the CTC decoding is used for CTC processing of the acoustic model output data to combine the same results that appear consecutively and remove redundant results.
7. The automatic speech recognition method based on a deep neural network of claim 1, wherein the language model is a hidden Markov language model, which takes the pinyin sequence as its input and obtains the corresponding character recognition result; the language model performs pinyin-to-text conversion through the following steps:
s1, taking the pinyin sequence as input and, through an initial-and-final segmentation method, obtaining a pinyin sequence whose basic segmentation unit is the pinyin group;
s2, mapping each pinyin group through a pinyin-character dictionary to obtain the corresponding character sequence, where the character sequence stores the different Chinese characters corresponding to the same pinyin group;
s3, setting the initial probability value of every Chinese character appearing in the character sequence of each pinyin group to 1;
s4, arranging and combining all Chinese characters in the character sequences of adjacent pinyin groups into two-character phrases, and storing them as a screening sequence;
s5, constructing a two-character frequency dictionary, in which the occurrence frequency values of commonly used two-character phrases, commonly used domain-specific two-character phrases and other two-character phrases are stored;
s6, looking up each combined two-character phrase of the screening sequence in the two-character frequency dictionary; if the phrase exists it is kept, otherwise it is deleted; this yields the final state transition sequence;
s7, constructing a single-character frequency dictionary, in which the frequency values of commonly used characters, domain-specific characters and other characters are stored;
s8, calculating the state transition probability of each two-character phrase in the state transition sequence, where the transition probability formula is:
P = P0 · P2(A·B) / P1(A)
where A and B respectively denote the first and last character of the two-character phrase, P0 denotes the initial probability value, P2(A·B) denotes the occurrence frequency value of the two-character phrase, and P1(A) denotes the frequency value of the first character appearing as a single character;
s9, comparing the state transition probability of each two-character phrase with a transition threshold; if it is higher than the threshold, the current two-character phrase is taken as the updated output result and the current state transition probability value is stored;
and s10, repeating steps S1 to S9 until all transition probability values and corresponding output results are obtained, and arranging the output results in order as the final language recognition result.
8. The automatic speech recognition method based on a deep neural network of claim 4, wherein the Bi-GRU layer, i.e. the bidirectional GRU neural network model, comprises a forward GRU unit and a backward GRU unit; the input data enter the forward GRU unit and the backward GRU unit respectively for calculation, and the outputs of the two are concatenated or summed as the output of the Bi-GRU layer.
9. The automatic speech recognition method based on a deep neural network of claim 1, wherein the automatic speech recognition method is used for automatic speech recognition of the national language and/or foreign languages, and the pinyin-character dictionary is a dictionary that maps the pronunciations of the corresponding recognized language to its characters.
10. The automatic speech recognition method based on a deep neural network of claim 1, wherein the Hamming window function is:
w(n) = a0 - (1 - a0) · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n denotes the sample index of the truncated signal, a0 denotes the Hamming window constant with a value of 25/46, and N - 1 denotes the truncation window length of the Hamming window;
the Mel filter function of the Mel filter is:
Mel(f) = 2595 · log10(1 + f / 700)
where f denotes the frequency of the filtered signal.
CN202110599305.9A 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network Active CN113327585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110599305.9A CN113327585B (en) 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110599305.9A CN113327585B (en) 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network

Publications (2)

Publication Number Publication Date
CN113327585A true CN113327585A (en) 2021-08-31
CN113327585B CN113327585B (en) 2023-05-12

Family

ID=77422581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110599305.9A Active CN113327585B (en) 2021-05-31 2021-05-31 Automatic voice recognition method based on deep neural network

Country Status (1)

Country Link
CN (1) CN113327585B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744722A (en) * 2021-09-13 2021-12-03 上海交通大学宁波人工智能研究院 Off-line speech recognition matching device and method for limited sentence library
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
US20190057683A1 (en) * 2017-08-18 2019-02-21 Google Llc Encoder-decoder models for sequence to sequence mapping
US10229672B1 (en) * 2015-12-31 2019-03-12 Google Llc Training acoustic models using connectionist temporal classification
US20200027444A1 (en) * 2018-07-20 2020-01-23 Google Llc Speech recognition with sequence-to-sequence models
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
US20200335082A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Code-switching speech recognition with end-to-end connectionist temporal classification model
CN111899727A (en) * 2020-07-15 2020-11-06 苏州思必驰信息科技有限公司 Training method and system for voice recognition model of multiple speakers

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229672B1 (en) * 2015-12-31 2019-03-12 Google Llc Training acoustic models using connectionist temporal classification
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
US20190057683A1 (en) * 2017-08-18 2019-02-21 Google Llc Encoder-decoder models for sequence to sequence mapping
US20200027444A1 (en) * 2018-07-20 2020-01-23 Google Llc Speech recognition with sequence-to-sequence models
US20200335082A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Code-switching speech recognition with end-to-end connectionist temporal classification model
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111899727A (en) * 2020-07-15 2020-11-06 苏州思必驰信息科技有限公司 Training method and system for voice recognition model of multiple speakers

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
WEIZHE WANG等: "End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder", 《2020 IEEE 3RD INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND SIGNAL PROCESSING (ICICSP)》 *
ZHIHAO DU等: "Investigation of Monaural Front-End Processing for Robust Speech Recognition Without Retraining or Joint-Training", 《2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)》 *
刘柏基: "Research on the application of end-to-end speech recognition based on the attention mechanism", China Master's Theses Full-text Database, Information Science and Technology
卢云聪: "Construction of and experiments on a CNN-based acoustic model", China Master's Theses Full-text Database, Information Science and Technology
杜志浩 et al.: "Single-channel speech enhancement method based on auditory-masking generative adversarial networks", Intelligent Computer and Applications
潘粤成 et al.: "An end-to-end Mandarin speech recognition method based on CNN/CTC", Modern Information Technology

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744722A (en) * 2021-09-13 2021-12-03 上海交通大学宁波人工智能研究院 Off-line speech recognition matching device and method for limited sentence library
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence
CN116580706B (en) * 2023-07-14 2023-09-22 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence

Also Published As

Publication number Publication date
CN113327585B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN110534089B (en) Chinese speech synthesis method based on phoneme and prosodic structure
WO2022083083A1 (en) Sound conversion system and training method for same
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN111798840B (en) Voice keyword recognition method and device
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113327585B (en) Automatic voice recognition method based on deep neural network
CN111063336A (en) End-to-end voice recognition system based on deep learning
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN113160798A (en) Chinese civil aviation air traffic control voice recognition method and system
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN112489651A (en) Voice recognition method, electronic device and storage device
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
Diwan et al. Reduce and reconstruct: ASR for low-resource phonetic languages
Alrumiah et al. A Deep Diacritics-Based Recognition Model for Arabic Speech: Quranic Verses as Case Study
CN111128191B (en) Online end-to-end voice transcription method and system
CN113903349A (en) Training method of noise reduction model, noise reduction method, device and storage medium
Iswarya et al. Speech query recognition for Tamil language using wavelet and wavelet packets
Youa et al. Research on dialect speech recognition based on DenseNet-CTC
CN114743545B (en) Dialect type prediction model training method and device and storage medium
CN112151008B (en) Voice synthesis method, system and computer equipment
Yue et al. An Improved Speech Recognition System Based on Transformer Language Model
Vijaya et al. An Efficient System for Audio-Based Sign Language Translator Through MFCC Feature Extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant