CN111667819B - Voice recognition method, system, storage medium and electronic equipment based on CRNN

Info

Publication number: CN111667819B (application CN201910177117.XA)
Authority: CN (China)
Prior art keywords: layer, convolution, data, state value, RNN
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111667819A (application publication)
Inventor: 仇璐
Assignee (current and original): Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Application filed by the assignees with priority to CN201910177117.XA; published as CN111667819A, granted as CN111667819B.

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems


Abstract

The invention discloses a CRNN-based speech recognition method, system, storage medium, and electronic device. The recognition method comprises the following steps: acquiring processed voice data, wherein the processed voice data are the voice data within a preset filter width value, starting from the current frame position of the preprocessed voice data pointed to by the filter bank; inputting the processed voice data into the convolution layer, which outputs one frame of convolution output frame data; updating the position of the filter bank; judging whether voice data of the preset feature length have been acquired by the filter bank; if not, returning to the step of acquiring processed voice data; if so, inputting all obtained frames of convolution output frame data into the RNN layer to obtain an output state value; and inputting the output state value into the full connection layer to obtain the speech recognition result. Compared with the traditional method, the amount of data processed is greatly reduced, so the computation speed can be improved, the memory footprint is reduced, and speech recognition is achieved in real time.

Description

Voice recognition method, system, storage medium and electronic equipment based on CRNN
Technical Field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method, system, storage medium, and electronic device based on a CRNN (convolutional recurrent neural network).
Background
With the development of computer technology, the electronic products people use are increasingly intelligent. Voice is the most commonly used mode of human interaction, so intelligent speech recognition and voice wake-up technologies deployed on end devices have become a hot spot of this trend. Compared with traditional speech recognition algorithms, deep learning algorithms offer higher accuracy, stronger adaptability, and better generality, and have become the mainstream of the speech recognition technology applied in speech recognition and voice wake-up systems. The CRNN, with its excellent recognition rate, is one of the deep learning neural networks commonly used for speech recognition.
A CRNN comprises a convolution layer, an RNN (recurrent neural network) layer, and a full connection layer. When performing speech recognition it receives preprocessed voice data of fixed length and outputs the recognition result after passing through the convolution layer, the RNN layer, and the full connection layer in sequence, so the amount of computation is large. In a voice wake-up application environment in particular, to cope with the omnidirectional interference (echo, reverberation, interfering sound sources) introduced by the surrounding environment and the transmission medium, a microphone array is generally adopted to capture and preprocess voice data, which are then passed to the CRNN for recognition of the preprocessed voice data. However, because of the large amount of computation, the limited computing capacity of the receiving-end device, and the limited number of beams the microphones can allocate, the CRNN can hardly perform speech recognition calculation in real time, which affects the positioning accuracy and wake-up accuracy of voice wake-up.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that the CRNN requires a large amount of computation when processing voice information on an end device and therefore has difficulty performing speech recognition in real time, and provides a CRNN-based speech recognition method, system, storage medium, and electronic device.
The invention solves the technical problems by the following technical scheme:
the embodiment of the invention provides a voice recognition method based on CRNN, wherein the CRNN comprises a convolution layer, an RNN layer and a full connection layer, and the convolution layer comprises a filter bank, and the voice recognition method is characterized by comprising the following steps:
acquiring processed voice data, wherein the processed voice data is voice data in a preset filtering width value range from the current frame position of the voice data after preprocessing pointed by the filter bank;
inputting the processed voice data into the convolution layer, and outputting by the convolution layer to obtain one frame of convolution output frame data;
updating the orientation of the filter bank;
judging whether the voice data of the preset characteristic length of the filter bank is obtained, if the judgment result is negative, returning to the step of obtaining the processed voice data, if the judgment result is positive, inputting the obtained convolution output frame data of all frames into the RNN layer, and outputting by the RNN layer to obtain an output state value;
and inputting the output state value into the full-connection layer, and outputting a voice recognition result corresponding to the voice data with the preset characteristic length by the full-connection layer.
Preferably, the step of inputting all obtained frames of convolution output frame data into the RNN layer and outputting an output state value by the RNN layer comprises:
inputting the convolution output frame data of the current frame and the most recently updated intermediate state value of the RNN layer into the RNN layer, which outputs the updated intermediate state value;
judging whether the convolution output frame data of all frames have been input; if not, taking the convolution output frame data of the next frame as the convolution output frame data of the current frame, setting the updated intermediate state value as the most recently updated intermediate state value of the RNN layer, and returning to the step of inputting the convolution output frame data of the current frame and the most recently updated intermediate state value of the RNN layer into the RNN layer; if so, taking the last state value of the RNN layer as the output state value.
Preferably, before the step of inputting the intermediate state value of the RNN layer updated last time and the convolved output frame data of the current frame into the RNN layer, the method further comprises:
the state of the RNN layer is initialized to an initial state value.
The invention further provides a CRNN-based speech recognition system, the CRNN comprising a convolution layer, an RNN layer, and a full connection layer, the convolution layer comprising a filter bank; the speech recognition system comprises an acquisition module, a convolution module, an updating module, a recognition module, and a full connection module;
the acquisition module is used for acquiring processed voice data, wherein the processed voice data are the voice data within a preset filter width value, starting from the current frame position of the preprocessed voice data pointed to by the filter bank;
the convolution module is used for inputting the processed voice data into the convolution layer, which outputs one frame of convolution output frame data;
the updating module is used for updating the position of the filter bank;
the recognition module is used for judging whether voice data of the preset feature length have been acquired by the filter bank; if not, the acquisition module is called; if so, all obtained frames of convolution output frame data are input to the RNN layer, which outputs an output state value;
the full connection module is used for inputting the output state value into the full connection layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
Preferably, the recognition module is further configured to input the convolution output frame data of the current frame and the most recently updated intermediate state value of the RNN layer into the RNN layer, which outputs the updated intermediate state value;
and to judge whether the convolution output frame data of all frames have been input; if not, the convolution output frame data of the next frame are taken as the convolution output frame data of the current frame, the updated intermediate state value is set as the most recently updated intermediate state value of the RNN layer, and the flow returns to inputting the convolution output frame data of the current frame and the most recently updated intermediate state value of the RNN layer into the RNN layer; if so, the last state value of the RNN layer is taken as the output state value.
Preferably, the recognition module comprises an initialization unit, and the initialization unit is configured to initialize the state of the RNN layer to an initial state value.
Another embodiment of the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a CRNN-based speech recognition method as described above when executing the computer program.
Another embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a CRNN-based speech recognition method as described above.
The positive effects of the invention are as follows:
According to the invention, the voice data within a preset filter width, starting from the current frame position of the preprocessed voice data, are acquired each time as the processed voice data and input into the convolution layer to obtain one frame of convolution output frame data. Compared with the traditional method, in which all current frames of the preprocessed voice data are read into the convolution layer in every cycle, the amount of data processed is greatly reduced, so the computation speed can be improved, the memory footprint is reduced, and real-time speech recognition on the terminal device becomes feasible.
Drawings
Fig. 1 is a flowchart of a CRNN-based voice recognition method according to embodiment 1 of the present invention.
Fig. 2 is a schematic flow chart of a conventional CRNN loop according to embodiment 1 of the present invention.
Fig. 3 is a schematic flow chart of the incremental CRNN in embodiment 1 of the present invention.
Fig. 4 is a flowchart of step 105 of the CRNN-based voice recognition method of embodiment 2 of the present invention.
Fig. 5 is a schematic block diagram of a CRNN-based voice recognition system according to embodiment 3 of the present invention.
Fig. 6 is a schematic block diagram of a CRNN-based voice recognition system according to embodiment 4 of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
The present embodiment provides a voice recognition method based on a CRNN, where the CRNN includes a convolution layer, an RNN layer, and a full connection layer, the convolution layer includes a filter bank, as shown in fig. 1, and the voice recognition method includes the steps of:
step 101, obtaining processed voice data, wherein the processed voice data is voice data in a preset filtering width value range from the current frame position of the voice data after preprocessing pointed by the filter bank.
Step 102, inputting the processed voice data into a convolution layer, and outputting by the convolution layer to obtain a frame of convolution output frame data.
Step 103, updating the position of the filter bank.
Step 104, determining whether the voice data of the preset feature length of the filter bank is obtained, if not, returning to step 101, and if yes, executing step 105.
Step 105, inputting all obtained frames of convolution output frame data into the RNN layer, which outputs an output state value.
Step 106, inputting the output state value into the full connection layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
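For orientation, the minimal Python sketch below strings steps 101 to 106 together. It is an illustrative reading of the flow, not the patent's implementation: the callables conv_layer, rnn_cell, and fc_layer stand in for the trained CRNN layers, and w (filter-bank width), s (stride), and L (preset feature length) are assumed parameters.

    # Hypothetical sketch of steps 101-106; conv_layer, rnn_cell and fc_layer
    # are placeholders for the trained layers, not defined by the patent.
    def recognize(features, conv_layer, rnn_cell, fc_layer, w, s, L):
        pos = 0                                     # current frame position of the filter bank
        conv_frames = []
        while pos + w <= L:                         # step 104: preset feature length reached?
            window = features[pos:pos + w]          # step 101: w frames from the current position
            conv_frames.append(conv_layer(window))  # step 102: one convolution output frame
            pos += s                                # step 103: advance the filter-bank position
        state = None                                # rnn_cell is assumed to treat None as its initial state value
        for frame in conv_frames:                   # step 105: update the RNN state frame by frame
            state = rnn_cell(frame, state)
        return fc_layer(state)                      # step 106: speech recognition result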
In the end device, to cope with the omnidirectional interference (echo, reverberation, and interfering sound sources) generally introduced by the surrounding environment and the propagation medium, a microphone array is used to capture and preprocess voice data. The preprocessed voice data are then passed to the speech recognition module, i.e. the CRNN recognition network, which performs speech recognition on a preselected beam. The method of this embodiment can be used for recognition of dynamically updated voice data in different practical application scenarios and is also suitable for recognition of historical data; no particular limitation is imposed here.
The CRNN recognition network is now described. In general, the input to the CRNN is an MFCC (Mel-frequency cepstral coefficient) feature data stream with a fixed number of frames, i.e. the preprocessed voice data, and the recognition result is obtained after several loops, once the iteration completes. In real-time recognition, the speech feature data input to two successive loops partially overlap. The input speech feature data are convolved by the convolution layer to output a fixed number of frames; if the fixed number is N, this is called full convolution. The N frames are input in sequence to the RNN layer to update its state value, which is called the full RNN, and the last state value of the RNN layer is input, as the output state value, to the full connection layer to calculate the full-connection-layer output.
Here, assume that the CRNN input is 7 frames of feature data, the filter-bank width of the convolution layer is 3 frames, the convolution stride in the time direction is 2 frames, and the number of convolution layers is 1.
In general, as shown in Fig. 2, in the processing flow of a conventional CRNN loop, assume the first loop: the speech feature data 110, frames i1-i7, are input to the convolution layer 111. Under the convolution frame-number condition (the filter-bank width is 3 frames), the filter bank stops at the i7 frame position after convolution, as indicated by arrow 112, and 3 frames of convolution output frame data o1-o3 are produced. The o1-o3 data are input frame by frame to the RNN layer 113 to update the state (state value) of the RNN layer 113; after the state of the RNN layer 113 has been updated with frame o3, the value of the last state of the RNN layer 113 is output, as the output state value, to the full connection layer 114 for calculation.
Assume the speech feature data 110 advance by two frames after each cycle, so in the second cycle they become i3-i9. In the conventional CRNN processing procedure, the second cycle processes i3-i9 of the speech feature data 110; frames i3-i7 overlap with the first cycle, so a great deal of computation is repeated. Applying the speech recognition method of this embodiment instead, only the three frames i7-i9 are fetched from the current filter-bank position i7, and the seven frames i3-i9 are not recalculated. As shown in Fig. 3, the convolution stops at the i9 frame position indicated by arrow 115, as determined by the convolution frame-number condition; the three frames i7-i9 are input to the convolution layer 116, which produces one frame of data o4. The position of the filter bank is then updated, the next three frames are read from the updated position and input to the convolution layer 116, and the convolution layer 116 outputs another new frame of convolution output frame data, and so on until the voice data of the preset feature length have been processed. This part is called incremental convolution, and the CRNN using this method is defined as the incremental CRNN.
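A small numerical check makes this concrete. The sketch below is a hedged illustration rather than the patent's implementation: each frame is reduced to a scalar and the 3-tap kernel values are assumptions. It verifies that convolving only i7-i9 in the second cycle reproduces the last frame of a full convolution over i3-i9, and that the traditional second cycle recomputes o2 and o3.

    # Illustrative check of incremental vs. full convolution (w = 3, s = 2).
    # The kernel values and scalar "frames" are assumptions for the demo only.
    import numpy as np

    def conv1d(frames, kernel, stride):
        # Valid 1-D convolution along time: one output per window of width len(kernel).
        w = len(kernel)
        return np.array([frames[t:t + w] @ kernel
                         for t in range(0, len(frames) - w + 1, stride)])

    kernel = np.array([0.2, 0.5, 0.3])
    i = np.arange(1.0, 10.0)                         # scalar stand-ins for frames i1..i9

    full_first = conv1d(i[0:7], kernel, stride=2)    # first cycle, i1-i7 -> o1, o2, o3
    full_second = conv1d(i[2:9], kernel, stride=2)   # traditional second cycle, i3-i9 -> o2, o3, o4
    incr_second = conv1d(i[6:9], kernel, stride=2)   # incremental second cycle, i7-i9 -> o4

    assert np.allclose(full_second[:2], full_first[1:])  # o2, o3 are recomputed work
    assert np.allclose(incr_second, full_second[-1:])    # one window yields the same o4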
With the above speech recognition method, the voice data within the preset filter width, starting from the current frame position of the preprocessed voice data pointed to by the filter bank, are acquired each time as the processed voice data. Compared with the traditional method, in which the data of all frames of the preprocessed voice data must be read into the convolution layer in every cycle, the amount of data processed is greatly reduced, so the computation speed can be improved, the memory footprint is reduced, and real-time speech recognition on the terminal device can be achieved.
Example 2
The present embodiment provides a voice recognition method based on CRNN, which is different from embodiment 1 in that, as shown in fig. 4, step 105 includes:
step 1051, initializing the RNN layer state to an initial state value.
Step 1052, inputting the most recently updated intermediate state value of the RNN layer and the convolution output frame data of the current frame into the RNN layer, which outputs the updated intermediate state value.
Step 1053, judging whether the convolution output frame data of all frames is input, if not, executing step 1054; if yes, go to step 1055.
The convolution output frame data of all frames are the data of all frames corresponding to the voice data of the preset feature length of the filter bank.
Step 1054, taking the convolution output frame data of the next frame as the convolution output frame data of the current frame, setting the updated intermediate state value as the most recently updated intermediate state value of the RNN layer, and returning to step 1052.
Step 1055, taking the last state value of the RNN layer as the output state value.
In this method, voice data of the preset feature length within the preprocessed voice data are input to the convolution layer in sequence, in units of the preset filter width and with the preset convolution stride, yielding the corresponding multiple frames of convolution output frame data; these frames are input to the RNN layer in sequence, and the output state value is finally obtained.
Preprocessed voice data usually comprise several groups of voice data of the preset feature length. Whether all voice data have been processed is judged; if not, the flow returns to step 101 and continues the subsequent recognition in a loop until all the voice data have been recognized.
In the first cycle the conventional CRNN loop steps can be used, because the first cycle processes all data from the initial state and involves no large amount of repeated computation; the incremental CRNN can be applied from the second cycle onward.
As a further optimization, the incremental CRNN can be used for the calculation in the first cycle as well. In that case only the last two frames (i6, i7) of the preprocessed feature data are true preprocessed feature data, while i1-i5 are initialized to 0. With the preset filter-bank width of 3 frames, the two frames i6 and i7 do not satisfy the convolution frame-number condition, so no convolution is performed this time. In the second cycle, since the speech feature data advance by two frames after each cycle, the data become i4-i7; these 4 frames satisfy the convolution frame-number condition, and one convolution calculation can be performed. Computation is therefore only incurred from the second cycle, the computation of the whole neural network in the first cycle drops to 0, and the total computation is further reduced.
The output of the convolution layer can be handled in two ways:
1. Continuing the previous example: in the second cycle, the incremental CRNN approach may output o2, o3, and o4 to the RNN layer frame by frame. This is consistent with the logic of the conventional CRNN, and it also shows that the incremental convolution and the full convolution produce consistent results, i.e. that the incremental convolution result is correct.
2. Only data o4 is output to the RNN layer.
This works because the RNN has the following characteristic:
the state value at each time step is updated from the previous state value and the convolution output frame data of the current time, and every update is the same operation. Therefore the recalculation of o2 and o3 can be discarded, and a single update calculation can be performed directly with the last state value and the current input o4. This part is called the incremental RNN; the last state of the RNN layer is then output to the full connection layer for calculation.
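The runnable sketch below illustrates the incremental RNN update under stated assumptions: a plain tanh cell stands in for the RNN layer, and the weights, the 4-dimensional state, and the frames o1-o4 are random placeholders rather than patent parameters.

    # Hypothetical incremental-RNN demo: keep the state left after o3 and
    # advance it with o4 alone, instead of re-feeding o2 and o3.
    import numpy as np

    def rnn_cell(x, h, Wx, Wh):
        # One RNN step: the new state depends only on the previous state and x.
        return np.tanh(Wx @ x + Wh @ h)

    rng = np.random.default_rng(0)
    Wx = rng.standard_normal((4, 4))
    Wh = rng.standard_normal((4, 4))
    o1, o2, o3, o4 = rng.standard_normal((4, 4))   # convolution output frames

    h = np.zeros(4)                # initial state value
    for o in (o1, o2, o3):         # first cycle: full RNN over o1-o3
        h = rnn_cell(o, h, Wx, Wh)

    h = rnn_cell(o4, h, Wx, Wh)    # second cycle: one incremental update with o4

Because the saved state already carries the effect of o1-o3, the single update with o4 replaces the three updates the traditional second cycle would perform.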
In the second mode, the computation of the convolution layer in the current and subsequent cycles is 1/3 of that in the original traditional mode, and the computation of the RNN layer is likewise 1/3 of the original. The amount of computation is thus further reduced and the computational performance is optimized.
The computational saving varies with the dimension m of the preprocessed voice data, the number u of updated frames per cycle, the filter width w, the convolution stride s, and the number p of convolution layers. When s < u (ensuring that at least one convolution can be performed per cycle), the computation of the convolution and RNN parts of the incremental CRNN does not exceed (u+s)/(m-w) of that of the corresponding parts of the full CRNN; when s >= u, the fraction is even smaller. Typically the number of input feature frames of a conventional CRNN, i.e. the dimension of the preprocessed voice data, is in the hundreds or more, while the number of feature frames updated per cycle is very small relative to that dimension, so the computational saving of the incremental CRNN is considerable. When there is more than one convolution layer, a person skilled in the art can similarly convolve only the updated input data of each layer, so the computation of every layer can be optimized.
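As a worked instance of the (u+s)/(m-w) bound, the snippet below plugs in illustrative values m = 100, u = 2, w = 3, s = 2; none of these numbers are prescribed by the patent.

    # Hypothetical numbers for the saving estimate; m, u, w and s are assumptions.
    m, u, w, s = 100, 2, 3, 2
    full_windows = (m - w) // s + 1      # windows convolved per full-CRNN cycle: 49
    incr_windows = u // s                # new windows per incremental cycle: 1
    print(incr_windows / full_windows)   # ~0.02: about 2% of the full convolution work
    print((u + s) / (m - w))             # ~0.04: the looser bound quoted above

Per-cycle convolution work thus drops to roughly 2% in this instance, comfortably below the 4% bound.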
The speech recognition method reduces the computation of each cycle by improving the calculation logic of the CRNN in real-time recognition. Incremental convolution and an incremental RNN are adopted at the convolution layer and the RNN layer respectively to reduce their computation; the result is called the incremental CRNN. The smaller computation of the incremental CRNN relaxes the limit on the parameter size of the CRNN model, so a larger model with a higher recognition rate can be used in real-time recognition. Because less computing capacity is occupied, the limits on the number of allocatable and selectable microphone-array beams are also relaxed, reducing the impact on positioning accuracy. Likewise, the incremental CRNN lightens the load on the terminal device, so the whole voice wake-up system runs smoothly.
Example 3
The present embodiment provides a voice recognition system based on a CRNN, where the CRNN includes a convolution layer, an RNN layer, and a full connection layer, the convolution layer includes a filter bank, and as shown in fig. 5, the voice recognition system includes an acquisition module 201, a convolution module 202, an update module 203, a recognition module 204, and a full connection module 205.
The obtaining module 201 is configured to obtain processed voice data, i.e. the voice data within a preset filter width, starting from the current frame position of the preprocessed voice data pointed to by the filter bank.
The convolution module 202 is configured to input the processed voice data into the convolution layer, which outputs one frame of convolution output frame data.
The updating module 203 is configured to update the position of the filter bank.
The recognition module 204 is configured to judge whether voice data of the preset feature length have been acquired by the filter bank; if not, the obtaining module 201 is called; if so, all obtained frames of convolution output frame data are input to the RNN layer, which outputs the output state value.
The full connection module 205 is configured to input the output state value into the full connection layer, which outputs the speech recognition result corresponding to the voice data of the preset feature length.
In the end device, to cope with the omnidirectional interference (echo, reverberation, and interfering sound sources) generally introduced by the surrounding environment and the propagation medium, a microphone array is used to capture and preprocess voice data. The preprocessed voice data are then passed to the speech recognition module, i.e. the CRNN recognition network, which performs speech recognition on a preselected beam. The system of this embodiment can be used for recognition of dynamically updated voice data in different practical application scenarios and is also suitable for recognition of historical data; no particular limitation is imposed here.
The CRNN recognition network is now described. In general, the input to the CRNN is an MFCC (Mel-frequency cepstral coefficient) feature data stream with a fixed number of frames, i.e. the preprocessed voice data, and the recognition result is obtained after several loops, once the iteration completes. In real-time recognition, the speech feature data input to two successive loops partially overlap. The input speech feature data are convolved by the convolution layer to output a fixed number of frames; if the fixed number is N, this is called full convolution. The N frames are input in sequence to the RNN layer to update its state value, which is called the full RNN, and the last state value of the RNN layer is input, as the output state value, to the full connection layer to calculate the full-connection-layer output.
Here, assume that the CRNN input is 7 frames of feature data, the filter-bank width of the convolution layer is 3 frames, the convolution stride in the time direction is 2 frames, and the number of convolution layers is 1.
In general, as shown in Fig. 2, in the processing flow of a conventional CRNN loop, assume the first loop: the speech feature data 110, frames i1-i7, are input to the convolution layer 111. Under the convolution frame-number condition (the filter-bank width is 3 frames), the filter bank stops at the i7 frame position after convolution, as indicated by arrow 112, and 3 frames of convolution output frame data o1-o3 are produced. The o1-o3 data are input frame by frame to the RNN layer 113 to update the state (state value) of the RNN layer 113; after the state of the RNN layer 113 has been updated with frame o3, the value of the last state of the RNN layer 113 is output, as the output state value, to the full connection layer 114 for calculation.
Assume the speech feature data 110 advance by two frames after each cycle, so in the second cycle they become i3-i9. In the conventional CRNN processing procedure, the second cycle processes i3-i9 of the speech feature data 110; frames i3-i7 overlap with the first cycle, so a great deal of computation is repeated. Applying the speech recognition system of this embodiment instead, only the three frames i7-i9 are fetched from the current filter-bank position i7, and the seven frames i3-i9 are not recalculated. As shown in Fig. 3, the convolution stops at the i9 frame position indicated by arrow 115, as determined by the convolution frame-number condition; the three frames i7-i9 are input to the convolution layer 116, which produces one frame of data o4. The position of the filter bank is then updated, the next three frames are read from the updated position and input to the convolution layer 116, and the convolution layer 116 outputs another new frame of convolution output frame data, and so on until the voice data of the preset feature length have been processed. This part is called incremental convolution, and the CRNN using this method is defined as the incremental CRNN.
With the above speech recognition system, the voice data within the preset filter width, starting from the current frame position of the preprocessed voice data pointed to by the filter bank, are taken each time as the processed voice data. Compared with the traditional method, in which the data of all current frames of the preprocessed voice data must be read into the convolution layer in every cycle, the amount of data processed is greatly reduced, so the computation speed can be improved, the memory footprint is reduced, and real-time speech recognition on the terminal device can be achieved.
Example 4
The present embodiment provides a CRNN-based speech recognition system, which is different from embodiment 3 in that the speech recognition system further includes an initialization unit 2041 as shown in fig. 6.
More specifically, the recognition module 204 is further configured to input the most recently updated intermediate state value of the RNN layer and the convolution output frame data of the current frame into the RNN layer, which outputs the updated intermediate state value; and to judge whether the convolution output frame data of all frames have been input. If not, the convolution output frame data of the next frame are taken as the convolution output frame data of the current frame, the updated intermediate state value is set as the most recently updated intermediate state value of the RNN layer, and the flow returns to inputting the convolution output frame data of the current frame and the most recently updated intermediate state value of the RNN layer into the RNN layer; if so, the last state value of the RNN layer is taken as the output state value.
The initialization unit 2041 is configured to initialize the state of the RNN layer to an initial state value. Preprocessed voice data generally comprise several groups of voice data of the preset feature length; whether all voice data have been processed is judged, and if not, the obtaining module is called again and the subsequent recognition continues in a loop until all the voice data have been recognized.
In the first cycle the conventional CRNN loop steps can be used, because the first cycle involves no large amount of repeated computation; the incremental CRNN can be used for calculation from the second cycle onward.
As a further optimization, the incremental CRNN can be used for the calculation in the first cycle as well. In that case only the last two frames (i6, i7) of the preprocessed feature data are true preprocessed feature data, while i1-i5 are initialized to 0. With the preset filter-bank width of 3 frames as before, the two frames i6 and i7 do not satisfy the convolution frame-number condition, so no convolution is performed this time. In the second cycle, with the speech feature data advancing by two frames per cycle as before, the data become i4-i7; these 4 frames satisfy the convolution frame-number condition, and one convolution calculation can be performed. Computation is therefore only incurred from the second cycle, the computation of the whole neural network in the first cycle drops to 0, and the total computation is further reduced.
The output of the convolution layer can be handled in two ways:
1. Continuing the previous example: in the second cycle, the incremental CRNN approach may output o2, o3, and o4 to the RNN layer frame by frame. This is consistent with the logic of the conventional CRNN, and it also shows that the incremental convolution and the full convolution produce consistent results, i.e. that the incremental convolution result is correct.
2. Only the data o4 are output to the RNN layer.
This works because the RNN has the following characteristic:
the state value at each time step is updated from the previous state value and the convolution output frame data of the current time, and every update is the same operation. Therefore the recalculation of o2 and o3 can be discarded, and a single update calculation can be performed directly with the last state value and the current input o4. This part is called the incremental RNN; the last state of the RNN layer is then output to the full connection layer for calculation.
In the second mode, the computation of the convolution layer in the current and subsequent cycles is 1/3 of that in the original traditional mode, and the computation of the RNN layer is likewise 1/3 of the original. The amount of computation is thus further reduced and the computational performance is optimized.
The computational saving varies with the dimension m of the preprocessed voice data, the number u of updated frames per cycle, the filter width w, the convolution stride s, and the number p of convolution layers. When s < u (ensuring that at least one convolution can be performed per cycle), the computation of the convolution and RNN parts of the incremental CRNN does not exceed (u+s)/(m-w) of that of the corresponding parts of the full CRNN; when s >= u, the fraction is even smaller. Typically the number of input feature frames of a conventional CRNN, i.e. the dimension of the preprocessed voice data, is in the hundreds or more, while the number of feature frames updated per cycle is very small relative to that dimension, so the computational saving of the incremental CRNN is considerable. When there is more than one convolution layer, a person skilled in the art can similarly convolve only the updated input data of each layer, so the computation of every layer can be optimized.
The speech recognition system reduces the computation of each cycle by improving the calculation logic of the CRNN in real-time recognition. Incremental convolution and an incremental RNN are adopted at the convolution layer and the RNN layer respectively to reduce their computation; the result is called the incremental CRNN. The smaller computation of the incremental CRNN relaxes the limit on the parameter size of the CRNN model, so a larger model with a higher recognition rate can be used in real-time recognition. Because less computing capacity is occupied, the limits on the number of allocatable and selectable microphone-array beams are also relaxed, reducing the impact on positioning accuracy. Likewise, the incremental CRNN lightens the load on the terminal device, so the whole voice wake-up system runs smoothly.
Example 5
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention. The electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the CRNN-based speech recognition method of embodiment 1 when executing the program. The electronic device 30 shown in Fig. 7 is only an example and should not impose any limitation on the functionality and scope of use of embodiments of the present invention.
As shown in Fig. 7, the electronic device 30 may take the form of a general-purpose computing device, for example a server device. Components of the electronic device 30 may include, but are not limited to: at least one processor 31, at least one memory 32, and a bus 33 connecting the different system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
Memory 32 may include volatile memory such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the CRNN-based voice recognition method provided in embodiment 1 of the present invention, by running a computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through an input/output (I/O) interface 35. The electronic device 30 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functionality of two or more units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided among a plurality of units/modules.
Example 6
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the CRNN-based speech recognition method provided in embodiment 1.
More specifically, the readable storage medium that may be employed includes, but is not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of implementing the CRNN-based speech recognition method of embodiment 1 when the program product is run on the terminal device.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and it may execute entirely on the user device, partly on the user device as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (6)

1. A speech recognition method based on a CRNN, the CRNN including a convolution layer, an RNN layer, and a full connection layer, the convolution layer including a filter bank, the speech recognition method comprising:
acquiring processed voice data, wherein the processed voice data is voice data in a preset filtering width value range from the current frame position of the voice data after preprocessing pointed by the filter bank;
inputting the processed voice data into the convolution layer, and outputting by the convolution layer to obtain one frame of convolution output frame data;
updating the position of the filter bank;
judging whether the voice data of the preset characteristic length of the filter bank is obtained or not, if the judgment result is negative, returning to the step of obtaining the processed voice data, if the judgment result is positive, inputting the obtained convolution output frame data of all frames into the RNN layer, and outputting by the RNN layer to obtain an output state value;
inputting the output state value to the full connection layer, and outputting a voice recognition result corresponding to the voice data with the preset characteristic length by the full connection layer;
the step of inputting the obtained convolution output frame data of all frames into the RNN layer, and outputting the output state value by the RNN layer comprises the following steps:
inputting the convolution output frame data of the current frame and the intermediate state value of the RNN layer updated in the previous time to the RNN layer, and outputting the updated intermediate state value by the RNN layer;
judging whether the convolution output frame data of all frames are input completely, if the judgment result is no, taking the convolution output frame data of the next frame as the convolution output frame data of the current frame, setting the updated intermediate state value as the intermediate state value of the RNN layer updated last time, and returning to the step of inputting the convolution output frame data of the current frame and the intermediate state value of the RNN layer updated last time into the RNN layer; and if the judgment result is yes, taking the last state value of the RNN layer as the output state value.
2. The CRNN-based voice recognition method as set forth in claim 1, wherein the step of inputting the convolved output frame data of the current frame and the intermediate state value of the RNN layer updated last time to the RNN layer further comprises:
the state of the RNN layer is initialized to an initial state value.
3. A voice recognition system based on a CRNN, wherein the CRNN comprises a convolution layer, an RNN layer and a full connection layer, and the convolution layer comprises a filter bank, and the voice recognition system is characterized by comprising an acquisition module, a convolution module, an updating module, a recognition module and a full connection module;
the acquisition module is used for acquiring processed voice data, wherein the processed voice data is voice data in a preset filter width value range from the current frame position of the voice data after preprocessing pointed by the filter bank;
the convolution module is used for inputting the processed voice data into the convolution layer, and the convolution layer outputs and obtains one frame of convolution output frame data;
the updating module is used for updating the position of the filter bank;
the recognition module is used for judging whether voice data of the preset characteristic length of the filter bank are acquired or not, if the judgment result is negative, the acquisition module is called, if the judgment result is positive, the obtained convolution output frame data of all frames are input to the RNN layer, and the RNN layer outputs to obtain an output state value;
the full connection module is used for inputting the output state value into the full connection layer, and the full connection layer outputs a voice recognition result corresponding to the voice data with the preset characteristic length;
the recognition module is further configured to input the intermediate state value of the RNN layer updated last time and the convolutionally output frame data of the current frame to the RNN layer, where the RNN layer outputs the updated intermediate state value;
the method is also used for judging whether the convolution output frame data of all frames are input completely, if the judgment result is negative, the convolution output frame data of the next frame is used as the convolution output frame data of the current frame, the updated intermediate state value is set as the intermediate state value of the RNN layer updated last time, and the step of inputting the convolution output frame data of the current frame and the intermediate state value of the RNN layer updated last time into the RNN layer is returned; and if the judgment result is yes, taking the last state value of the RNN layer as an output state value.
4. The CRNN-based speech recognition system of claim 3, wherein the recognition module comprises an initialization unit configured to initialize the state of the RNN layer to an initial state value.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the CRNN-based speech recognition method according to claim 1 or 2 when executing the computer program.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the CRNN-based speech recognition method according to claim 1 or 2.
Application CN201910177117.XA, priority and filing date 2019-03-08: Voice recognition method, system, storage medium and electronic equipment based on CRNN. Status: Active. Granted as CN111667819B (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910177117.XA CN111667819B (en) 2019-03-08 2019-03-08 Voice recognition method, system, storage medium and electronic equipment based on CRNN

Publications (2)

Publication Number Publication Date
CN111667819A (en) 2020-09-15
CN111667819B (en) 2023-09-01 (grant)

Family

Family ID: 72382405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910177117.XA CN111667819B (en) Active 2019-03-08 2019-03-08 Voice recognition method, system, storage medium and electronic equipment based on CRNN

Country Status (1)

CN (1): CN111667819B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259120B (en) * 2020-10-19 2021-06-29 南京硅基智能科技有限公司 Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN112259080B (en) * 2020-10-20 2021-06-22 北京讯众通信技术股份有限公司 Speech recognition method based on neural network model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1426048A (en) * 2001-12-13 2003-06-25 中国科学院自动化研究所 End detection method based on entropy
CN103995292A (en) * 2014-06-09 2014-08-20 桂林电子科技大学 Transient electromagnetic early signal reconstruction method
CN106448696A (en) * 2016-12-20 2017-02-22 成都启英泰伦科技有限公司 Adaptive high-pass filtering speech noise reduction method based on background noise estimation
CN108009635A (en) * 2017-12-25 2018-05-08 大连理工大学 A kind of depth convolutional calculation model for supporting incremental update
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106710589B (en) * 2016-12-28 2019-07-30 百度在线网络技术(北京)有限公司 Speech Feature Extraction and device based on artificial intelligence
KR102415508B1 (en) * 2017-03-28 2022-07-01 삼성전자주식회사 Convolutional neural network processing method and apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Real-time visual tracking algorithm based on sparse convolution features and correlation filtering; 熊昌镇 (Xiong Changzhen), 车满强 (Che Manqiang), 王润玲 (Wang Runling); Computer Applications (计算机应用), Vol. 38, No. 08, pp. 2176-2179 and 2223 *

Also Published As

Publication number Publication date
CN111667819A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
US10650807B2 (en) Method and system of neural network keyphrase detection
CN112699991A (en) Method, electronic device, and computer-readable medium for accelerating information processing for neural network training
US20220083868A1 (en) Neural network training method and apparatus, and electronic device
US20230368807A1 (en) Deep-learning based speech enhancement
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
CN110853663A (en) Speech enhancement method based on artificial intelligence, server and storage medium
JP6818372B2 (en) Noise Removal Variational Auto-Encoder Platform Integrated Training Methods and Equipment for Speech Detection
CN111667819B (en) Voice recognition method, system, storage medium and electronic equipment based on CRNN
CN115362497A (en) Sequence-to-sequence speech recognition with delay threshold
CN111066082A (en) Voice recognition system and method
CN112163601A (en) Image classification method, system, computer device and storage medium
JP5060006B2 (en) Automatic relearning of speech recognition systems
JP2020086436A (en) Decoding method in artificial neural network, speech recognition device, and speech recognition system
CN114612749A (en) Neural network model training method and device, electronic device and medium
CN112559721B (en) Method, device, equipment, medium and program product for adjusting man-machine dialogue system
US20200090657A1 (en) Adaptively recognizing speech using key phrases
CN117296061A (en) Diffusion model with improved accuracy and reduced computing resource consumption
US20230051625A1 (en) Method and apparatus with speech processing
CN114005452A (en) Method and device for extracting voice features, electronic equipment and storage medium
CN112287950B (en) Feature extraction module compression method, image processing method, device and medium
CN111968620A (en) Algorithm testing method and device, electronic equipment and storage medium
CN113196232A (en) Neural network scheduling method and device, computer equipment and readable storage medium
EP3905240A1 (en) Speech recognition of overlapping segments
CN114037772A (en) Training method of image generator, image generation method and device
Brakel et al. Bidirectional truncated recurrent neural networks for efficient speech denoising

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant