CN110992941A - Power grid dispatching voice recognition method and device based on spectrogram - Google Patents
- Publication number
- CN110992941A (application number CN201911004454.5A)
- Authority
- CN
- China
- Prior art keywords
- spectrogram
- network
- power grid
- convolution
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/16 — Speech classification or search using artificial neural networks
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G10L15/26 — Speech to text systems
Abstract
The invention relates to a spectrogram-based power grid dispatching voice recognition method and device. Continuous power grid dispatching voice to be recognized is converted into a spectrogram representation through spectrum analysis; an Inception convolution structure from image recognition algorithms is used to extract convolution features from the input power grid dispatching spectrogram image, and the extraction result is converted into a convolution feature matrix; a recurrent network then continues to extract sequence features on the basis of the convolution features extracted by the convolution network, outputting them as a probability distribution matrix; for the output probability distribution matrix predicted by the recurrent network, CTC (Connectionist Temporal Classification) is used to align input and output, which are then mapped to characters through a num2word dictionary; finally, the character result recognized from the power grid dispatching voice is checked and output. By improving the accuracy of voice recognition, the invention keeps pronunciation and characters consistent, thereby reducing the operating intensity and pressure on dispatchers.
Description
Technical Field
The invention belongs to the field of electric power system dispatching, relates to intelligent human-machine interaction, and particularly relates to a spectrogram-based power grid dispatching voice recognition method and device.
Background
With the explosive growth of new-energy and distributed photovoltaic projects connecting to the power grid, grid operation is affected by many uncertain factors, and dispatchers face enormous pressure on both working intensity and efficiency. Artificial intelligence is the best choice for addressing these problems, and voice recognition technology is the best choice for intelligent power grid dispatching operation.
Traditional speech recognition technology extracts features after the Fourier transform using various manually designed filter banks, which causes information loss in the frequency domain, particularly pronounced in the high-frequency region. In addition, for reasons of computational cost, traditional speech features require a very large frame shift, which inevitably causes information loss in the time domain; this loss becomes more pronounced when the speaking rate is fast.
Speech recognition for power grid dispatching can be realized by combining machine learning classification algorithms with traditional hand-crafted speech features, but the generated Chinese sentences have poor continuity and readability, both objectively and subjectively. Deep learning networks improve on this problem, but defects such as a low recognition rate remain.
A search of patent publications found no publication identical to the present application.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a spectrogram-based power grid dispatching voice recognition method and device.
The technical problem to be solved by the invention is addressed by the following technical scheme:
A spectrogram-based power grid dispatching voice recognition method, characterized by comprising the following steps:
Step one: perform spectrum analysis on the power grid dispatching voice to be recognized, converting the continuous voice into a spectrogram representation;
Step two: use an Inception convolution structure from image recognition algorithms to extract convolution features from the input power grid dispatching spectrogram image, and convert the extraction result into a convolution feature matrix;
Step three: through a recurrent network, continue extracting sequence features on the basis of the convolution features extracted by the convolution network, and output them as a probability distribution matrix;
Step four: for the output probability distribution matrix predicted by the recurrent network, use CTC to align input and output, and map the result to characters through a num2word dictionary;
Step five: check and output the character result recognized from the power grid dispatching voice.
Furthermore, the recurrent network is a deep bidirectional LSTM network.
A spectrogram-based power grid dispatching voice recognition device, characterized by comprising the following modules:
a spectrogram acquisition module, used to perform spectrum analysis on the voice to be recognized and obtain its spectrogram;
a feature extraction module, used to extract features from the spectrogram with an Inception convolution structure from image recognition algorithms to obtain the voice features of the voice to be recognized, and to fuse the spectrogram features with an image feature extraction network to obtain the spectrogram's convolution feature matrix;
a sequence extraction module, used to refine sequence features with a trained bidirectional-LSTM-based encoder-decoder and obtain an output probability distribution matrix;
an alignment module, used to perform input-output alignment mapping on the output probability distribution matrix and convert the result into characters through a num2word dictionary;
and an output module, used to output the voice recognition result.
The invention has the advantages and beneficial effects that:
according to the power grid dispatching voice recognition method and device based on the spectrogram, the power grid dispatching spectrogram is directly used as input, and the method and device have natural advantages compared with other voice recognition frameworks which take traditional voice characteristics as input. Secondly, from the view of a model structure, the method is different from the CNN method in the traditional voice recognition, and the accuracy rate is enhanced by utilizing the Incep convolution structure to process the spectrogram. Finally, the invention utilizes CTC to realize end-to-end training of the whole model, and the special structure of the contained pooling network and the like can make the training more stable.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a power grid dispatching spectrogram of the present invention;
FIG. 3 is a diagram of the convolution network model for the spectrogram image of the present invention;
FIG. 4 is a diagram of the Inception-ResNet-A structure of the present invention;
FIG. 5 is a diagram of the Reduction-A structure of the present invention;
FIG. 6 is a diagram of the BiLSTM structure of the present invention;
FIG. 7 is a flow chart of another method of the present invention;
FIG. 8 is a structure diagram of the spectrogram-based voice recognition device for power grid dispatching.
Detailed Description
The present invention is further illustrated by the following specific examples, which are intended to be illustrative, not limiting and are not intended to limit the scope of the invention.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a power grid dispatching voice recognition method and device based on a spectrogram according to an embodiment of the invention with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a spectrogram-based power grid dispatching voice recognition method according to an embodiment of the present invention. The method comprises the following steps:
S101: perform spectrum analysis on the power grid dispatching voice to be recognized to obtain its spectrogram.
Specifically, the power grid dispatching voice to be recognized is sampled with a preset period to obtain its speech frames. In this embodiment, filter banks are preset, each containing a preset number of filters with different filtering frequencies. Each speech frame is filtered by such a filter bank; because the filters differ, the spectral components contained in each speech frame can be obtained after filtering.

Further, after obtaining the spectral components of each speech frame, a fast Fourier transform is applied to each frame to obtain its spectral values, which represent the short-time average energy of the frame. The spectral values of all speech frames are then used to generate the spectrogram of the voice to be recognized. The horizontal axis of the spectrogram is the time corresponding to each speech frame, the vertical axis is the frequency components contained in the frame, and the value at each coordinate point is the spectral value.

For example, 10 ms is preset as one sampling period, so a speech frame is formed every 10 ms. A filter bank containing 40 filters is set, and each 10 ms speech frame is filtered through it to obtain that frame's filterbank features. Because the bank contains 40 filters, 40 filterbank features are extracted for each speech frame, and a segment of continuous speech forms a two-dimensional image from these features, that is, the spectrogram of the power grid dispatching voice to be recognized, as shown in Fig. 2.
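The framing-and-FFT pipeline above can be sketched in a few lines of numpy. This is a minimal illustration, not the patent's implementation: the 10 ms shift and 16 kHz rate follow the example above, while the 25 ms frame length and the use of a plain magnitude FFT (rather than the 40-filter bank) are illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, frame_len=400, frame_shift=160):
    """Frame the signal, window each frame, and take the log FFT magnitude.
    frame_len=400 / frame_shift=160 correspond to 25 ms / 10 ms at 16 kHz."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft yields frame_len // 2 + 1 frequency bins per frame
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-8)  # log compression, as in typical spectrograms

# one second of a synthetic 16 kHz tone: time on one axis, frequency on the other
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(sig)
print(spec.shape)  # (98, 201): 98 frames x 201 frequency bins
```

Stacking the per-frame spectra row by row produces exactly the two-dimensional time-frequency image described in the text.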
S102: use an Inception convolution structure from image recognition algorithms to extract convolution features from the input power grid dispatching spectrogram image, and convert the extraction result into a convolution feature matrix to obtain the spectrogram image features.

In this embodiment, to improve the accuracy of speech recognition, the acoustic model may use an Inception convolution structure from image recognition algorithms to extract convolution features from the input power grid dispatching spectrogram image and convert the extraction result into a convolution feature matrix. Specifically, the Inception convolution structure comprises a plurality of convolution modules, each containing several convolution networks and a pooling network, where the pooling network performs down-sampling in the time and/or frequency domain.
The Inception feature extraction network is formed by sequentially connecting a plurality of Inception-ResNet modules with attention mechanisms; each such module is formed by sequentially connecting an improved Inception-ResNet unit, a convolution network with an attention mechanism, and a pooling network.

In this embodiment, the general structure of the spectrogram image feature extraction network is shown in Fig. 3. The Inception-ResNet feature extraction network is formed by sequentially connecting a plurality of Inception-ResNet modules with attention mechanisms, where the attention mechanism is mainly embodied in the convolution operations inside the modules. The Inception-ResNet module with attention mechanism is based on the Inception mechanism of GoogLeNet and the shortcut strategy of ResNet. The Inception-ResNet unit used in this embodiment mainly comprises, connected in sequence: a Stem network, 5 Inception-ResNet-A networks, a Reduction-A network, 10 Inception-ResNet-A networks, a Reduction-A network, 5 Inception-ResNet-A networks, an average pooling network, and a Dropout network.

The Inception-ResNet-A network uses multi-branch convolution and reduces the amount of computation with 1 × 1 convolutions; the combination of convolution kernels of different sizes can extract features at more scales. The Reduction-A network down-samples the feature map using a convolution with stride 2. The Stem, Inception-ResNet-A and Reduction-A networks are all stacks of parallel convolution and pooling operations; thanks to the multi-branch design, different branches can be computed in parallel when the model is implemented, improving training efficiency.
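The computation saving from the 1 × 1 convolutions can be seen with a small numpy sketch (channel counts and feature-map size are illustrative, not taken from the patent): a 1 × 1 convolution is just a per-pixel linear map across channels, so squeezing 64 channels down to 16 before a 3 × 3 convolution cuts the multiply count by an order of magnitude.

```python
import numpy as np

# A 1x1 convolution is a per-pixel linear map across channels: reshaping
# (H, W, C_in) to (H*W, C_in) and multiplying by a (C_in, C_out) matrix
# gives the same result as sliding a 1x1 kernel over the feature map.
H, W, C_in, C_out = 8, 8, 64, 16
x = np.random.randn(H, W, C_in)
w = np.random.randn(C_in, C_out)            # the 1x1 kernel as a matrix
y = (x.reshape(-1, C_in) @ w).reshape(H, W, C_out)

# Multiply count of a 3x3 conv with and without the 1x1 bottleneck:
direct = H * W * 3 * 3 * C_in * C_in                             # 3x3 on 64 ch
bottleneck = H * W * C_in * C_out + H * W * 3 * 3 * C_out * C_out  # 1x1 then 3x3
print(direct, bottleneck)  # 2359296 212992
```

The bottleneck path does roughly one tenth of the multiplications while keeping the 3 × 3 spatial context, which is the trade-off the multi-branch design exploits.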
The internal structure of the Inception-ResNet-A network is shown in Fig. 4. A strategy of four parallel branches is used, which greatly enriches the diversity of features and increases their reliability. The "+" symbol in the circle represents the feature-combination operation; combining and adding features preserves the integrity of the features extracted by convolution kernels of different sizes. The direct (shortcut) connection embodies the residual idea of the ResNet network: as the network depth increases, it mitigates gradient vanishing to a certain extent and thus promotes the updating of shallow-network parameters.

Fig. 5 shows the internal structure of the Reduction-A network connected to the Inception-ResNet-A network. The main job of the Reduction-A network is to reduce the features extracted in the previous step. A parallel strategy is used here as well, composed of an average pooling network, a single-layer convolution network and a multi-layer convolution network. Its parameters can be adjusted manually during training, with the aim of enriching the diversity of model training and reducing the information loss caused by the reduction process.

Between individual Inception-ResNet modules, a convolution network with an attention mechanism is used as the transition. The modules are pre-trained module by module: the weights of each Inception-ResNet module are pre-trained, which speeds up convergence when training the complete model. During training, the output of the preceding Inception-ResNet module is first transformed by a three-dimensional 1 × 1 convolution kernel into a feature map with a specific number of channels, each channel corresponding to a key feature. For an image, the weight of each pixel position is stored in a matrix of the same size, and training assigns larger weights to specific regions of this weight matrix. After training, a weight matrix embodying the attention mechanism is obtained at the junction of each Inception-ResNet module. At test time, the weight matrix at each junction operates on the output of the previous network, strengthening the influence of specific regions on the feature vector. Note that during testing, the previous network's output is first converted to the specific number of channels so that it can be combined with the weight matrix, and after the operation the result must be restored to its original format to satisfy the input requirements of the next network. After the original image passes through these modules repeatedly, the feature vector of the power grid dispatching spectrogram is finally obtained.
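On one reading of the weighting step above, applying the trained weight matrix amounts to an elementwise product between the feature map and a spatial weight matrix of the same size, broadcast over channels. A minimal numpy sketch with made-up sizes and weights (the specific operation in the patent may differ):

```python
import numpy as np

H, W, C = 4, 4, 3
features = np.random.randn(H, W, C)      # output of the previous module
attn = np.full((H, W), 0.1)              # weight matrix, same spatial size
attn[1:3, 1:3] = 1.0                     # training emphasises a specific region
weighted = features * attn[:, :, None]   # broadcast the weights over channels
```

Positions inside the emphasised region pass through unchanged while the rest are attenuated, which is how a specific area's influence on the feature vector is strengthened.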
S103: through a recurrent network, namely a deep bidirectional LSTM network, continue extracting sequence features on the basis of the convolution features extracted by the convolution network.

Taking all the power grid dispatching spectrogram image features as input, an RNN encoder-decoder structure built from BiLSTMs is trained, finally yielding a sequence feature extraction network formed by connecting the spectrogram image feature extraction network with the trained bidirectional LSTM encoder-decoder; a probability distribution matrix is finally obtained through softmax.
Further, as a preferred embodiment, referring to Fig. 6, the encoder and the decoder of the bidirectional LSTM encoder-decoder each contain a plurality of LSTM recurrent units, and the output of each encoder LSTM unit carries connection weights to a plurality of decoder LSTM units.

This embodiment employs a bidirectional LSTM encoder-decoder as the means of refining the sequence vector; its specific structure is shown in Fig. 6. Both the encoder and the decoder are built from bidirectional LSTM recurrent units. The figure uses simple two-way arrows to show the bidirectional concept; in the actual construction, one LSTM network is trained in both directions, which can be understood as two "opposite" LSTMs. During training, after one forward propagation, the input sequence is then fed into the LSTM in reverse. Taking the current position as the reference point, each current output depends on the current input, on the preceding part via the "forward" LSTM, and on the following part via the "backward" LSTM. The decoder has the same structure as the encoder. The bidirectional LSTM encoder-decoder is trained with the BPTT (backpropagation through time) algorithm.
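The "two opposite LSTMs" idea can be illustrated with a plain tanh RNN cell in numpy — a toy sketch with random weights, not the LSTM equations themselves. The same cell is run over the sequence and over its reversal, and the two hidden states are concatenated per timestep, so each position sees both its past and its future.

```python
import numpy as np

def rnn_pass(xs, W_x, W_h):
    """Minimal tanh RNN over a sequence; returns the hidden state per step."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        out.append(h)
    return out

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
xs = [rng.standard_normal(d_in) for _ in range(T)]
W_x = rng.standard_normal((d_h, d_in))
W_h = rng.standard_normal((d_h, d_h))

fwd = rnn_pass(xs, W_x, W_h)              # "forward" direction
bwd = rnn_pass(xs[::-1], W_x, W_h)[::-1]  # "backward" direction, re-aligned
# each timestep now sees both its past (fwd) and its future (bwd)
bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(bi), bi[0].shape)  # 5 (8,)
```

A real BiLSTM uses two separate weight sets and gated cells, but the concatenation of a forward pass with a re-reversed backward pass is exactly the bidirectional structure sketched by the two-way arrows in Fig. 6.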
S104: for the output probability distribution matrix predicted by the recurrent network, use CTC to align input and output, and then map the result to characters through a num2word dictionary.

In this embodiment, to improve the accuracy of speech recognition, CTC is used to solve the input-output correspondence problem, after which the output is mapped to characters through a num2word dictionary. CTC is a loss-computation method that replaces the softmax cross-entropy loss. Its advantage is that it introduces a blank character, solving the problem of positions that emit no character, and its gradient can be computed quickly by recursion.

The RNN is followed by the CTC algorithm. The input of the RNN model is audio segments, and the number of outputs equals the input length: given T audio segments, T probability vectors are output, each consisting of probabilities over the dictionary entries. For example, if the number of input audio segments is T and the number of distinct words in the dictionary is N, the dimension of the RNN output is T × N. From this probability distribution, the most likely output result can be obtained.
For a given input X, the training model maximizes the posterior probability P(Y|X) of the target Y; this quantity is differentiable, so the model is trained by gradient descent.

For an input X, the CTC algorithm can assign conditional probabilities to a large number of candidate outputs Y (from the RNN output probability distribution matrix, many values of Y can be obtained as final outputs by combining elements of the matrix), and the CTC algorithm aligns inputs and outputs in the process of computing them.
For the loss function, the goal of CTC is to maximize the probability P(Y|X) for each input-output pair.

For the model, the RNN outputs probabilities p_t(a_t|X), where t denotes time within the RNN. Multiplication means multiplying the character probabilities along one path, and addition means summing over multiple paths: since the CTC alignment between input and output is many-to-one, the conditional probability of an output is the sum over all paths that map to it. The computation of P(Y|X) involves only additions and multiplications and is therefore differentiable; the optimization goal is to minimize the negative log-likelihood.
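The sum over all alignment paths has an efficient dynamic-programming form (the forward algorithm underlying CTC). Below is a compact numpy sketch, an illustration rather than a production implementation: the target is interleaved with blanks and the table α is filled frame by frame. With two frames of a uniform distribution over {blank, 'a'}, the three paths that collapse to "a" — (a,a), (a,−), (−,a) — sum to 3 × 0.25 = 0.75, which the recursion reproduces.

```python
import numpy as np

def ctc_forward(probs, target, blank=0):
    """Sum the probability of every frame-level path that collapses to
    `target` (CTC forward algorithm). probs: (T, N) per-frame distribution
    over N symbols; `blank` is the blank symbol's index."""
    ext = [blank]                      # interleave blanks: "ab" -> [-,a,-,b,-]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), probs.shape[0]
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]      # path may start with a blank...
    alpha[0, 1] = probs[0, ext[1]]     # ...or with the first real symbol
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]        # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]   # advance by one slot
            # skipping a blank is allowed unless the current symbol is blank
            # or repeats the symbol two slots back
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

probs = np.full((2, 2), 0.5)   # T=2 frames, symbols {0: blank, 1: 'a'}
p = ctc_forward(probs, [1])
print(p)  # 0.75
```

Because every step of the recursion uses only additions and multiplications of the per-frame probabilities, the loss -log p is differentiable, as stated above.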
For decoding, the symbol with the maximum RNN output probability is taken at each step via a greedy algorithm or beam search, and the output result is obtained after de-duplication.
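The greedy decoding with de-duplication works directly on the T × N probability matrix: take the argmax symbol at each frame, collapse consecutive repeats, then drop blanks. A minimal sketch (the symbol indices are illustrative):

```python
import numpy as np

def greedy_ctc_decode(probs, blank=0):
    """Argmax per frame, collapse consecutive repeats, drop blanks."""
    best = probs.argmax(axis=1)        # best symbol index at each frame
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:   # de-duplicate, skip blanks
            out.append(int(s))
        prev = s
    return out

# frames vote 'a', 'a', blank, 'b', 'b'  ->  "ab"
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.9, 0.05, 0.05],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.2, 0.7]])
print(greedy_ctc_decode(probs))  # [1, 2]
```

Beam search keeps several candidate prefixes per frame instead of one, trading computation for accuracy; the collapse-and-drop-blank rule is the same.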
A Chinese word segmentation component is used to segment the speech recognition annotation set; a word frequency dictionary is built by ordering the segmented words by frequency of occurrence from high to low, and an index is established by assigning each word a unique number. A single-hidden-layer neural network model is built to extract word embedding features of the description set, mapping the image feature descriptors into the word embedding space.

Each word in the word frequency dictionary is one-hot encoded according to the dictionary size. A word embedding matrix is randomly initialized; its number of rows is the total number of words in the dictionary, and its number of columns is the dimension of the word embedding feature vector. A single-hidden-layer neural network model is built, and the product of the input one-hot vector with the word embedding matrix gives each word's embedding feature vector. The concatenated word embeddings serve as the hidden-layer input of the neural network; a cross-entropy loss function is constructed jointly with the final output and optimized by backpropagation, yielding the final word embedding matrix.
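The one-hot × embedding-matrix product described above is simply a row lookup, which is why frameworks implement embedding layers as table lookups. A small numpy check (dictionary size and embedding dimension are illustrative):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
E = np.random.randn(vocab_size, embed_dim)  # randomly initialised embedding matrix
word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
vec = one_hot @ E                     # the product described in the text...
assert np.allclose(vec, E[word_id])   # ...selects exactly row `word_id`
```

During training only the rows of E for the words actually seen receive gradient, which keeps the update cheap even for large dictionaries.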
S105: output the character result obtained by the power grid dispatching voice recognition.
The results are evaluated, without specific limitation here, and the voice recognition result is output.

The method performs base model training with the AISHELL dataset, and additionally trains with a real power grid dispatching recording dataset. This dataset is a collection of power grid dispatching telephone recordings with different accents from different regions of China; it was recorded in a real power grid dispatching center with a high-fidelity microphone and down-sampled to 16 kHz to produce a power grid dispatching scene voice dataset.

To summarize the spectrogram-based power grid dispatching voice recognition method disclosed by this embodiment: traditional voice features are extracted after the Fourier transform with various manually designed filter banks, which causes information loss in the frequency domain, particularly in the high-frequency region; and for reasons of computational cost they require a very large frame shift, which causes information loss in the time domain, most prominently at fast speaking rates. The invention therefore directly takes the power grid dispatching spectrogram as input, which has a natural advantage over frameworks that take traditional voice features as input. Secondly, in terms of model structure, the method differs from the CNNs of traditional voice recognition by processing the spectrogram with an Inception convolution structure, improving accuracy. Finally, the invention uses CTC to realize end-to-end training of the whole model, and special structures such as the included pooling networks make training more stable.
Fig. 7 is a schematic flow chart of another spectrogram-based power grid dispatching voice recognition method according to an embodiment of the present invention. The method comprises the following steps:
S201: construct a data generator and a model network structure.
In this embodiment, the data generator is constructed in advance and used to generate training data, including generating input features and labels and converting the data into a specific format. A model network structure is constructed for final training and recognition.
Preferably, the preset data format is arranged into a format the network can accept and modified into a format that can be processed in batches. The data generator is constructed to generate model training data. The data format covers voice file formats (pcm (uncompressed), wav (uncompressed, PCM-coded), amr (compressed)), an 8 kHz/16 kHz sampling rate, 16-bit depth, mono, and so on, which are not described further here. For the data generator, reference may be made to the description of generators elsewhere, and details are not repeated here.
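A data generator of this kind can be sketched in Python; the 16-bit mono PCM decoding and the batch layout below are illustrative assumptions, not the patent's actual implementation:

```python
import struct
import random

def pcm_to_samples(pcm_bytes):
    """Decode 16-bit little-endian mono PCM bytes into a list of ints."""
    return list(struct.unpack("<" + "h" * (len(pcm_bytes) // 2), pcm_bytes))

def batch_generator(utterances, batch_size=2):
    """Endlessly yield (features, labels) batches.

    `utterances` is a list of (sample_list, label_id_list) pairs; in the
    patent's setting these would come from 16 kHz, 16-bit mono recordings.
    """
    while True:
        random.shuffle(utterances)
        for i in range(0, len(utterances) - batch_size + 1, batch_size):
            chunk = utterances[i:i + batch_size]
            yield [u[0] for u in chunk], [u[1] for u in chunk]
```

A training loop would then call `next()` on the generator each step; padding batches to a common length is omitted here for brevity.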
Preferably, a network structure is constructed for final training and recognition: a CNN-Inception convolution structure, an RNN-BiLSTM network structure, a CTC loss calculation network, and input/output interfaces are constructed in sequence.
Preferably, the Inception convolution structure includes a plurality of convolution modules. Each convolution module comprises a shunt (branching) network, four convolution networks with 1 x 1 kernels, one convolution network with a 3 x 3 kernel, one with a 5 x 1 kernel, one with a 1 x 5 kernel, a pooling network, and a feature splicing network; the 5 x 1 and 1 x 5 convolution networks together factorize a 5 x 5 convolution, the pooling network has a 2 x 2 scale, and the convolution stride is 1.
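The module layout above implies some simple shape and parameter bookkeeping, which can be sketched as follows (assuming 'same' padding at stride 1, so every branch preserves the input height and width and the splicing network only concatenates channels; the helper names and example channel counts are illustrative, not from the patent):

```python
def conv_out(size, kernel, stride=1, pad="same"):
    """Spatial output size of a convolution ('same' keeps size at stride 1)."""
    if pad == "same":
        return (size + stride - 1) // stride
    return (size - kernel) // stride + 1

def inception_module_shape(h, w, branch_channels):
    """Output shape after splicing parallel branches along the channel axis.

    branch_channels: output channels of each branch, e.g. the 1x1, 3x3,
    factorized 5x1+1x5, and pooled branches described in the text.
    """
    # With 'same' padding and stride 1, every branch keeps (h, w),
    # so the splicing network simply sums the channel counts.
    return (h, w, sum(branch_channels))

def factorized_params(k, cin, cout):
    """Parameter counts of a kxk conv vs. its kx1-then-1xk factorization
    (intermediate channel count assumed equal to cout)."""
    full = k * k * cin * cout
    fact = k * 1 * cin * cout + 1 * k * cout * cout
    return full, fact
```

For example, `factorized_params(5, 16, 16)` shows the 5 x 1 plus 1 x 5 factorization needs far fewer weights than a direct 5 x 5 convolution.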
Preferably, the encoder and decoder of the bidirectional LSTM encoder-decoder each comprise a plurality of LSTM recurrent units, and the output of each encoder LSTM unit carries connection weights to a plurality of decoder LSTM units. A connection weight indicates the range of attention of the decoder's LSTM units: a higher weight indicates a higher degree of correlation, and a weight of 0 indicates no correlation.
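The connection weights described here behave like attention weights: a softmax over alignment scores, followed by a weighted sum of encoder outputs. A minimal sketch (the score computation itself is omitted, and the function names are illustrative):

```python
import math

def attention_weights(scores):
    """Softmax over per-encoder-step scores: higher score, more attention.

    A weight near 0 means that encoder step barely influences the decoder
    step, matching the 'no correlation' case in the text.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(encoder_outputs, weights):
    """Weighted sum of encoder output vectors (each a list of floats)."""
    dim = len(encoder_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_outputs))
            for i in range(dim)]
```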
Preferably, a CTC loss function is constructed, together with a model for computing the gradient of the loss with respect to the (un-normalized) output probabilities, and a back-propagation model.
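The CTC loss can be computed with the standard forward (alpha) recursion over a blank-interleaved label sequence. A minimal pure-Python sketch (the `log_probs` matrix would come from the model's softmax output; the function name is illustrative, not from the patent):

```python
import math

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under the CTC forward algorithm.

    log_probs: T x V matrix of per-frame log-probabilities.
    target: label id sequence without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for t in target:
        ext += [t, blank]
    S = len(ext)
    NEG_INF = float("-inf")

    def logadd(a, b):
        if a == NEG_INF:
            return b
        if b == NEG_INF:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same symbol
            if s > 0:
                a = logadd(a, alpha[s - 1])   # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])   # skip a blank between labels
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    return -logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

The gradient of this loss, computed with an analogous backward (beta) recursion, is what the back-propagation model mentioned above would consume.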
S202, training the set network structure.
In this embodiment, training data converted into a preset data structure is imported, a pre-constructed model is trained, and feedback correction is performed on the model.
Preferably, the audio files are processed, including generating a general file list and feature extraction. The text label files are processed, including generating a mapping from pinyin to numbers and converting pinyin labels into numeric labels.
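The pinyin-to-number mapping and its inverse (the num2word dictionary used later for decoding) can be sketched as follows; the tone-numbered pinyin tokens below are hypothetical examples, not from the patent's data set:

```python
def build_vocab(pinyin_tokens):
    """Map pinyin syllables to integer ids and back."""
    word2num = {p: i for i, p in enumerate(sorted(set(pinyin_tokens)))}
    num2word = {i: p for p, i in word2num.items()}
    return word2num, num2word

def encode_label(pinyin_seq, word2num):
    """Convert a pinyin label sequence into its numeric label sequence."""
    return [word2num[p] for p in pinyin_seq]

def decode_label(ids, num2word):
    """Convert a numeric label sequence back into pinyin."""
    return [num2word[i] for i in ids]
```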
Preferably, data required by training is prepared, a model structure is imported, a model is trained, and model parameters are saved.
Preferably, the softmax matrix predicted by the model is decoded using the CTC criterion and then converted to text through the num2word dictionary.
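A minimal sketch of this decoding step, using greedy (best-path) CTC decoding: collapse the frame-wise argmax ids, drop blanks, then look the remaining ids up in num2word. The patent does not specify greedy versus beam-search decoding, so this is one plausible reading:

```python
def ctc_greedy_decode(prob_matrix, num2word, blank=0):
    """Best-path CTC decode of a T x V softmax matrix into text."""
    # Frame-wise argmax over the vocabulary.
    best = [max(range(len(frame)), key=frame.__getitem__)
            for frame in prob_matrix]
    out, prev = [], None
    for idx in best:
        # Merge consecutive repeats, then drop the blank symbol.
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return "".join(num2word[i] for i in out)
```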
Preferably, the model is tested, test data is prepared, a dictionary is generated, the trained model is loaded, recognition is performed, and the numerical result is converted into a text result.
S203, carrying out spectrum analysis on the power grid dispatching voice to be recognized to obtain a power grid dispatching voice recognition spectrogram.
For the acquisition process of the spectrogram, reference may be made to the description of the relevant contents in the above embodiments, which is not repeated herein.
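One common way to obtain such a spectrogram is a short-time Fourier transform over overlapping frames. A minimal DFT-based sketch (the frame length and hop below are illustrative; a real pipeline would use an FFT, a window function, and log compression):

```python
import cmath
import math

def spectrogram(samples, frame_len=8, hop=4):
    """Magnitude spectrogram: rows are frames (time), columns are
    non-negative frequency bins."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of one frame at bin k.
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            bins.append(abs(acc))
        frames.append(bins)
    return frames
```

Plotting these magnitudes with time on the abscissa, frequency on the ordinate, and magnitude as color yields the spectrogram described in the embodiments above.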
S204, extracting features of the power grid dispatching spectrogram using the pre-constructed Inception convolution structure to obtain the voice features of the voice to be recognized.
After the spectrogram of the power grid dispatching voice to be recognized is obtained, it is input into the Inception convolution structure, which recognizes the spectrogram and extracts the voice features of the voice to be recognized from it.
S205, further extracting sequence features through the recurrent network, on the basis of the convolution features extracted by the convolution network.
After the power grid dispatching spectrogram features are obtained, the sequence features are refined by a trained bidirectional-LSTM-based encoder-decoder to obtain an output probability distribution matrix.
S206, CTC is used to align the input and output, which are then mapped into characters through the num2word dictionary.
The generated probability distribution matrix is processed: input-output alignment mapping is performed on it, and the result is converted into characters through the num2word dictionary.
S207 outputs the result.
The power grid dispatching voice recognition result is verified and output.
Fig. 8 is a schematic structural diagram of another power grid dispatching voice recognition device based on a spectrogram according to an embodiment of the present invention. The device for power grid dispatching voice recognition based on the spectrogram comprises:
a communication interface 301, a memory 302, a processor 303, and a computer program stored on the memory 302 and executable on the processor 303.
The processor 303, when executing the program, implements the spectrogram-based power grid scheduling speech recognition method as claimed in any one of the claims.
A communication interface 301 for communication between the memory 302 and the processor 303.
A memory 302 for storing computer programs operable on the processor 303.
The memory 302 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
The processor 303 is configured to implement the spectrogram-based power grid scheduling speech recognition method according to the foregoing embodiment when executing the program.
If the communication interface 301, the memory 302 and the processor 303 are implemented independently, they may be connected to one another through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended ISA (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not indicate only one bus or one type of bus.
Optionally, in a specific implementation, if the communication interface 301, the memory 302, and the processor 303 are integrated into a chip, the communication interface 301, the memory 302, and the processor 303 may complete communication with each other through an internal interface.
The processor 303 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a 'computer-readable medium' can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having appropriate combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like, as is known in the art, may be implemented.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
Interpretation of related nouns
A spectrogram is a voice spectrum diagram, generally obtained by processing the received time-domain signal; its abscissa is time, its ordinate is frequency, and each coordinate point's value is the voice energy. A two-dimensional plane thus expresses three-dimensional information, with the energy value expressed by color: the darker the color, the stronger the voice energy at that point.
Convolutional Neural Network (CNN): a class of classical networks in deep learning. The convolution kernel is its core: a classic CNN kernel is a small two-dimensional matrix that processes two-dimensional data with different strides and boundary-handling modes, after which pooling operations compress the data.
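The kernel-and-pooling pipeline just described can be illustrated with a tiny pure-Python 'valid' convolution and non-overlapping max pooling (illustrative only; real CNNs use optimized tensor libraries):

```python
def conv2d(image, kernel, stride=1):
    """'Valid' 2-D convolution (strictly, cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling that compresses the feature map."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]
```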
Attention mechanism: aims to imitate the way the human visual system focuses on a local region when processing a visual image. In this way certain information in the data is strengthened and the influence of other information suppressed, providing higher relevance and accuracy for a specific task.
Resnet: a residual network. As the neural network deepens, a situation arises in which the accuracy on the training set decreases, and we can determine that it is not due to overfitting. To solve this problem, Resnet (residual network) is proposed. The method is mainly characterized in that shortcut connection is adopted, and the input is directly connected to the output on the network.
GoogLeNet: the most direct way to improve a deep neural network is to increase its depth and width, but deepening and widening a network increases the overfitting risk and the demand for computing resources. GoogLeNet therefore proposed the Inception structure, using sparse representations instead of full connections to alleviate both problems.
Inclusion: the inclusion module can intuitively understand that a plurality of convolution operations or pooling operations are performed on an input image in parallel, and all output results are spliced into a feature map including multi-scale features.
inclusion-Resne: the inclusion-Resnet is a product of the combination of inclusion and Resnet. Different convolution operations and pooling operations can obtain different information of the input image, and processing these operations in parallel and combining all the results will obtain better image representation. Meanwhile, the shortcut structure can relieve the influence caused by gradient disappearance in a deep network to a certain extent.
RNN: recurrent neural networks are mostly used to process sequence data; their core feature is a recurrent structure in which the output of the previous state participates in the input of the next state.
LSTM: the long short-term memory network is a time-recursive neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. The internal design of the LSTM recurrent unit allows the current unit to be associated with data elements both far and near.
Bidirectional LSTM: bidirectional LSTM adds reverse information transfer, i.e., never-to-now effect, over conventional LSTM, and occurs to address the situation where the current output of the sequence depends on past and future data.
Dropout network: Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability in order to prevent overfitting.
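Inverted dropout, the common formulation, can be sketched as follows (the rescaling by 1/(1-p) keeps the expected activation unchanged; that detail is standard practice rather than stated in the patent):

```python
import random

def dropout(inputs, p=0.5, training=True, rng=random.random):
    """Inverted dropout: zero each unit with probability p during training
    and rescale survivors by 1/(1-p); identity at inference time."""
    if not training or p == 0.0:
        return list(inputs)
    keep = 1.0 - p
    return [x / keep if rng() >= p else 0.0 for x in inputs]
```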
Pooling network: the Pooling network, namely the Pooling network compresses the input characteristic diagram, so that the characteristic diagram is reduced, and the network computation complexity is simplified; on one hand, feature compression is carried out, and main features are extracted.
Inception-ResNet-A network: refers specifically to the Inception-ResNet-A structure within Inception-ResNet. The input feature map is convolved through multiple branches with different kernel sizes, and the shortcut idea of ResNet is fused in to further extract features.
Stem network: refers specifically to the Stem structure in Inception-ResNet; it extracts convolution features at different scales from the input data through multi-branch pooling and convolution operations.
Reduction network: reduces the dimension of the features extracted by the Inception-ResNet-A network through multi-branch fused pooling and convolution kernels with different strides.
CTC: short for Connectionist Temporal Classification. It mainly solves the problem of aligning input sequences with output sequences.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (3)
1. A power grid dispatching voice recognition method based on a spectrogram is characterized in that: the method comprises the following steps:
firstly, performing spectrum analysis on the power grid dispatching voice to be recognized, and converting the continuous power grid dispatching voice to be recognized into a spectrogram representation;
step two, utilizing an Inception convolution structure from an image recognition algorithm to extract convolution features from the input power grid dispatching voice spectrogram image, and converting the extraction result into a convolution feature matrix;
thirdly, further extracting sequence features through a recurrent network on the basis of the convolution features extracted by the convolution network, and outputting them as a probability distribution matrix;
step four, aligning input and output using CTC for the output probability distribution matrix predicted by the recurrent network, and mapping them into characters through the num2word dictionary;
and fifthly, verifying and outputting the character result recognized by the power grid dispatching voice.
2. The power grid dispatching voice recognition method based on the spectrogram according to claim 1, wherein: the recurrent network is a deep bidirectional LSTM network.
3. The spectrogram-based power grid dispatching voice recognition device according to claim 1, wherein: the system comprises the following modules:
the speech spectrogram acquiring module is used for performing spectral analysis on the speech to be recognized to obtain a speech spectrogram of the speech to be recognized;
the feature extraction module is used for extracting features of the spectrogram by utilizing an Inception convolution structure from an image recognition algorithm to obtain voice features of the voice to be recognized, and fusing the spectrogram features with an image feature extraction network to obtain a convolution feature matrix of the spectrogram;
the sequence extraction module is used for purifying sequence characteristics by using a trained bidirectional LSTM-based coder-decoder to obtain an output probability distribution matrix;
the alignment module is used for performing input-output alignment mapping on the output probability distribution matrix and converting the result into characters through the num2word dictionary;
and the output module is used for outputting the voice recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911004454.5A CN110992941A (en) | 2019-10-22 | 2019-10-22 | Power grid dispatching voice recognition method and device based on spectrogram |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911004454.5A CN110992941A (en) | 2019-10-22 | 2019-10-22 | Power grid dispatching voice recognition method and device based on spectrogram |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110992941A true CN110992941A (en) | 2020-04-10 |
Family
ID=70082292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911004454.5A Pending CN110992941A (en) | 2019-10-22 | 2019-10-22 | Power grid dispatching voice recognition method and device based on spectrogram |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110992941A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401530A (en) * | 2020-04-22 | 2020-07-10 | 上海依图网络科技有限公司 | Recurrent neural network and training method thereof |
CN112163514A (en) * | 2020-09-26 | 2021-01-01 | 上海大学 | Method and device for identifying traditional Chinese characters and readable storage medium |
CN112349288A (en) * | 2020-09-18 | 2021-02-09 | 昆明理工大学 | Chinese speech recognition method based on pinyin constraint joint learning |
CN113470620A (en) * | 2021-07-06 | 2021-10-01 | 青岛洞听智能科技有限公司 | Speech recognition method |
CN113688210A (en) * | 2021-09-06 | 2021-11-23 | 北京科东电力控制***有限责任公司 | Power grid dispatching intention identification method |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103985382A (en) * | 2014-05-23 | 2014-08-13 | 国家电网公司 | Power grid dispatching auxiliary interactive method based on speech recognition technology |
CN106710589A (en) * | 2016-12-28 | 2017-05-24 | 百度在线网络技术(北京)有限公司 | Artificial intelligence-based speech feature extraction method and device |
CN106910497A (en) * | 2015-12-22 | 2017-06-30 | 阿里巴巴集团控股有限公司 | A kind of Chinese word pronunciation Forecasting Methodology and device |
CN107729987A (en) * | 2017-09-19 | 2018-02-23 | 东华大学 | The automatic describing method of night vision image based on depth convolution loop neutral net |
CN107871496A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
US20180174576A1 (en) * | 2016-12-21 | 2018-06-21 | Google Llc | Acoustic-to-word neural network speech recognizer |
US20180261213A1 (en) * | 2017-03-13 | 2018-09-13 | Baidu Usa Llc | Convolutional recurrent neural networks for small-footprint keyword spotting |
CN109003601A (en) * | 2018-08-31 | 2018-12-14 | 北京工商大学 | A kind of across language end-to-end speech recognition methods for low-resource Tujia language |
CN109147775A (en) * | 2018-10-18 | 2019-01-04 | 深圳供电局有限公司 | Voice recognition method and device based on neural network |
CN109272993A (en) * | 2018-08-21 | 2019-01-25 | 中国平安人寿保险股份有限公司 | Recognition methods, device, computer equipment and the storage medium of voice class |
CN109272988A (en) * | 2018-09-30 | 2019-01-25 | 江南大学 | Audio recognition method based on multichannel convolutional neural networks |
CN109272990A (en) * | 2018-09-25 | 2019-01-25 | 江南大学 | Audio recognition method based on convolutional neural networks |
CN109448707A (en) * | 2018-12-18 | 2019-03-08 | 北京嘉楠捷思信息技术有限公司 | Voice recognition method and device, equipment and medium |
CN109559737A (en) * | 2018-12-13 | 2019-04-02 | 朱明增 | Electric power system dispatching speech model method for building up |
CN110232533A (en) * | 2019-07-10 | 2019-09-13 | 国网江苏省电力有限公司无锡供电分公司 | A kind of power grid job order scheduling system and method |
CN110265034A (en) * | 2019-04-12 | 2019-09-20 | 国网浙江省电力有限公司衢州供电公司 | A kind of power grid regulation auto-answer method |
- 2019-10-22: CN CN201911004454.5A patent/CN110992941A/en, status: Pending
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103985382A (en) * | 2014-05-23 | 2014-08-13 | 国家电网公司 | Power grid dispatching auxiliary interactive method based on speech recognition technology |
CN106910497A (en) * | 2015-12-22 | 2017-06-30 | 阿里巴巴集团控股有限公司 | A kind of Chinese word pronunciation Forecasting Methodology and device |
CN107871496A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
US20180174576A1 (en) * | 2016-12-21 | 2018-06-21 | Google Llc | Acoustic-to-word neural network speech recognizer |
CN106710589A (en) * | 2016-12-28 | 2017-05-24 | 百度在线网络技术(北京)有限公司 | Artificial intelligence-based speech feature extraction method and device |
US20180261213A1 (en) * | 2017-03-13 | 2018-09-13 | Baidu Usa Llc | Convolutional recurrent neural networks for small-footprint keyword spotting |
CN107729987A (en) * | 2017-09-19 | 2018-02-23 | 东华大学 | The automatic describing method of night vision image based on depth convolution loop neutral net |
CN109272993A (en) * | 2018-08-21 | 2019-01-25 | 中国平安人寿保险股份有限公司 | Recognition methods, device, computer equipment and the storage medium of voice class |
CN109003601A (en) * | 2018-08-31 | 2018-12-14 | 北京工商大学 | A kind of across language end-to-end speech recognition methods for low-resource Tujia language |
CN109272990A (en) * | 2018-09-25 | 2019-01-25 | 江南大学 | Audio recognition method based on convolutional neural networks |
CN109272988A (en) * | 2018-09-30 | 2019-01-25 | 江南大学 | Audio recognition method based on multichannel convolutional neural networks |
CN109147775A (en) * | 2018-10-18 | 2019-01-04 | 深圳供电局有限公司 | Voice recognition method and device based on neural network |
CN109559737A (en) * | 2018-12-13 | 2019-04-02 | 朱明增 | Electric power system dispatching speech model method for building up |
CN109448707A (en) * | 2018-12-18 | 2019-03-08 | 北京嘉楠捷思信息技术有限公司 | Voice recognition method and device, equipment and medium |
CN110265034A (en) * | 2019-04-12 | 2019-09-20 | 国网浙江省电力有限公司衢州供电公司 | A kind of power grid regulation auto-answer method |
CN110232533A (en) * | 2019-07-10 | 2019-09-13 | 国网江苏省电力有限公司无锡供电分公司 | A kind of power grid job order scheduling system and method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401530A (en) * | 2020-04-22 | 2020-07-10 | 上海依图网络科技有限公司 | Recurrent neural network and training method thereof |
CN111401530B (en) * | 2020-04-22 | 2021-04-09 | 上海依图网络科技有限公司 | Training method for neural network of voice recognition device |
WO2021212684A1 (en) * | 2020-04-22 | 2021-10-28 | 上海依图网络科技有限公司 | Recurrent neural network and training method therefor |
CN112349288A (en) * | 2020-09-18 | 2021-02-09 | 昆明理工大学 | Chinese speech recognition method based on pinyin constraint joint learning |
CN112163514A (en) * | 2020-09-26 | 2021-01-01 | 上海大学 | Method and device for identifying traditional Chinese characters and readable storage medium |
CN113470620A (en) * | 2021-07-06 | 2021-10-01 | 青岛洞听智能科技有限公司 | Speech recognition method |
CN113688210A (en) * | 2021-09-06 | 2021-11-23 | 北京科东电力控制***有限责任公司 | Power grid dispatching intention identification method |
CN113688210B (en) * | 2021-09-06 | 2024-02-09 | 北京科东电力控制***有限责任公司 | Power grid dispatching intention recognition method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110992941A (en) | Power grid dispatching voice recognition method and device based on spectrogram | |
CN104143327B (en) | A kind of acoustic training model method and apparatus | |
CN111199727B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
WO2021051544A1 (en) | Voice recognition method and device | |
CN106227721B (en) | Chinese Prosodic Hierarchy forecasting system | |
CN113901799B (en) | Model training method, text prediction method, model training device, text prediction device, electronic equipment and medium | |
CN109840287A (en) | A kind of cross-module state information retrieval method neural network based and device | |
CN111368514B (en) | Model training and ancient poem generating method, ancient poem generating device, equipment and medium | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN109801630A (en) | Digital conversion method, device, computer equipment and the storage medium of speech recognition | |
Sartakhti et al. | Persian language model based on BiLSTM model on COVID-19 corpus | |
Zhu et al. | Robust data2vec: Noise-robust speech representation learning for asr by combining regression and improved contrastive learning | |
CN112259080A (en) | Speech recognition method based on neural network model | |
CN112489634A (en) | Language acoustic model training method and device, electronic equipment and computer medium | |
CN116010874A (en) | Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion | |
CN111563161A (en) | Sentence recognition method, sentence recognition device and intelligent equipment | |
CN111090726A (en) | NLP-based electric power industry character customer service interaction method | |
CN114548116A (en) | Chinese text error detection method and system based on language sequence and semantic joint analysis | |
CN114333759A (en) | Model training method, speech synthesis method, apparatus and computer program product | |
CN117094383B (en) | Joint training method, system, equipment and storage medium for language model | |
CN113254646A (en) | News information classification method and device | |
CN116705073A (en) | Voice emotion recognition method based on bimodal and attentive mechanism | |
CN109977372A (en) | The construction method of Chinese chapter tree | |
CN113689866B (en) | Training method and device of voice conversion model, electronic equipment and medium | |
CN113221546B (en) | Mobile phone banking information data processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200410 |