CN110992941A - Power grid dispatching voice recognition method and device based on spectrogram - Google Patents

Power grid dispatching voice recognition method and device based on spectrogram

Info

Publication number
CN110992941A
CN110992941A
Authority
CN
China
Prior art keywords
spectrogram
network
power grid
convolution
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911004454.5A
Other languages
Chinese (zh)
Inventor
崇志强
姚宗强
徐福华
周作静
马世乾
尚学军
李国栋
***
霍现旭
杨晓静
黄志刚
郭悦
郭凌旭
龚成虎
鄂志君
陈培育
于光耀
王天昊
杨邦宇
李振斌
侯家琛
张津沛
苏立伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Jinghai Power Supply Co of State Grid Tianjin Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Jinghai Power Supply Co of State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd, Jinghai Power Supply Co of State Grid Tianjin Electric Power Co Ltd filed Critical Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority to CN201911004454.5A priority Critical patent/CN110992941A/en
Publication of CN110992941A publication Critical patent/CN110992941A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a power grid dispatching voice recognition method and device based on a spectrogram. Spectrum analysis converts the continuous power grid dispatching voice to be recognized into a spectrogram representation; an Inception convolution structure from an image recognition algorithm extracts convolution features from the input spectrogram image, and the extraction result is converted into a convolution feature matrix; a recurrent network then extracts sequence features on top of the convolution features extracted by the convolution network and outputs them as a probability distribution matrix; CTC (Connectionist Temporal Classification) aligns input and output for the probability distribution matrix predicted by the recurrent network, which is then mapped into characters through a num2word dictionary; finally, the character result recognized from the power grid dispatching voice is checked and output. By improving the accuracy of voice recognition, the invention keeps pronunciation and characters consistent, thereby reducing the operating intensity and pressure on dispatchers.

Description

Power grid dispatching voice recognition method and device based on spectrogram
Technical Field
The invention belongs to the field of electric power system scheduling, relates to man-machine intelligent interaction, and particularly relates to a power grid scheduling voice recognition method and device based on a spectrogram.
Background
With the explosive growth of new energy and distributed photovoltaic projects being connected to the power grid, grid operation is influenced by many uncertain factors. Dispatchers face huge pressure on operating intensity and efficiency, and artificial intelligence is the best choice for solving these problems; within intelligent power grid dispatching, voice recognition technology is the best choice.
In traditional speech recognition, features are extracted after a Fourier transform using various hand-designed filter banks, which causes information loss in the frequency domain, particularly obvious in the high-frequency region. In addition, traditional speech features require a very large frame shift to keep the computational load manageable, which inevitably causes information loss in the time domain; this is most noticeable when the speaking speed is fast.
Speech recognition for power grid dispatching can be realized by combining a machine learning classification algorithm with the hand-crafted features of traditional speech recognition, but the generated Chinese sentences have poor continuity and readability, whether judged objectively or subjectively. Deep learning networks improve on this, but defects such as a low recognition rate remain.
A search of patent publications found no publication identical to the present application.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a power grid dispatching voice recognition method and device based on a spectrogram.
The technical problem to be solved by the invention is realized by the following technical scheme:
a power grid dispatching voice recognition method based on a spectrogram is characterized in that: the method comprises the following steps:
firstly, performing spectrum analysis on the power grid scheduling voice to be recognized, and converting continuous power grid scheduling voice to be recognized into a spectrogram to represent;
step two, utilizing an inclusion convolution structure in an image recognition algorithm to extract convolution characteristics of the spectrogram image of the input power grid dispatching language, and converting an extraction result into a convolution characteristic matrix;
thirdly, continuously extracting sequence characteristics on the basis of the convolution characteristics extracted by the convolution network through a circulation network, and outputting the sequence characteristics as a probability distribution matrix;
step four, aligning input and output by using CTC for an output probability distribution matrix predicted by the cyclic network, and mapping the input and output into characters through a dictionary num2 word;
and fifthly, verifying and outputting the character result recognized by the power grid dispatching voice.
Furthermore, the circulating network is a deep network bidirectional LSTM network.
A power grid dispatching speech recognition device based on a spectrogram, characterized by comprising the following modules:
the spectrogram acquiring module, used for performing spectral analysis on the speech to be recognized to obtain its spectrogram;
the feature extraction module, used for extracting features of the spectrogram with an Inception convolution structure from an image recognition algorithm to obtain the voice features of the speech to be recognized, and fusing the spectrogram features with an image feature extraction network to obtain the convolution feature matrix of the spectrogram;
the sequence extraction module, used for refining sequence features with a trained bidirectional-LSTM-based encoder-decoder to obtain an output probability distribution matrix;
the alignment module, used for input-output alignment mapping of the output probability distribution matrix, converting the result into characters through a num2word dictionary;
and the output module is used for outputting the voice recognition result.
The invention has the advantages and beneficial effects that:
according to the power grid dispatching voice recognition method and device based on the spectrogram, the power grid dispatching spectrogram is directly used as input, and the method and device have natural advantages compared with other voice recognition frameworks which take traditional voice characteristics as input. Secondly, from the view of a model structure, the method is different from the CNN method in the traditional voice recognition, and the accuracy rate is enhanced by utilizing the Incep convolution structure to process the spectrogram. Finally, the invention utilizes CTC to realize end-to-end training of the whole model, and the special structure of the contained pooling network and the like can make the training more stable.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a power grid dispatching spectrogram of the present invention;
FIG. 3 is a diagram of the convolution network model for the spectrogram image of the present invention;
FIG. 4 is a diagram of the Inception-ResNet-A structure of the present invention;
FIG. 5 is a diagram of the Reduction-A structure of the present invention;
FIG. 6 is a diagram of the BiLSTM structure of the present invention;
FIG. 7 is a flow chart of another method of the present invention;
fig. 8 is a structure diagram of a speech recognition device based on spectrogram for power grid dispatching.
Detailed Description
The present invention is further illustrated by the following specific examples, which are illustrative rather than limiting and are not intended to limit the scope of the invention.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a power grid dispatching voice recognition method and device based on a spectrogram according to an embodiment of the invention with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a power grid dispatching voice recognition method based on a spectrogram according to an embodiment of the present invention. The method comprises the following steps:
s101, carrying out spectrum analysis on the power grid dispatching voice to be recognized to obtain a power grid dispatching voice spectrogram of the voice to be recognized.
Specifically, the power grid dispatching voice to be recognized is sampled at a preset period to obtain each voice frame. In this embodiment, filter banks are preset, each containing a preset number of filters set to different filtering frequencies. Each speech frame is then filtered through a filter bank comprising the preset number of filters; because each filter in the bank filters differently, each frequency spectrum component contained in a speech frame can be obtained after filtering.
Further, after obtaining the frequency spectrum components of each speech frame, a fast Fourier transform is performed on each frame to obtain its spectrum values, which represent the short-time average energy of the frame. The spectrum values of all frames are then used to generate the spectrogram of the voice to be recognized. The horizontal axis of the spectrogram is the time of the voice frame, the vertical axis is the frequency components contained in the frame, and each coordinate point holds a spectrum value.
For example, with 10 ms preset as the sampling period, one speech frame is formed every 10 ms. A filter bank comprising 40 filters is set, and each 10 ms frame is filtered through it to obtain the frame's filter-bank features. Because the bank contains 40 filters, each frame yields 40 filter-bank features, and a section of continuous voice to be recognized forms a two-dimensional image from the extracted features, that is, the spectrogram of the power grid dispatching voice to be recognized, as shown in fig. 2.
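As a concrete illustration of the framing and filter-bank procedure above, here is a minimal numpy sketch. The 10 ms frames, 40 filters, and FFT follow the text; the 16 kHz sample rate, the linear (rather than mel) filter spacing, and all function names are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def spectrogram(signal, sr=16000, frame_ms=10, n_filters=40):
    """Frame the signal, take FFT magnitudes, apply a bank of
    n_filters triangular band-pass filters -> (n_frames, n_filters)."""
    frame_len = sr * frame_ms // 1000              # 160 samples per 10 ms frame
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrum per frame
    n_bins = mag.shape[1]
    # Illustrative linearly spaced triangular filters (real systems use mel spacing)
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    fbank = np.zeros((n_filters, n_bins))
    bins = np.arange(n_bins)
    for i in range(n_filters):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        up = (bins - lo) / max(c - lo, 1e-9)
        down = (hi - bins) / max(hi - c, 1e-9)
        fbank[i] = np.clip(np.minimum(up, down), 0, None)
    return np.log(mag @ fbank.T + 1e-10)           # log energies: the "image"

audio = np.random.randn(16000)                     # one second of fake audio
img = spectrogram(audio)
print(img.shape)                                   # (100, 40): 100 frames x 40 filters
```

One second of audio thus becomes a 100 x 40 matrix, the two-dimensional "image" that the convolution network consumes.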
S102, an Inception convolution structure from an image recognition algorithm extracts convolution features from the input power grid dispatching spectrogram image, and the extraction result is converted into a convolution feature matrix to obtain the power grid dispatching spectrogram image features.
In this embodiment, to improve the accuracy of speech recognition, an Inception convolution structure from an image recognition algorithm may be used in the acoustic recognition model to extract convolution features of the input power grid dispatching spectrogram image, and the extraction result is converted into a convolution feature matrix. Specifically, the Inception convolution structure includes a plurality of convolution modules, each comprising several convolutional layers and a pooling layer. The pooling layer is used for down-sampling in the time and/or frequency domain.
The Inception feature extraction network is formed by sequentially connecting several Inception-ResNet modules with attention mechanisms, and each such module is formed by sequentially connecting an improved Inception-ResNet unit, a convolutional layer with an attention mechanism, and a pooling layer.
In this embodiment, the general structure of the spectrogram image feature extraction network is shown in fig. 3. The Inception-ResNet feature extraction network is formed by sequentially connecting several Inception-ResNet modules with attention mechanisms, the attention mechanism being embodied mainly in the convolution operations inside the modules. The Inception-ResNet module with attention mechanism builds on the Inception mechanism of GoogLeNet and the bypass (shortcut) strategy of ResNet. The Inception-ResNet unit used in this embodiment mainly comprises, connected in sequence: a Stem network, 5 Inception-ResNet-A blocks, a Reduction-A block, 10 Inception-ResNet-A blocks, a Reduction-A block, 5 Inception-ResNet-A blocks, an average pooling layer, and a Dropout layer.
The Inception-ResNet-A network uses multi-branch convolution, reduces computation with 1 x 1 convolutions, and its combination of convolution kernels of different sizes extracts features at more scales. The Reduction-A network down-samples the feature map using convolution with stride 2. The Stem, Inception-ResNet-A, and Reduction-A networks are all stacks of parallel convolution and pooling operations; because of the multi-branch design, different branches can be computed in parallel when the model is implemented, improving training efficiency.
The internal structure of the Inception-ResNet-A network is shown in FIG. 4; a four-branch parallel strategy greatly enriches the diversity of features and increases their reliability. The "+" symbol in the circle in the figure represents the feature combination operation; combining and adding features preserves the integrity of the features extracted by kernels of different sizes. The direct (shortcut) connection embodies the residual idea of ResNet: as the network deepens, it avoids vanishing gradients to some extent and thereby promotes the updating of shallow-layer parameters.
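The combine-and-add pattern described above can be sketched at the shape level. In this toy, per-pixel linear maps stand in for real convolutions, and the branch count, scaling factor, and sizes are all illustrative assumptions, not the patent's actual network:

```python
import numpy as np

def inception_resnet_a(x, rng):
    """Toy Inception-ResNet-A step on a (H, W, C) feature map:
    parallel branches, channel concatenation (the '+' in the figure),
    a 1x1-style projection back to C channels, then the residual shortcut."""
    h, w, c = x.shape
    # Three illustrative branches, each a per-pixel linear map (stands in for convolution)
    branches = [x @ rng.standard_normal((c, c // 2)) for _ in range(3)]
    merged = np.concatenate(branches, axis=-1)        # feature combination
    projected = merged @ rng.standard_normal((merged.shape[-1], c))
    return x + 0.1 * projected                        # direct (residual) connection

rng = np.random.default_rng(0)
fmap = rng.standard_normal((32, 32, 16))
out = inception_resnet_a(fmap, rng)
print(out.shape)                                      # same shape as the input
```

The key structural point survives the simplification: the output has the input's shape, so the shortcut addition is well defined and gradients can flow through it unchanged.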
Fig. 5 shows the internal structure of the Reduction-A network connected to the Inception-ResNet-A network. The Reduction-A network mainly reduces (down-samples) the features extracted in the previous step. A parallel strategy is used here as well: it is composed of an average pooling layer, a single-layer convolution branch, and a multi-layer convolution branch. Its parameters can be adjusted manually during model training, with the aim of enriching the diversity of model training and reducing the information loss caused by the reduction.
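The downsampling role of the Reduction-A step can be illustrated with a toy parallel-branch reduction. A (time, frequency, channel) layout is assumed, and the stride-2 slicing merely stands in for a real stride-2 convolution:

```python
import numpy as np

def reduction_a(x):
    """Toy Reduction-A step: two parallel stride-2 paths over a (T, F, C)
    feature map, average pooling and strided subsampling, concatenated
    on the channel axis, halving the time/frequency resolution."""
    pooled = 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])
    strided = x[0::2, 0::2]                   # stands in for a stride-2 convolution
    return np.concatenate([pooled, strided], axis=-1)

x = np.random.randn(20, 40, 8)
y = reduction_a(x)
print(y.shape)                                # (10, 20, 16)
```

Both branches see the same input but summarize it differently, which is the sense in which the parallel design reduces the information loss of a single downsampling path.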
Between the individual Inception-ResNet modules, a convolutional layer with an attention mechanism is used for the transition. The Inception-ResNet modules are pre-trained module by module; pre-training the weights of each module speeds up convergence when the complete model is trained. During model training, the output of the preceding Inception-ResNet module is transformed with a three-dimensional 1 x 1 convolution kernel into a feature map with a specific number of channels, each channel corresponding to a key feature. For an image, the weight of each pixel position is stored in a matrix of the same size, and training assigns larger weights to specific regions of this weight matrix. After training, a weight matrix embodying the attention mechanism is obtained at the junction of each Inception-ResNet module. At test time, the weight matrix at each junction is combined with the output of the previous layer, strengthening the influence of specific regions on the feature vector. Note that during testing the output of the previous layer is first converted to the specific number of channels so it can be combined with the weight matrix; after the operation, the result is restored to its original format to satisfy the input requirement of the next layer. After the original image passes through these modules repeatedly, the feature vector of the power grid dispatching spectrogram is finally obtained.
S103, sequence features are further extracted, on the basis of the convolution features extracted by the convolution network, through a recurrent network, namely a deep bidirectional LSTM network.
Taking all power grid dispatching spectrogram image features as input, an RNN encoder-decoder built from BiLSTM is trained, yielding a sequence feature extraction network formed by connecting the spectrogram image feature extraction network with the trained bidirectional LSTM encoder-decoder; the probability distribution matrix is finally obtained through softmax.
In a further preferred embodiment, referring to fig. 6, the encoder and the decoder of the bidirectional LSTM encoder-decoder each comprise a plurality of LSTM recurrent units, and the output of each LSTM unit of the encoder carries connection weights to a plurality of LSTM units of the decoder.
This embodiment employs a bidirectional LSTM encoder-decoder to refine the sequence vectors. The specific structure of the bidirectional-LSTM-based encoder-decoder is shown in fig. 6. Both the encoder and decoder are built from bidirectional LSTM recurrent units. The figure uses simple two-way arrows to show the bidirectional concept; in the actual construction, one LSTM network is trained in both directions, which can be understood as two "opposite" LSTMs. After one forward pass during training, the input sequence is then fed into the LSTM backward. Taking the current position as the reference point, each output depends on the current input, the preceding part via the "forward" LSTM, and the following part via the "backward" LSTM. The decoder has the same structure as the encoder. The bidirectional LSTM encoder-decoder is trained with the BPTT (backpropagation through time) algorithm.
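The forward/backward dependency described above can be shown with a toy recurrence; a running decayed sum stands in for a real LSTM cell, and the decay factor and input sequence are purely illustrative:

```python
def bidirectional_states(seq):
    """Toy illustration of the bidirectional idea: a 'forward' pass
    accumulates a state over the prefix, a 'backward' pass over the
    suffix, and the output at each step pairs both."""
    fwd, state = [], 0.0
    for x in seq:                       # forward LSTM stand-in: decayed running sum
        state = 0.5 * state + x
        fwd.append(state)
    bwd, state = [], 0.0
    for x in reversed(seq):             # the "opposite" LSTM over the reversed input
        state = 0.5 * state + x
        bwd.append(state)
    bwd.reverse()
    return list(zip(fwd, bwd))          # per-step (forward, backward) pair

states = bidirectional_states([1.0, 2.0, 3.0])
print(states[0])                        # (1.0, 2.75): step 0 already "sees" the future
```

The first output pair already contains information from the end of the sequence through the backward state, which is exactly what makes the bidirectional variant stronger than a one-directional pass for transcription.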
S104, CTC is used to align input and output for the probability distribution matrix predicted by the recurrent network, which is then mapped into characters through a num2word dictionary.
In this embodiment, to improve the accuracy of speech recognition, CTC is used to solve the input-output correspondence problem, after which the output is mapped into characters through a num2word dictionary. CTC is a loss computation method that replaces the softmax loss. Its advantage is the introduction of a blank character, solving the problem of positions that carry no character; by recursion, the gradient can be computed quickly.
The RNN is followed by the CTC algorithm. The input of the RNN model is audio segments, and the number of outputs equals the input length: given T audio segments, T probability vectors are output, each consisting of probabilities over the dictionary. For example, if the number of input audio segments is T and the number of distinct words in the dictionary is N, the dimension of the RNN output is T x N. From this probability distribution we can obtain the most likely output result.
For a given input X, the training model maximizes the posterior probability P(Y|X) of Y; this objective is differentiable, so the model is trained with gradient descent.
For one input X (an RNN output probability distribution matrix), the CTC algorithm can assign conditional probabilities to a large number of candidate outputs Y (many Y values can be obtained by combining elements of the matrix), and CTC aligns inputs and outputs while computing them.
To calculate the loss function, the goal of CTC is to maximize the following probability for a pair of inputs and outputs:
P(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)
Here the RNN outputs the probabilities p_t(a_t|X), where t denotes time within the RNN. The product multiplies the character probabilities along one path, and the sum runs over multiple paths; since the CTC input-output alignment is many-to-one, the conditional probability of an output is the sum over all of its paths. P(Y|X) involves only addition and multiplication, so it is differentiable. The goal of model optimization is to minimize the negative log-likelihood:
\mathcal{L} = -\sum_{(X,Y) \in \mathcal{D}} \log P(Y \mid X)
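The sum-over-paths definition of P(Y|X) can be checked by brute force on a toy two-step example; the alphabet, probabilities, and the convention that symbol 0 is the blank are made up for illustration:

```python
import itertools
import math

def collapse(path, blank=0):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_prob(probs, target, blank=0):
    """Brute-force P(Y|X): sum the product of per-step probabilities
    over every alignment path that collapses to the target."""
    T, N = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(N), repeat=T):
        if collapse(path, blank) == tuple(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total

# Toy 2-step distribution over {0: blank, 1: 'a'}
probs = [[0.4, 0.6], [0.3, 0.7]]
# Paths collapsing to ('a',): (1,1), (1,0), (0,1)
print(ctc_prob(probs, [1]))   # 0.6*0.7 + 0.6*0.3 + 0.4*0.7 = 0.88
```

Real CTC implementations replace this exponential enumeration with the recursion mentioned in the text (a forward-backward dynamic program), but the quantity computed is the same.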
At each step, the node with the maximum RNN output probability is obtained through a greedy algorithm with beam search, and the output result is then obtained by de-duplication:
Y^{*} = \underset{Y}{\operatorname{argmax}}\; P(Y \mid X)
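The greedy decode-then-de-duplicate procedure can be sketched in a few lines; the probability matrix and symbol inventory below are invented for illustration, with symbol 0 again taken as the blank:

```python
def greedy_ctc_decode(prob_matrix, blank=0):
    """Greedy decoding: take the argmax symbol at each step, merge
    consecutive repeats, then drop blanks (the de-duplication step)."""
    best = [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]
    decoded, prev = [], None
    for s in best:
        if s != prev and s != blank:
            decoded.append(s)
        prev = s
    return decoded

# T=5 steps, N=3 symbols {0: blank, 1, 2}
matrix = [[0.1, 0.8, 0.1],
          [0.1, 0.7, 0.2],
          [0.9, 0.05, 0.05],
          [0.2, 0.1, 0.7],
          [0.2, 0.1, 0.7]]
print(greedy_ctc_decode(matrix))   # [1, 2]: repeats merged, blank dropped
```

Beam search keeps several partial hypotheses instead of the single argmax path, but the collapse rule applied to each hypothesis is the same.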
A Chinese word segmentation component is used to segment the speech recognition annotation set; a word-frequency dictionary is built by ordering words from high to low frequency after segmentation, and an index is established by giving each word a unique number. A single-hidden-layer neural network model is built to extract word embedding features of the description set and map the image feature descriptors into the word embedding space.
Each word in the word-frequency dictionary is one-hot encoded according to the dictionary size. A word embedding matrix is randomly initialized; its number of rows is the total number of words in the dictionary, and its number of columns is the dimension of the word embedding feature vector. In the single-hidden-layer neural network model, the input one-hot vector is multiplied by the word embedding matrix to obtain each word's embedding feature vector. The concatenated word embeddings serve as the hidden-layer input of the network; the final output is used to construct a cross-entropy loss function, which is optimized with backpropagation to obtain the final word embedding matrix.
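The product of a one-hot vector with the embedding matrix described above is simply a row lookup, which a few lines of numpy make explicit (the vocabulary size and embedding dimension are arbitrary):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
embedding = np.random.randn(vocab_size, embed_dim)   # randomly initialized matrix

word_index = 2                                       # index from the word-frequency dictionary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by the embedding matrix selects
# exactly the word's row: its embedding feature vector.
vec = one_hot @ embedding
assert np.allclose(vec, embedding[word_index])
print(vec.shape)                                     # (3,)
```

This equivalence is why practical implementations store the matrix and index into it directly rather than materializing one-hot vectors.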
And S105, outputting the character result obtained by the power grid dispatching voice recognition.
The results are evaluated, which is not specifically limited here, and the voice recognition result is output.
The method uses the AISHELL data set for base model training, together with a real power grid dispatching recording data set. This data set is a collection of power grid dispatching telephone recordings from different areas of China with different accents; recordings were made in a real power grid dispatching center with a high-fidelity microphone and down-sampled to 16 kHz to produce the power grid dispatching scene voice data set.
With the power grid dispatching voice recognition method based on the spectrogram disclosed in this embodiment: traditional voice features are extracted after a Fourier transform with various hand-designed filter banks, causing information loss in the frequency domain that is particularly obvious in the high-frequency region; traditional features also require a very large frame shift for computational reasons, which inevitably causes information loss in the time domain, most prominently at fast speaking speeds. The invention therefore uses the power grid dispatching spectrogram directly as input, a natural advantage over voice recognition frameworks that take traditional voice features as input. Secondly, in terms of model structure, the method differs from the CNN approach of traditional voice recognition and enhances accuracy by processing the spectrogram with an Inception convolution structure. Finally, the invention uses CTC to realize end-to-end training of the whole model, and special structures such as the included pooling layers make training more stable.
Fig. 7 is a schematic flow chart of another power grid scheduling speech recognition method based on a spectrogram according to an embodiment of the present invention. The method comprises the following steps:
s201, constructing a data generator and constructing a model network structure.
In this embodiment, the data generator is constructed in advance to generate training data, including generating input features and labels and converting the data into a specific format. The model network structure is constructed for the final training and recognition setup.
Preferably, the preset data format is arranged into a format the network can accept, and modified into a format the batch can process. The data generator is constructed to generate model training data. The data format covers voice file formats (pcm (uncompressed), wav (uncompressed, pcm-coded), amr (compressed)) with an 8 k/16 k sampling rate, 16-bit depth, mono, and so on, which is not described further here. For the data generator, refer to standard generator implementations; details are not repeated here.
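As a small sketch of the 16-bit mono PCM format mentioned above, the following decodes raw little-endian int16 bytes into a float waveform suitable for feature extraction; the synthetic sine tone and the 16 kHz rate are illustrative assumptions:

```python
import numpy as np

# Fake two seconds of 16 kHz, 16-bit, mono PCM: raw little-endian int16 bytes,
# as would be read from an uncompressed .pcm file (or the data chunk of a .wav).
sr = 16000
samples = (np.sin(2 * np.pi * 440 * np.arange(2 * sr) / sr) * 32767).astype("<i2")
raw = samples.tobytes()

# Decode back to a float waveform in [-1, 1] for feature extraction.
decoded = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
print(len(decoded) / sr)   # 2.0 seconds
```

Compressed formats such as amr would need a decoding step first, but end in the same normalized float array.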
Preferably, a network structure is constructed for final training and recognition: a CNN Inception convolution structure, an RNN BiLSTM network structure, a CTC loss computation network, and the input/output interfaces are built in sequence.
Preferably, the Inception convolution structure includes a plurality of convolution modules. Each convolution module comprises a branching layer, 4 convolutional layers with 1 x 1 kernels, 1 convolutional layer with a 3 x 3 kernel, 1 convolutional layer with a 5 x 1 kernel, 1 convolutional layer with a 1 x 5 kernel, a pooling layer, and a feature concatenation layer; the 5 x 1 and 1 x 5 convolutional layers together form a 5 x 5 convolution, the pooling layer has scale 2 x 2, and the convolution stride is 1.
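The factorization of a 5 x 5 kernel into a 5 x 1 followed by a 1 x 5 can be verified numerically: applying the two small kernels in sequence equals a single 5 x 5 convolution with their outer product (a rank-1 kernel), at 10 weights instead of 25 per channel pair. A minimal numpy check follows; the `conv2d` helper is an illustrative "valid" correlation, not a library routine:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, k):
    """'Valid' 2-D correlation of a 2-D input with a 2-D kernel."""
    win = sliding_window_view(x, k.shape)
    return np.einsum('ijkl,kl->ij', win, k)

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 10))
col = rng.standard_normal((5, 1))        # 5x1 kernel
row = rng.standard_normal((1, 5))        # 1x5 kernel

# Applying 5x1 then 1x5 equals one 5x5 convolution with their outer product.
two_step = conv2d(conv2d(x, col), row)
one_step = conv2d(x, col @ row)          # the equivalent (rank-1) 5x5 kernel
assert np.allclose(two_step, one_step)
print(two_step.shape)                    # (6, 6)
```

The trade-off is that two learned small kernels can only represent rank-1 (separable) 5 x 5 kernels, which in practice is offset by the intervening nonlinearities and the parameter savings.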
Preferably, the encoder and decoder of the bidirectional LSTM encoder-decoder each comprise a plurality of LSTM recurrent units, and the output of each LSTM unit of the encoder carries connection weights to a plurality of LSTM units of the decoder. A connection weight indicates the range of interest over the decoder's LSTM units: a higher weight indicates a higher degree of correlation, and a weight of 0 indicates no correlation.
Preferably, a CTC loss function is constructed along with a gradient computation model: the gradient of the loss with respect to the (unnormalized) output probabilities, and the backpropagation model.
S202, training the set network structure.
In this embodiment, training data converted into a preset data structure is imported, a pre-constructed model is trained, and feedback correction is performed on the model.
Preferably, the audio files are processed, including generating a general file list, feature extraction, and the like. The text label files are processed, including generating a mapping from pinyin to numbers and converting the pinyin labels into numeric labels.
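The pinyin-to-number mapping can be sketched as below; the syllable spellings are hypothetical examples, and reserving id 0 for the CTC blank is an assumption of this sketch:

```python
def build_vocab(label_lines):
    """Map each pinyin syllable to an integer id; 0 is reserved for the CTC blank."""
    vocab = {}
    for line in label_lines:
        for syl in line.split():
            vocab.setdefault(syl, len(vocab) + 1)
    return vocab

def labels_to_ids(line, vocab):
    """Convert one whitespace-separated pinyin label line into numeric labels."""
    return [vocab[s] for s in line.split()]
```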
Preferably, data required by training is prepared, a model structure is imported, a model is trained, and model parameters are saved.
Preferably, the softmax matrix predicted by the model is decoded using the CTC criterion and then converted into text through the dictionary num2word.
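A sketch of this decoding step, using greedy best-path CTC decoding (a simplification; beam search is also common) and a hypothetical num2word table, since the real dictionary is generated from the training labels:

```python
import numpy as np

def ctc_greedy_decode(softmax_matrix, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    out, prev = [], blank
    for p in softmax_matrix.argmax(axis=1):
        if p != blank and p != prev:
            out.append(int(p))
        prev = p
    return out

# Hypothetical id -> syllable dictionary standing in for num2word.
num2word = {1: "he2", 2: "zha2", 3: "ting2"}

def ids_to_text(ids, table=num2word):
    """Map decoded class ids to text through the dictionary."""
    return " ".join(table[i] for i in ids)
```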
Preferably, the model is tested, test data is prepared, a dictionary is generated, the trained model is loaded, recognition is performed, and the numerical result is converted into a text result.
S203, carrying out spectrum analysis on the power grid dispatching voice to be recognized to obtain a power grid dispatching voice recognition spectrogram.
For the acquisition process of the spectrogram, reference may be made to the description of the relevant contents in the above embodiments, which is not repeated herein.
S204, extracting features of the power grid dispatching spectrogram by using the pre-constructed Inception convolution structure to obtain the voice features of the voice to be recognized.
After the spectrogram of the power grid dispatching voice to be recognized is obtained, it is input into the Inception convolution structure, which recognizes the spectrogram and extracts the voice features of the voice to be recognized from it.
S205, through the recurrent network, sequence features are further extracted on the basis of the convolution features extracted by the convolutional network.
After the power grid dispatching spectrogram features are obtained, the sequence features are refined by the trained bidirectional-LSTM-based encoder-decoder to obtain an output probability distribution matrix.
S206, CTC is used to align input and output, which are then mapped into characters through the dictionary num2word.
The generated probability distribution matrix is processed: input-output alignment mapping is performed on the output probability distribution matrix, and the result is converted into characters through the dictionary num2word.
S207 outputs the result.
And the output module is used for verifying and outputting the power grid dispatching voice recognition result.
The embodiment of the invention discloses a power grid dispatching voice recognition method based on a spectrogram. Traditional voice features are extracted after a Fourier transform by various artificially designed filter banks, which causes information loss in the frequency domain, particularly in the high-frequency region; moreover, for reasons of computational cost, traditional features require a very large frame shift, which causes information loss in the time domain that becomes more pronounced at higher speaking rates. Using the power grid dispatching spectrogram directly as input therefore has a natural advantage over voice recognition frameworks that take traditional voice features as input. Secondly, in terms of model structure, the method differs from the CNNs of traditional voice recognition by processing the spectrogram with an Inception convolution structure, which improves accuracy. Finally, the invention uses CTC to realize end-to-end training of the whole model, and the included pooling network and similar structures make training more stable.
Fig. 8 is a schematic structural diagram of another power grid dispatching voice recognition device based on a spectrogram according to an embodiment of the present invention. The device for power grid dispatching voice recognition based on the spectrogram comprises:
a communication interface 301, a memory 302, a processor 303 and a computer program stored on the memory 302 and executable on the processor 303.
The processor 303, when executing the program, implements the spectrogram-based power grid scheduling speech recognition method as claimed in any one of the claims.
A communication interface 301 for communication between the memory 302 and the processor 303.
A memory 302 for storing computer programs operable on the processor 303.
The memory 302 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 303 is configured to implement the spectrogram-based power grid scheduling speech recognition method according to the foregoing embodiment when executing the program.
If the communication interface 301, the memory 302 and the processor 303 are implemented independently, they may be connected to one another through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not indicate only one bus or one type of bus.
Optionally, in a specific implementation, if the communication interface 301, the memory 302, and the processor 303 are integrated into a chip, the communication interface 301, the memory 302, and the processor 303 may complete communication with each other through an internal interface.
The processor 303 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a 'computer-readable medium' can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having appropriate combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like, as is known in the art, may be implemented.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
Interpretation of related nouns
A spectrogram is a voice frequency-spectrum image, generally obtained by processing a received time-domain signal. Its abscissa is time, its ordinate is frequency, and the value at each coordinate point is the voice energy. A two-dimensional plane thus expresses three-dimensional information, with the energy value expressed by color: the darker the color, the stronger the voice energy at that point.
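A minimal way to compute such a spectrogram from a time-domain signal is short-time Fourier analysis; the frame length, hop size, Hann window and log scaling below are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram of a 1-D float signal.

    Returns an array of shape (freq_bins, frames): ordinate = frequency,
    abscissa = time, value = log energy (larger = stronger voice energy).
    """
    n = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    frames *= np.hanning(frame_len)              # taper each frame
    mag = np.abs(np.fft.rfft(frames, axis=1))    # (frames, frame_len//2 + 1)
    return np.log(mag + 1e-10).T                 # log scale, transpose to (freq, time)
```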
Convolutional Neural Network (CNN): convolutional neural networks are a class of classical networks in deep learning. The convolution kernel is their core: a classic CNN kernel is a small two-dimensional square matrix that processes two-dimensional data with different strides and boundary-handling modes, after which pooling operations compress the data.
Attention mechanism: the attention mechanism simulates the way the human visual system focuses on a local region when processing a visual image. In this way, certain information in the data can be strengthened and the influence of other information suppressed, giving higher relevance and accuracy for a specific task.
Resnet: a residual network. As a neural network deepens, accuracy on the training set can decrease, and this can be shown not to be due to overfitting. Resnet was proposed to solve this problem. Its main feature is the shortcut connection, which connects the input directly to the output within the network.
Googlenet: GoogLeNet notes that the most direct way to improve a deep neural network is to increase its depth and width, but deepening and widening a network increases the risk of overfitting and the demand for computing resources. GoogLeNet therefore proposes the Inception structure, replacing full connections with sparse representations to alleviate both problems.
Inception: an Inception module can be understood intuitively as performing several convolution or pooling operations on an input image in parallel and splicing all the outputs into a feature map containing multi-scale features.
Inception-Resnet: Inception-Resnet is the product of combining Inception and Resnet. Different convolution and pooling operations obtain different information about the input image, and processing them in parallel and combining all the results yields a better image representation. Meanwhile, the shortcut structure can, to a certain extent, mitigate the effect of vanishing gradients in a deep network.
RNN: RNNs (recurrent neural networks) are mostly used to process sequence data; their core feature is a recurrent structure in which the output of the previous state participates in the input of the next state.
LSTM: LSTM (long short-term memory network) is a time-recursive neural network suitable for processing and predicting important events separated by relatively long intervals and delays in a time series. The internal design of the LSTM recurrent unit allows the current unit to be associated with data elements both far away and nearby.
Bidirectional LSTM: bidirectional LSTM adds reverse information transfer, i.e., influence from the future to the present, on top of conventional LSTM; it was introduced to handle cases where the current output of a sequence depends on both past and future data.
Dropout network: Dropout means that during the training of a deep learning network, neural network units are temporarily discarded from the network with a certain probability in order to prevent overfitting.
Pooling network: the pooling network compresses the input feature map, reducing its size and simplifying the network's computational complexity; at the same time it compresses the features and extracts the main ones.
Inception-Resnet-A network: the Inception-Resnet-A network refers specifically to the Inception-Resnet-A structure in Inception-Resnet. The input feature map is convolved through multiple branches with different kernel sizes, fused with the Resnet shortcut idea, to further extract features.
Stem network: the Stem network refers specifically to the Stem structure in Inception-Resnet; through multi-branch pooling and convolution operations on the input data, it extracts convolution features at different scales.
The features extracted by the Inception-Resnet-A network are reduced in dimension through multi-branch fused pooling and convolution kernels with different strides.
CTC: CTC stands for Connectionist Temporal Classification. It mainly solves the problem of aligning input sequences with output sequences.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (3)

1. A power grid dispatching voice recognition method based on a spectrogram is characterized in that: the method comprises the following steps:
firstly, performing spectrum analysis on the power grid scheduling voice to be recognized, and converting continuous power grid scheduling voice to be recognized into a spectrogram to represent;
step two, utilizing an Inception convolution structure from an image recognition algorithm to extract convolution features from the input power grid dispatching voice spectrogram image, and converting the extraction result into a convolution feature matrix;
step three, through a recurrent network, continuing to extract sequence features on the basis of the convolution features extracted by the convolutional network, and outputting the sequence features as a probability distribution matrix;
step four, for the output probability distribution matrix predicted by the recurrent network, aligning input and output by using CTC, and mapping them into characters through the dictionary num2word;
and fifthly, verifying and outputting the character result recognized by the power grid dispatching voice.
2. The power grid dispatching voice recognition method based on the spectrogram according to claim 1, wherein: the recurrent network is a deep bidirectional LSTM network.
3. The spectrogram-based power grid dispatching voice recognition device according to claim 1, wherein: the system comprises the following modules:
the speech spectrogram acquiring module is used for performing spectral analysis on the speech to be recognized to obtain a speech spectrogram of the speech to be recognized;
the feature extraction module is used for extracting features of the spectrogram by utilizing an Inception convolution structure from an image recognition algorithm to obtain the voice features of the voice to be recognized, and fusing the spectrogram features with an image feature extraction network to obtain a convolution feature matrix of the spectrogram;
the sequence extraction module is used for refining the sequence features with a trained bidirectional-LSTM-based encoder-decoder to obtain an output probability distribution matrix;
the alignment module is used for performing input-output alignment mapping on the output probability distribution matrix and converting the result into characters through the dictionary num2word;
and the output module is used for outputting the voice recognition result.
CN201911004454.5A 2019-10-22 2019-10-22 Power grid dispatching voice recognition method and device based on spectrogram Pending CN110992941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911004454.5A CN110992941A (en) 2019-10-22 2019-10-22 Power grid dispatching voice recognition method and device based on spectrogram


Publications (1)

Publication Number Publication Date
CN110992941A true CN110992941A (en) 2020-04-10

Family

ID=70082292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911004454.5A Pending CN110992941A (en) 2019-10-22 2019-10-22 Power grid dispatching voice recognition method and device based on spectrogram

Country Status (1)

Country Link
CN (1) CN110992941A (en)


Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103985382A (en) * 2014-05-23 2014-08-13 国家电网公司 Power grid dispatching auxiliary interactive method based on speech recognition technology
CN106710589A (en) * 2016-12-28 2017-05-24 百度在线网络技术(北京)有限公司 Artificial intelligence-based speech feature extraction method and device
CN106910497A (en) * 2015-12-22 2017-06-30 阿里巴巴集团控股有限公司 A kind of Chinese word pronunciation Forecasting Methodology and device
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 The automatic describing method of night vision image based on depth convolution loop neutral net
CN107871496A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
US20180174576A1 (en) * 2016-12-21 2018-06-21 Google Llc Acoustic-to-word neural network speech recognizer
US20180261213A1 (en) * 2017-03-13 2018-09-13 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109147775A (en) * 2018-10-18 2019-01-04 深圳供电局有限公司 Voice recognition method and device based on neural network
CN109272993A (en) * 2018-08-21 2019-01-25 中国平安人寿保险股份有限公司 Recognition methods, device, computer equipment and the storage medium of voice class
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks
CN109448707A (en) * 2018-12-18 2019-03-08 北京嘉楠捷思信息技术有限公司 Voice recognition method and device, equipment and medium
CN109559737A (en) * 2018-12-13 2019-04-02 朱明增 Electric power system dispatching speech model method for building up
CN110232533A (en) * 2019-07-10 2019-09-13 国网江苏省电力有限公司无锡供电分公司 A kind of power grid job order scheduling system and method
CN110265034A (en) * 2019-04-12 2019-09-20 国网浙江省电力有限公司衢州供电公司 A kind of power grid regulation auto-answer method


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401530A (en) * 2020-04-22 2020-07-10 上海依图网络科技有限公司 Recurrent neural network and training method thereof
CN111401530B (en) * 2020-04-22 2021-04-09 上海依图网络科技有限公司 Training method for neural network of voice recognition device
WO2021212684A1 (en) * 2020-04-22 2021-10-28 上海依图网络科技有限公司 Recurrent neural network and training method therefor
CN112349288A (en) * 2020-09-18 2021-02-09 昆明理工大学 Chinese speech recognition method based on pinyin constraint joint learning
CN112163514A (en) * 2020-09-26 2021-01-01 上海大学 Method and device for identifying traditional Chinese characters and readable storage medium
CN113470620A (en) * 2021-07-06 2021-10-01 青岛洞听智能科技有限公司 Speech recognition method
CN113688210A (en) * 2021-09-06 2021-11-23 北京科东电力控制***有限责任公司 Power grid dispatching intention identification method
CN113688210B (en) * 2021-09-06 2024-02-09 北京科东电力控制***有限责任公司 Power grid dispatching intention recognition method

Similar Documents

Publication Publication Date Title
CN110992941A (en) Power grid dispatching voice recognition method and device based on spectrogram
CN104143327B (en) A kind of acoustic training model method and apparatus
CN111199727B (en) Speech recognition model training method, system, mobile terminal and storage medium
WO2021051544A1 (en) Voice recognition method and device
CN106227721B (en) Chinese Prosodic Hierarchy forecasting system
CN113901799B (en) Model training method, text prediction method, model training device, text prediction device, electronic equipment and medium
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN111368514B (en) Model training and ancient poem generating method, ancient poem generating device, equipment and medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN109801630A (en) Digital conversion method, device, computer equipment and the storage medium of speech recognition
Sartakhti et al. Persian language model based on BiLSTM model on COVID-19 corpus
Zhu et al. Robust data2vec: Noise-robust speech representation learning for asr by combining regression and improved contrastive learning
CN112259080A (en) Speech recognition method based on neural network model
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
CN111563161A (en) Sentence recognition method, sentence recognition device and intelligent equipment
CN111090726A (en) NLP-based electric power industry character customer service interaction method
CN114548116A (en) Chinese text error detection method and system based on language sequence and semantic joint analysis
CN114333759A (en) Model training method, speech synthesis method, apparatus and computer program product
CN117094383B (en) Joint training method, system, equipment and storage medium for language model
CN113254646A (en) News information classification method and device
CN116705073A (en) Voice emotion recognition method based on bimodal and attentive mechanism
CN109977372A (en) The construction method of Chinese chapter tree
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN113221546B (en) Mobile phone banking information data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200410