CN111161724A - Method, system, equipment and medium for Chinese audio-visual combined speech recognition - Google Patents
- Publication number
- CN111161724A CN111161724A CN201911297060.3A CN201911297060A CN111161724A CN 111161724 A CN111161724 A CN 111161724A CN 201911297060 A CN201911297060 A CN 201911297060A CN 111161724 A CN111161724 A CN 111161724A
- Authority
- CN
- China
- Prior art keywords
- character sequence
- sequence
- recognition model
- audio
- chinese
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/063—Training (creation of reference templates; adaptation to the characteristics of the speaker's voice)
- G10L15/24—Speech recognition using non-acoustical features
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L25/57—Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
- G10L2015/226—Procedures used during a speech recognition process using non-speech characteristics
Abstract
The invention provides a method, system, device, and medium for Chinese audio-visual combined speech recognition, wherein the method comprises the following steps: receiving a video signal and an audio signal to be recognized; inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence it outputs; and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence it outputs. The invention provides an end-to-end audio-visual combined speech recognition scheme for sentence-level Chinese that combines deep neural networks with an attention mechanism, fully mines and fuses the features of the audio and video signals, and thereby helps improve the recognition capability of a speech recognition system.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, system, device, and medium for Chinese audio-visual combined speech recognition.
Background
Speech recognition technology allows a machine to convert speech signals into corresponding text or commands through a process of recognition and understanding. Over the last two decades speech recognition has made remarkable progress and has begun to move from the laboratory to the market; in recent years in particular, driven by advances in artificial intelligence, the field has achieved major breakthroughs. Speech recognition is now widely applied in vehicle-mounted systems, social chat, smart homes, and other fields, bringing considerable convenience to daily life and demonstrating strong practical value.
In general, the input to speech recognition is only an audio signal. If the recognition process can use audio and video signals simultaneously, the two signals complement each other, the input information becomes richer, and recognition accuracy improves.
For audio-visual combined speech recognition that uses audio and video signals simultaneously, there is currently almost no solution designed specifically for Chinese: most existing solutions target English or do not specify a language. Chinese, however, has its own particularities. For example, Chinese words have no strict morphological inflection, and the number of commonly used Chinese characters is large (about 3,500). These characteristics make audio-visual Chinese speech recognition challenging, so English-oriented or language-agnostic audio-visual solutions cannot be applied directly to Chinese. In addition, some existing audio-visual schemes only address word-level recognition and cannot handle sentence-level recognition of continuous speech; others rely on traditional machine learning methods that require manual feature extraction, making the process complex and limiting the final recognition performance.
Disclosure of Invention
To address the problems in the prior art, the invention aims to provide a deep-learning-based method, system, device, and medium for Chinese audio-visual combined speech recognition, offering an end-to-end audio-visual combined speech recognition scheme for sentence-level Chinese.
The embodiment of the invention provides a method for recognizing Chinese audio-visual combined voice, which comprises the following steps:
respectively receiving a video signal and an audio signal to be identified;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
Optionally, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
Optionally, the video encoder extracts a feature sequence of the video signal, including the steps of:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
Optionally, the audio encoder extracting a feature sequence of the audio signal includes:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel frequency cepstrum coefficient value into a second recurrent neural network, and extracting the characteristic sequence of the audio signal.
Optionally, the second recurrent neural network comprises three long-short term memory layers;
the fusing of the characteristic sequence of the video signal with the characteristic sequence of the audio signal through the attention mechanism comprises: in the top long short-term memory layer of the second recurrent neural network, the audio encoder fuses the state data of that top layer with the characteristic sequence of the video signal based on the attention mechanism.
Optionally, the first decoder outputs a pinyin character sequence according to the fused feature sequence, and includes the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
Optionally, the Chinese character sequence recognition model includes an encoder and a second decoder;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model, and the method comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
inputting the characteristics of the pinyin character sequence into a fifth recurrent neural network by the second decoder, wherein the fifth recurrent neural network comprises a characteristic extraction layer and a classification layer;
and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
Optionally, the fourth recurrent neural network and the fifth recurrent neural network each include two gated recurrent unit layers.
Optionally, the method further includes training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition system, which applies the above Chinese audio-visual combined speech recognition method and comprises:
the signal receiving module is used for respectively receiving a video signal and an audio signal to be identified;
the pinyin identification module is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
and the Chinese character recognition module is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The embodiment of the invention also provides a device for recognizing Chinese audio-visual combined speech, which comprises:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method via execution of the executable instructions.
The embodiment of the invention also provides a computer readable storage medium for storing a program, and the program realizes the steps of the Chinese audio-visual combined speech recognition method when being executed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The Chinese audio-visual combined voice recognition method, the system, the equipment and the medium provided by the invention have the following advantages:
the invention solves the problems in the prior art, provides a scheme of audio-visual combined speech recognition aiming at Chinese sentence level based on an end-to-end mode, and fills the blank of the technical field; furthermore, the machine learning model combining audio and video with voice recognition is constructed by utilizing the deep neural network, the whole process does not need to manually extract the characteristics, and compared with the traditional machine learning mode, the method can well extract the characteristics of audio and video signals and is beneficial to improving the recognition capability of a voice recognition system; furthermore, the invention combines the deep neural network and the attention mechanism, and fully excavates and fuses the characteristics of the audio signal and the video signal, so that the recognition effect is greatly improved compared with a multi-mode characteristic splicing mode.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a Chinese audio-visual combined speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of recognizing sentences from audio and video signals according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the pinyin character sequence recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the Chinese character sequence recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese audio-visual combined speech recognition system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Chinese audio-visual combined speech recognition device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, in order to solve the problems of the prior art, the present invention provides a Chinese audio-visual combined speech recognition method, which comprises the following steps:
S100: respectively receiving a video signal and an audio signal to be identified;
S200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
S300: and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The invention provides an end-to-end audio-visual combined speech recognition method: given a video signal and an audio signal as input, it directly produces the corresponding Chinese character sequence, i.e., a sentence containing the spoken content. Moreover, the method is designed specifically for Chinese: a pinyin character sequence is first obtained from the video and audio signals in step S200, and Chinese character recognition is then performed in step S300, so the audio and video signals are fully mined and fused and the recognition performance is greatly improved.
As shown in fig. 2, the inputs of the pinyin character sequence recognition model are the audio signal of a person speaking and the corresponding video signal, i.e., the sequence of picture frames of the speaker's lip motion. The pinyin character sequence recognition model fuses the input audio and video signals and finally outputs a pinyin character sequence; this process can be expressed by the following formula:
p = av2p(a, v)  (1)

where a = (a1, a2, …, an) represents the audio signal sequence, v = (v1, v2, …, vm) represents the picture-frame sequence of the lip motion, p = (p1, p2, …, pk) represents the recognized pinyin character sequence, and av2p(·) represents the pinyin character sequence recognition model that converts the audio and video signals into a pinyin character sequence.
The input of the Chinese character sequence recognition model is the pinyin character sequence output by the pinyin character sequence recognition model; by learning the features of the pinyin character sequence, it outputs a Chinese character sequence representing the spoken content. This process can be expressed by the following formula:
s = p2s(p)  (2)

where p = (p1, p2, …, pk) represents the input pinyin character sequence, s = (c1, c2, …, cl) represents the recognized Chinese character sequence, i.e., the sentence, ci represents the i-th Chinese character in the sentence s, and p2s(·) represents the Chinese character sequence recognition model that converts the pinyin character sequence into a Chinese character sequence.
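The two-stage structure of formulas (1) and (2) can be made concrete with a minimal, runnable sketch. The functions av2p and p2s below are toy stand-ins for the two neural models (a hard-coded token list and a hypothetical pinyin-to-character lookup table), shown only to illustrate the data flow, not the actual recognition logic.

```python
# Toy sketch of the two-stage pipeline p = av2p(a, v), s = p2s(p).
# The real models are neural networks; dictionary lookups stand in for
# them here purely to illustrate the data flow (mappings are made up).

def av2p(audio_frames, video_frames):
    """Stand-in pinyin recognizer: returns a fixed pinyin token sequence."""
    assert len(audio_frames) > 0 and len(video_frames) > 0
    return ["ni3", "hao3"]  # would come from the Seq2Seq model in practice

def p2s(pinyin_seq):
    """Stand-in pinyin-to-hanzi converter (hypothetical lookup table)."""
    table = {"ni3": "你", "hao3": "好"}
    return "".join(table[tok] for tok in pinyin_seq)

a = [0.1, 0.2, 0.3]        # audio signal sequence a = (a1, ..., an)
v = ["frame1", "frame2"]   # lip-motion picture frames v = (v1, ..., vm)
p = av2p(a, v)             # pinyin character sequence
s = p2s(p)                 # Chinese character sequence (the sentence)
print(p, s)                # ['ni3', 'hao3'] 你好
```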
The following describes the working process of the pinyin character sequence recognition model and the Chinese character sequence recognition model in a specific embodiment with reference to fig. 3 and 4.
As shown in fig. 3, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder. Specifically, in this embodiment, the pinyin character sequence recognition model may be a sequence-to-sequence (Seq2Seq) model based on the attention mechanism.
The step S200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
S210: the video encoder extracts a characteristic sequence of the video signal;
S220: the audio encoder extracts a characteristic sequence of the audio signal;
S230: the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
S240: and the first decoder outputs a pinyin character sequence according to the fused characteristic sequence.
The step S210: the video encoder extracting the characteristic sequence of the video signal comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image features of each frame output by the convolutional neural network; the convolutional neural network may be a residual network (ResNet), whose input may be a three-channel RGB image. Convolutional neural networks (CNNs) are a class of feed-forward neural networks with a deep structure that perform convolution computations; residual networks are easy to optimize and can gain accuracy from considerably increased depth. Their residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks;
inputting the image features into a first recurrent neural network, extracting the temporal features across the image frame sequence, and taking the output sequence of the first recurrent neural network as the feature sequence of the video signal. In this embodiment, the first recurrent neural network may be a three-layer Long Short-Term Memory (LSTM) network that extracts the temporal features across the image frames; the output sequence of the top LSTM layer is taken as the feature sequence of the video signal output by the video encoder. A recurrent neural network (RNN) takes sequence data as input and recurses along the direction of the sequence, with all recurrent units connected in a chain; the LSTM is a special kind of RNN capable of learning long-term dependencies.
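As a rough illustration of this encoder structure, the sketch below substitutes a shared linear map for the ResNet and a single vanilla tanh RNN layer for the three-layer LSTM. All dimensions (8 frames, 32-dimensional frame features, 16-dimensional states) are arbitrary illustrative choices, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 frames, 32-dim flattened frames, 16-dim features/state.
T, D_img, D_feat, D_h = 8, 32, 16, 16
frames = rng.normal(size=(T, D_img))        # image-frame sequence of the video signal

# Stand-in for the CNN (a ResNet in the embodiment): one shared linear map per frame.
W_cnn = rng.normal(size=(D_img, D_feat)) * 0.1
img_feats = np.maximum(frames @ W_cnn, 0.0) # per-frame image features (ReLU)

# Stand-in for the first recurrent network (a three-layer LSTM in the embodiment):
# a single tanh RNN layer extracting temporal features across the frame sequence.
W_xh = rng.normal(size=(D_feat, D_h)) * 0.1
W_hh = rng.normal(size=(D_h, D_h)) * 0.1
h = np.zeros(D_h)
states = []
for t in range(T):
    h = np.tanh(img_feats[t] @ W_xh + h @ W_hh)  # state depends on all frames so far
    states.append(h)
video_feature_seq = np.stack(states)        # feature sequence of the video signal
print(video_feature_seq.shape)              # (8, 16)
```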
The step S220: the audio encoder extracts the characteristic sequence of the audio signal, and comprises the following steps:
calculating the Mel-Frequency Cepstral Coefficient (MFCC) values of the audio signal; Mel-frequency cepstral coefficients are features widely used in automatic speech and speaker recognition;
inputting the Mel-frequency cepstral coefficient values into a second recurrent neural network and extracting the feature sequence of the audio signal; in this embodiment, the second recurrent neural network may also be a three-layer long short-term memory network, and the state data output by its top layer is used as the feature sequence of the audio signal.
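For reference, a simplified textbook-style MFCC computation (framing, Hamming window, power spectrum, mel filterbank, log, DCT-II) might look as follows. The frame sizes and filter counts are conventional assumptions rather than values from the patent, and pre-emphasis, liftering, and delta features are omitted for brevity.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Simplified MFCC sketch, not a production feature extractor."""
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, evenly spaced on the mel scale.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feats = mfcc(sig)
print(feats.shape)  # (number of frames, 13)
```

In practice a library implementation would normally be used; this sketch only shows where the MFCC feature sequence fed to the second recurrent neural network comes from.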
The step S230: the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism, and the method comprises the following steps:
the audio encoder fuses state data of the top long short-term memory layer and the feature sequence of the video signal based on an attention mechanism in the top long short-term memory layer of the second recurrent neural network. Specifically, the fusion employs the following equations (3) and (4):
aij = score(valuej, queryi)  (3)

Ci = Σj aij · valuej  (4)

where queryi represents the state data of the top LSTM layer of the audio encoder at step i, valuej represents the j-th output of the top LSTM layer of the video encoder, and, after fusion through the attention mechanism, the output of the top LSTM layer of the audio encoder is the fused feature of the audio and video signals.
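Equations (3) and (4) can be sketched in a few lines of code. Since the patent does not specify the score(·) function, the example below assumes a dot-product score followed by softmax normalization, and the feature dimensions are made up for illustration.

```python
import numpy as np

def attention_fuse(query, values):
    """Fuse one audio-encoder state (query_i) with the video feature
    sequence (value_1..value_m) per equations (3) and (4), assuming a
    dot-product score function (the patent leaves score() unspecified)."""
    scores = values @ query                 # (3) a_ij = score(value_j, query_i)
    a = np.exp(scores - scores.max())
    a /= a.sum()                            # softmax-normalized attention weights
    return a @ values                       # (4) C_i = sum_j a_ij * value_j

rng = np.random.default_rng(1)
video_feats = rng.normal(size=(5, 8))   # m=5 video feature vectors, dim 8 (made up)
audio_state = rng.normal(size=8)        # top-LSTM state of the audio encoder at step i
context = attention_fuse(audio_state, video_feats)
print(context.shape)  # (8,)
```

Because the weights are non-negative and sum to one, the fused vector is a convex combination of the video features, weighted by their relevance to the current audio state.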
The step S240: the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence, and the method comprises the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, where the third recurrent neural network comprises a feature extraction layer and a classification layer. In this embodiment, the feature extraction layer may be a single long short-term memory layer using a four-head attention mechanism to improve performance, and the classification layer may be a softmax layer that outputs the predicted pinyin character sequence p = (p1, p2, …, pk).
And obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
When the pinyin character sequence recognition model is trained, the input of the first decoder is g = (g1, g2, …, gk), where gi is the ground-truth sample label; after processing by the LSTM layer and the softmax layer, the predicted pinyin character sequence p = (p1, p2, …, pk) is output. In the test phase, gi instead represents the network's own output pi at time i.
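This decoder input scheme is commonly known as teacher forcing. The toy sketch below uses a hard-coded next-token table as a stand-in for the decoder, purely to contrast the two input modes: ground-truth labels g_i during training versus the model's own previous output p_i at test time.

```python
# Illustration of teacher forcing vs. free-running decoding. The
# "decoder" is a made-up next-token lookup, not a neural network.

NEXT = {"<s>": "ni3", "ni3": "hao3", "hao3": "</s>"}  # hypothetical model

def decode_step(prev_token):
    return NEXT.get(prev_token, "</s>")

def run_decoder(ground_truth=None, max_len=4):
    outputs, inp = [], "<s>"
    for i in range(max_len):
        out = decode_step(inp)
        outputs.append(out)
        if out == "</s>":
            break
        # training: next input is the true label g_i;
        # test: next input is the model's own output p_i.
        inp = ground_truth[i] if ground_truth is not None else out
    return outputs

train_out = run_decoder(ground_truth=["ni3", "hao3", "</s>"])  # teacher forcing
test_out = run_decoder()                                        # free-running
print(train_out, test_out)
```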
As shown in fig. 4, in this embodiment, the Chinese character sequence recognition model includes an encoder and a second decoder. Specifically, in this embodiment, the Chinese character sequence recognition model is also implemented based on the Seq2Seq framework. The step S300: inputting the pinyin character sequence into the trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model comprises the following steps:
the encoder combines the pinyinInputting the character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence; the input of the encoder is a pinyin character sequence p ═ p (p)1,p2,…,pk),piIn this embodiment, the fourth Recurrent neural network includes two layers of Gated Recurrent Unit (GRU) networks, and the output sequence is O ═ O (O ═ g ═ O1,O2,…,Ok) Output at time iiFor parameterizing the input p at the next momenti+1The predicted distribution of (2); the gated cyclic unit is a commonly used gated cyclic neural network, and the gated cyclic neural network is proposed to better capture the dependence relationship with larger time step distance in a time sequence.
the second decoder inputs the features of the pinyin character sequence into a fifth recurrent neural network, where the fifth recurrent neural network comprises a feature extraction layer and a classification layer; in this embodiment, the fifth recurrent neural network may comprise two gated recurrent unit layers, and the classification layer may be a softmax layer;
the Chinese character sequence s = (c1, c2, …, cl) is obtained from the output of the classification layer of the fifth recurrent neural network.
In the training stage of the Chinese character sequence recognition model, the input of the second decoder is the ground-truth Chinese character sequence y = (y1, y2, …, yl), where yi is the real sample label input into the fifth recurrent neural network for computation.
The Chinese audio-visual combined speech recognition system constructed by the invention comprises two models, the pinyin character sequence recognition model and the Chinese character sequence recognition model, forming a multi-model structure; obtaining the final recognition system therefore requires solving the multi-model training problem. In this embodiment, the Chinese audio-visual combined speech recognition method further comprises training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model until the Chinese character sequence recognition model is converged;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model until the pinyin character sequence recognition model is converged;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model to converge, thus obtaining the whole Chinese audio-visual combined speech recognition system.
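The three-stage training strategy above can be sketched by toggling `requires_grad` on each sub-model's parameters (assuming PyTorch; the `Linear` stand-ins and the optimizer settings are illustrative placeholders, not the patent's actual models):

```python
import torch

def set_trainable(module, flag):
    # Freeze or unfreeze all parameters of one sub-model.
    for p in module.parameters():
        p.requires_grad = flag

# Hypothetical stand-ins for the two recognition models.
pinyin_model = torch.nn.Linear(8, 4)  # stands in for the pinyin model
hanzi_model = torch.nn.Linear(4, 3)   # stands in for the hanzi model

# Stage 1: fix the pinyin model, train only the hanzi model to convergence.
set_trainable(pinyin_model, False)
set_trainable(hanzi_model, True)

# Stage 2: fix the hanzi model, train only the pinyin model to convergence.
set_trainable(pinyin_model, True)
set_trainable(hanzi_model, False)

# Stage 3: fine-tune the whole network jointly until convergence.
set_trainable(pinyin_model, True)
set_trainable(hanzi_model, True)
all_params = list(pinyin_model.parameters()) + list(hanzi_model.parameters())
optimizer = torch.optim.Adam(all_params, lr=1e-4)
print(all(p.requires_grad for p in all_params))  # True
```

Each stage would run its own optimization loop over the training data; only the parameter-freezing pattern is shown here.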
As shown in fig. 5, an embodiment of the present invention further provides a Chinese audio-visual combined speech recognition system, which applies the above Chinese audio-visual combined speech recognition method and includes:
a signal receiving module M100, configured to receive a video signal and an audio signal to be recognized, respectively;

a pinyin recognition module M200, configured to input the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the pinyin character sequence recognition model;

and a Chinese character recognition module M300, configured to input the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the Chinese character sequence recognition model.
The invention provides an end-to-end audio-visual combined speech recognition system: given input video and audio signals, the method of the invention directly yields the corresponding Chinese character sequence. Moreover, the invention is designed specifically for Chinese character recognition: a pinyin character sequence is first obtained from the video and audio signals by the pinyin recognition module M200, and Chinese character recognition is then performed by the Chinese character recognition module M300, so that the audio and video signals are fully mined and fused and the recognition effect is greatly improved.
The pinyin character sequence recognition model of the invention may have the structure shown in fig. 3, and the Chinese character sequence recognition model may have the structure shown in fig. 4. The pinyin recognition module M200 may obtain the pinyin character sequence using the specific implementation of steps S210 to S240 described above, but the invention is not limited thereto. The Chinese character recognition module M300 may obtain the Chinese character sequence using the specific implementation of step S300 described above, but the invention is not limited thereto.
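The two-stage module pipeline — signals into the pinyin recognition module, its output into the Chinese character recognition module — reduces to a simple composition. The stand-in models below are placeholders for illustration only, not the patent's trained networks:

```python
def recognize(video_signal, audio_signal, pinyin_model, hanzi_model):
    """Two-stage inference: (video, audio) -> pinyin sequence -> hanzi sequence."""
    pinyin_seq = pinyin_model(video_signal, audio_signal)  # module M200
    return hanzi_model(pinyin_seq)                         # module M300

# Hypothetical stand-ins: real models would consume actual signal tensors.
fake_pinyin_model = lambda v, a: ["ni3", "hao3"]
fake_hanzi_model = lambda p: "你好"
result = recognize(None, None, fake_pinyin_model, fake_hanzi_model)
print(result)  # 你好
```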
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition device, comprising a processor and a memory storing executable instructions of the processor, wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method by executing the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "platform."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, etc.
Wherein the storage unit 620 stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the above Chinese audio-visual combined speech recognition method section of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the invention also provides a computer-readable storage medium storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the steps according to various exemplary embodiments of the present invention described in the above Chinese audio-visual combined speech recognition method section of this specification.
Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, compared with the prior art, the method, system, device and medium for recognizing Chinese audio-visual combined speech provided by the invention have the following advantages:
the invention solves the problems in the prior art by providing an end-to-end scheme for sentence-level Chinese audio-visual combined speech recognition, filling a blank in the technical field. Furthermore, the audio-visual combined speech recognition machine learning model is constructed using deep neural networks, so the whole process requires no manual feature extraction; compared with traditional machine learning approaches, this extracts features of the audio and video signals well and helps improve the recognition capability of the speech recognition system. Furthermore, the invention combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio signal and the video signal, so that the recognition effect is greatly improved compared with a multi-modal feature concatenation approach.
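As a minimal illustration of attention-based fusion, as opposed to plain feature concatenation, the sketch below lets one audio state vector attend over a video feature sequence; the shapes and the NumPy formulation are assumptions for illustration, not the patent's exact computation:

```python
import numpy as np

def attention_fuse(audio_state, video_feats):
    """Sketch of attention fusion: an audio state at one time step attends
    over the video feature sequence, and the resulting context vector is
    combined with the audio state (shapes are illustrative)."""
    scores = video_feats @ audio_state             # one score per video frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax attention weights
    context = weights @ video_feats                # weighted video context
    return np.concatenate([audio_state, context])  # fused feature vector

rng = np.random.default_rng(0)
audio_state = rng.standard_normal(64)        # one audio state vector
video_feats = rng.standard_normal((20, 64))  # 20 frames of video features
fused = attention_fuse(audio_state, video_feats)
print(fused.shape)  # (128,)
```

Unlike fixed concatenation of whole feature sequences, the attention weights here are recomputed per time step, which is what lets the audio stream selectively draw on the most relevant video frames.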
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of these shall be considered as falling within the protection scope of the invention.
Claims (12)
1. A Chinese audio-visual combined speech recognition method is characterized by comprising the following steps:
receiving a video signal and an audio signal to be recognized, respectively;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
2. The method of claim 1, wherein the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
3. The method of claim 2, wherein the video encoder extracts the feature sequence of the video signal, and comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
4. The method of claim 2, wherein the audio encoder extracts the feature sequence of the audio signal, and comprises the following steps:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel frequency cepstrum coefficient value into a second recurrent neural network, and extracting the characteristic sequence of the audio signal.
5. The method of claim 4, wherein the second recurrent neural network comprises three layers of long-short term memory;
the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism by fusing, based on the attention mechanism and in the top long-short term memory layer of the second recurrent neural network, the state data of the top long-short term memory layer with the characteristic sequence of the video signal.
6. The method of claim 2, wherein the first decoder outputs a pinyin character sequence according to the fused feature sequence, comprising the steps of:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
7. The method of claim 1, wherein the Chinese character sequence recognition model comprises an encoder and a second decoder;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model, and the method comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
the second decoder inputs the characteristics of the pinyin character sequence into a fifth recurrent neural network, wherein the fifth recurrent neural network comprises a characteristic extraction layer and a classification layer;

and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
8. The method of claim 7, wherein the fourth recurrent neural network and the fifth recurrent neural network respectively comprise two gated recurrent unit layers.
9. The method of claim 1, further comprising training the pinyin character sequence recognition model and the hanzi sequence recognition model by:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
10. A Chinese audio-visual combined speech recognition system, which applies the Chinese audio-visual combined speech recognition method according to any one of claims 1 to 9, the system comprising:
the signal receiving module is used for respectively receiving a video signal and an audio signal to be identified;
the pinyin identification module is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
and the Chinese character recognition module is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
11. A Chinese audio-visual combined speech recognition apparatus, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audiovisual combined speech recognition method of any of claims 1-9 via execution of the executable instructions.
12. A computer-readable storage medium storing a program, wherein the program, when executed, implements the steps of the Chinese audio-visual combined speech recognition method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911297060.3A CN111161724B (en) | 2019-12-16 | 2019-12-16 | Method, system, equipment and medium for Chinese audio-visual combined speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111161724A true CN111161724A (en) | 2020-05-15 |
CN111161724B CN111161724B (en) | 2022-12-13 |
Family
ID=70557201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911297060.3A Active CN111161724B (en) | 2019-12-16 | 2019-12-16 | Method, system, equipment and medium for Chinese audio-visual combined speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111161724B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102830A (en) * | 2020-09-14 | 2020-12-18 | 广东工业大学 | Coarse granularity instruction identification method and device |
CN112786052A (en) * | 2020-12-30 | 2021-05-11 | 科大讯飞股份有限公司 | Speech recognition method, electronic device and storage device |
CN113033538A (en) * | 2021-03-25 | 2021-06-25 | 北京搜狗科技发展有限公司 | Formula identification method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040117191A1 (en) * | 2002-09-12 | 2004-06-17 | Nambi Seshadri | Correlating video images of lip movements with audio signals to improve speech recognition |
CN101825953A (en) * | 2010-04-06 | 2010-09-08 | 朱建政 | Chinese character input product with combined voice input and Chinese phonetic alphabet input functions |
CN102347026A (en) * | 2011-07-04 | 2012-02-08 | 深圳市子栋科技有限公司 | Audio/video on demand method and system based on natural voice recognition |
CN108073875A (en) * | 2016-11-14 | 2018-05-25 | 广东技术师范学院 | A kind of band noisy speech identifying system and method based on monocular cam |
CN109410918A (en) * | 2018-10-15 | 2019-03-01 | 百度在线网络技术(北京)有限公司 | For obtaining the method and device of information |
Non-Patent Citations (1)
Title |
---|
XIE Lei et al.: "An audio-video continuous speech recognition *** based on data sieving", Computer Applications (《计算机应用》) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102830A (en) * | 2020-09-14 | 2020-12-18 | 广东工业大学 | Coarse granularity instruction identification method and device |
CN112786052A (en) * | 2020-12-30 | 2021-05-11 | 科大讯飞股份有限公司 | Speech recognition method, electronic device and storage device |
CN112786052B (en) * | 2020-12-30 | 2024-05-31 | 科大讯飞股份有限公司 | Speech recognition method, electronic equipment and storage device |
CN113033538A (en) * | 2021-03-25 | 2021-06-25 | 北京搜狗科技发展有限公司 | Formula identification method and device |
CN113033538B (en) * | 2021-03-25 | 2024-05-10 | 北京搜狗科技发展有限公司 | Formula identification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111161724B (en) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10380996B2 (en) | Method and apparatus for correcting speech recognition result, device and computer-readable storage medium | |
CN108985358B (en) | Emotion recognition method, device, equipment and storage medium | |
US11741355B2 (en) | Training of student neural network with teacher neural networks | |
CN108170749B (en) | Dialog method, device and computer readable medium based on artificial intelligence | |
WO2021072875A1 (en) | Intelligent dialogue generation method, device, computer apparatus and computer storage medium | |
CN111402861B (en) | Voice recognition method, device, equipment and storage medium | |
US11610108B2 (en) | Training of student neural network with switched teacher neural networks | |
US20150325240A1 (en) | Method and system for speech input | |
CN114694076A (en) | Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion | |
CN110415679B (en) | Voice error correction method, device, equipment and storage medium | |
WO2022134894A1 (en) | Speech recognition method and apparatus, computer device, and storage medium | |
CN111161724B (en) | Method, system, equipment and medium for Chinese audio-visual combined speech recognition | |
KR20170022445A (en) | Apparatus and method for speech recognition based on unified model | |
CN114596844B (en) | Training method of acoustic model, voice recognition method and related equipment | |
CN110263218B (en) | Video description text generation method, device, equipment and medium | |
CN114676234A (en) | Model training method and related equipment | |
CN109726397B (en) | Labeling method and device for Chinese named entities, storage medium and electronic equipment | |
CN110991175B (en) | Method, system, equipment and storage medium for generating text in multi-mode | |
CN104882141A (en) | Serial port voice control projection system based on time delay neural network and hidden Markov model | |
CN115983294B (en) | Translation model training method, translation method and translation equipment | |
JP7178394B2 (en) | Methods, apparatus, apparatus, and media for processing audio signals | |
WO2023093295A1 (en) | Artificial intelligence-based audio processing method and apparatus, electronic device, computer program product, and computer-readable storage medium | |
CN110377778A (en) | Figure sort method, device and electronic equipment based on title figure correlation | |
WO2023082931A1 (en) | Method for punctuation recovery in speech recognition, and device and storage medium | |
CN110647613A (en) | Courseware construction method, courseware construction device, courseware construction server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
PP01 | Preservation of patent right | Effective date of registration: 20230131; Granted publication date: 20221213 |
PD01 | Discharge of preservation of patent | Date of cancellation: 20240108; Granted publication date: 20221213 |
PP01 | Preservation of patent right | Effective date of registration: 20240227; Granted publication date: 20221213 |