CN116580278A - Lip language identification method, equipment and storage medium based on multi-attention mechanism - Google Patents

Lip language identification method, equipment and storage medium based on multi-attention mechanism

Info

Publication number
CN116580278A
CN116580278A (application CN202310562028.3A)
Authority
CN
China
Prior art keywords
lip
convolution
attention
layer
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310562028.3A
Other languages
Chinese (zh)
Inventor
张晖
杨胜
宝音都古楞
飞龙
巩政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN202310562028.3A priority Critical patent/CN116580278A/en
Publication of CN116580278A publication Critical patent/CN116580278A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language identification method, equipment and storage medium based on a multi-attention mechanism. The lip language identification method comprises the following steps: preprocessing a video data set to obtain continuous grayscale images of the speaker's mouth while performing data enhancement; performing preliminary feature extraction on the continuous lip images through a time-domain convolutional neural network, and performing deep feature extraction through a residual convolutional neural network based on a frequency-domain attention mechanism; encoding the extracted features with a convolution-enhanced Transformer encoder; performing mixed CTC/Attention decoding on the features; training the constructed model with the mixed CTC/Attention loss function; and further improving the output of the model with an RNN-based language model. The invention improves both the accuracy and the speed of lip language identification.

Description

Lip language identification method, equipment and storage medium based on multi-attention mechanism
Technical Field
The invention belongs to the technical field of image recognition, and relates to a lip language recognition method, equipment and a storage medium based on a multi-attention mechanism.
Background
Lip recognition is a technique for recognizing what is being said based on the continuous changes of the lip region. It plays an important role in many fields: for example, it can improve speech recognition in noisy environments, help communication between post-lingually deaf people and hearing people, and add speech content to silent surveillance footage.
In lip recognition, traditional machine learning models perform poorly. With the continuous development of deep learning, early researchers designed models that combine 3D convolution with a deeper ResNet (residual neural network) and an LSTM (long short-term memory network) that captures temporal features to perform lip recognition. Although this greatly improved on traditional machine learning, the results were still unsatisfactory.
Later, researchers designed a temporal convolutional network (Temporal Convolutional Network, abbreviated as TCN) for prediction and improved it with suitable training strategies; densely-connected temporal convolutional networks (Densely-Connected Temporal Convolutional Networks, abbreviated as DC-TCN) were then designed for prediction, further improving the lip recognition effect.
Current researchers have added a multi-head attention mechanism to improve the recognition of homonyms in lip recognition, and have designed encoders based on the Visual Transformer Pooling (VTP) structure, bringing the accuracy of lip recognition to a new level. However, the parameter count of such models is large, making them unsuitable for real-time lip recognition scenarios.
Disclosure of Invention
In order to solve the problems, the invention provides a lip language identification method based on a multi-attention mechanism, which improves the accuracy of lip language identification, improves the speed of lip language identification and solves the problems in the prior art.
A second object of the present invention is to provide an electronic device.
A third object of the present invention is to provide a computer storage medium.
The technical scheme adopted by the invention is that the lip language identification method based on the multi-attention mechanism comprises the following steps:
step 1, preprocessing a video data set to obtain continuous grayscale images of the speaker's mouth, and performing data enhancement processing at the same time;
step 2, performing preliminary feature extraction on continuous lip images through a time domain convolutional neural network, and then performing deep feature extraction through a residual convolutional neural network based on a frequency domain attention mechanism;
step 3, encoding the extracted features with a convolution-enhanced Transformer encoder;
step 4, performing mixed CTC/Attention decoding on the characteristics;
step 5, dividing the data obtained in the step 1 into training sets according to the proportion, and training the training data through the model constructed in the step 2-4 and the loss function of the mixed CTC/Attention;
step 6, further improving the output result of the model through the RNN-based language model.
Further, the step 1 includes:
step 1-1, separating image data of each frame in a video data set to obtain continuous image data corresponding to a video;
step 1-2, carrying out gray processing on continuous image data to eliminate the influence caused by lip color, carrying out face detection through a face recognition library, marking face key points, obtaining lip center point coordinates according to the coordinates of the lip key points, and cutting out lip images comprising all lips, part of chin and part of environment by taking the center point as an origin;
step 1-3, carrying out data enhancement processing on the lip images, carrying out random horizontal or vertical overturn on the lip images of each frame, and randomly covering partial areas;
and step 1-4, matching the lip continuous images obtained in the step 1-3 with corresponding text contents, and directly taking the corresponding text contents as labels of the continuous images.
Further, in step 2, a 2-dimensional time-domain convolutional neural network is adopted and feature extraction is performed through causal convolution and dilated convolution; the feature values of a lower causal-convolution layer come from several adjacent feature maps of the layer above, so as to extract temporal features; the elements of the two-dimensional dilated convolution kernel acting on the lip image are not adjacent but spaced apart, so as to extract global information to the greatest extent.
Further, in the step 2, a residual convolution neural network based on a frequency domain attention mechanism converts a multi-channel feature map output by each Block of the Res-Net into a corresponding frequency domain map, and a value at a certain position in the frequency domain map is selected, and the value is learned through a feedforward neural network to finally obtain the weight of the channel; the channel weight is multiplied by the feature map on the channel to play a role in attention.
Further, in step 3, the convolution-enhanced Transformer encoder includes an Embedding module and a set of Conformer models; the Embedding module comprises a convolution downsampling layer and a linear layer, wherein the convolution downsampling layer reduces the feature dimension and the linear layer then maps the features to D_k dimensions; the Conformer model is formed by sequentially stacking a feed-forward neural network module, a multi-head self-attention module, a convolution module and a feed-forward neural network module, wherein each module is followed by a layer normalization and then a random-inactivation (dropout) layer, a residual connection is used inside each module, and the residual data is the input data, so as to prevent gradient explosion and overfitting.
Further, the feed-forward neural network module is composed of a plurality of d-dimensional linear layers, with a Swish activation function layer and a Dropout layer between the linear layers; the Swish activation function is f(x) = x·σ(x),
where x is the input feature, σ(x) = 1/(1 + e^(−x)) is the sigmoid intermediate value, and e is the natural constant.
The multi-head self-attention module takes a query Q, a key K and a value V as input, where Q ∈ ℝ^(T×d_Q), K ∈ ℝ^(T×d_K), V ∈ ℝ^(T×d_V), T denotes the length of the feature sequence, and d_Q, d_K and d_V are the dimensions of the query, key and value, respectively; in the encoder Q = K = V, and W_i^Q, W_i^K and W_i^V denote the linear transformation weights of Q, K and V for the i-th attention head; the output matrix of the i-th self-attention head, f_i(Q′_i, K′_i, V′_i), is calculated as:
f_i(Q′_i, K′_i, V′_i) = softmax(Q′_i K′_i^T / √d_K) V′_i,
where Q′_i = QW_i^Q, K′_i = KW_i^K, V′_i = VW_i^V, and Q′_i, K′_i, V′_i denote the query vector, the key vector and the value vector, respectively;
the convolution module comprises a point-by-point convolution layer, a GLU activation function, a one-dimensional depth convolution layer, a Swish activation function layer, and finally a point-by-point convolution layer and a normalization layer.
Further, in step 4, the hybrid CTC/Attention decoding includes two decoders: one is a Transformer-based end-to-end decoder, which contains six layers of Transformer Decoder basic blocks and is trained with a cross-entropy loss; the other decoder relies on linear layers with connectionist temporal classification (CTC) for training and decoding, contains four linear layers with corresponding ReLU activation functions, outputs the CTC posterior probability of each input frame, and the whole stack is trained with the CTC loss.
Further, in step 4, the loss function of the mixed CTC/Attention is as follows:
L_CTC/Attention = α·p_CTC + (1−α)·p_CE,
where L_CTC/Attention denotes the mixed CTC/Attention loss; p_CTC denotes the probability of the label y given x obtained through the CTC decoder, i.e. the conditional probability; x is the output of the linear layers and y is the ground-truth label corresponding to the features; p_CE denotes the loss value based on the attention mechanism; α is the weight of the CTC loss and (1−α) is the weight of the attention-based loss.
An electronic device adopts the method to realize lip language identification.
A computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement the multi-attention mechanism based lip language identification method described above.
The beneficial effects of the invention are as follows:
according to the embodiment of the invention, through using 2-dimensional time domain convolution and adding a frequency domain attention mechanism and a multi-head attention mechanism, the lip language recognition effect is further improved, the error rate is reduced, and meanwhile, a lightweight residual convolution neural network (ResNet-18) and a small-scale convolution enhancement transducer model (small Conformer) are used, so that the overall parameter number of the model is reduced; the accuracy rate of lip language identification is improved, and meanwhile, the speed of lip language identification is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an overall flow chart of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a 2-dimensional time domain convolutional neural network in an embodiment of the present invention.
Fig. 3 is a schematic diagram of a frequency domain attention mechanism according to an embodiment of the present invention.
Fig. 4 is a configuration diagram of a Conformer in an embodiment of the present invention.
FIG. 5 is a schematic diagram of exemplary output feature sizes in accordance with an embodiment of the present invention.
FIG. 6 is a diagram of an overall model structure in an embodiment of the invention.
FIG. 7 is a diagram of an RNN-based language model according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
A lip language identification method based on a multi-attention mechanism, as shown in fig. 1, comprises three parts: data processing, model construction and result improvement. In terms of data processing, grayscale images of the mouth region are cropped out and a data enhancement method is used to prevent model overfitting. In terms of model construction, feature extraction and feature encoding are performed on the continuous lip images through a 2-dimensional time-domain convolutional neural network (2D Temporal Convolutional Network, abbreviated as 2D-TCN), a convolutional neural network with a frequency-domain attention mechanism, and a convolution-enhanced Transformer encoder (Convolution-augmented Transformer, abbreviated as Conformer) with a multi-head attention mechanism; the features are decoded by two decoders, and the values obtained from the corresponding loss functions are weighted. For result improvement, an RNN-based language model is used to improve the final result. To increase the recognition speed of the model, a lightweight residual convolutional neural network (ResNet-18) is used as the base model for the frequency-domain attention mechanism, and a small-scale convolution-enhanced Transformer model (small Conformer) is used.
The method specifically comprises the following steps:
Step 1, preprocessing the video data set to obtain continuous grayscale images of the speaker's mouth, while performing data enhancement to improve the generalization capability of the model.
In step 1-1, the image data of each frame in the video data set is separated using the FFmpeg (Fast Forward MPEG) program to obtain the continuous image data corresponding to each video.
Step 1-2, grayscale processing is performed on the continuous image data from step 1-1 to eliminate the influence of lip color; face detection is performed using Dlib, a cross-platform face recognition library developed in C++, and 68 facial key points are marked; the coordinates of the lip center point are obtained from the coordinates of the lip key points, and a lip image containing the entire lips, part of the chin and part of the surrounding environment is cropped with the center point as the origin, yielding lip images of size 120px × 120px (a code sketch of this cropping and of the augmentation in step 1-3 is given after step 1-4).
Step 1-3, data enhancement is performed on the lip images: each frame's lip image is randomly flipped horizontally or vertically, and a partial area (10%) is randomly covered, to prevent overfitting during training.
Step 1-4, matching the lip continuous images obtained in step 1-3 with corresponding text contents, wherein the corresponding text sentence contents are directly used as labels of the continuous images due to sentence-level lip recognition.
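For illustration only, the following Python sketch shows how steps 1-2 and 1-3 could be implemented with OpenCV and Dlib. The landmark-predictor file name, the use of mouth landmarks 48-67, and the occlusion-patch geometry are assumptions made for this example; the patent only specifies Dlib's 68-point landmarks, a 120px × 120px crop and roughly 10% random occlusion.

```python
import random
import cv2
import dlib
import numpy as np

# Hypothetical paths/parameters: the patent only states that Dlib's 68 facial
# landmarks are used and that 120px x 120px grayscale lip crops are produced.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
CROP = 120

def crop_lip(frame_bgr):
    """Grayscale a frame, locate the lip centre from the 68 landmarks
    (points 48-67 cover the mouth region) and cut a CROP x CROP patch."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)            # lip centre point
    half = CROP // 2
    patch = gray[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(patch, (CROP, CROP))

def augment(lip_img, occlude_ratio=0.10):
    """Random horizontal/vertical flip and random occlusion of ~10% of the area."""
    if random.random() < 0.5:
        lip_img = cv2.flip(lip_img, 1 if random.random() < 0.5 else 0)
    h, w = lip_img.shape
    ph, pw = int(h * occlude_ratio ** 0.5), int(w * occlude_ratio ** 0.5)
    y, x = random.randint(0, h - ph), random.randint(0, w - pw)
    lip_img = lip_img.copy()
    lip_img[y:y + ph, x:x + pw] = 0                     # cover a random partial area
    return lip_img
```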
Step 2, preliminary feature extraction is performed on the continuous lip images through a 2-dimensional time-domain convolutional neural network (2D Temporal Convolutional Network, 2D-TCN for short), and deep feature extraction is then performed with a residual convolutional neural network based on a frequency-domain attention mechanism.
The 2D-TCN performs feature extraction through causal convolution and dilated convolution. In the causal convolution, shown in fig. 2, the feature values of a lower layer come from several adjacent feature maps of the layer above, so that temporal features can be extracted; the elements of the two-dimensional dilated convolution kernel acting on the lip image are not adjacent but spaced apart, as shown in fig. 2, so that global information can be extracted to the greatest extent.
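A minimal PyTorch sketch of such a 2-D temporal block is given below, assuming the frame axis is treated as one convolution dimension so that causality is obtained by padding only the past side, while dilation provides the spaced-apart kernel; the channel counts, kernel sizes and flattened spatial axis are illustrative assumptions, not the patent's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Causal2DTemporalBlock(nn.Module):
    """Illustrative 2-D temporal convolution block: causal along the frame axis
    (dim 2) and dilated along a flattened spatial axis (dim 3)."""
    def __init__(self, in_ch, out_ch, k_t=3, k_s=3, dilation=2):
        super().__init__()
        self.k_t = k_t
        self.conv = nn.Conv2d(in_ch, out_ch,
                              kernel_size=(k_t, k_s),
                              dilation=(1, dilation),
                              padding=(0, dilation * (k_s - 1) // 2))
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, channels, T, spatial)
        # pad only on the "past" side of the time axis, so each output frame
        # depends on the current and preceding frames only (causal convolution)
        x = F.pad(x, (0, 0, self.k_t - 1, 0))
        return self.relu(self.conv(x))

# toy usage: 2 clips, 1 channel, 75 frames, 120 spatial columns
feats = Causal2DTemporalBlock(1, 64)(torch.randn(2, 1, 75, 120))
```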
As shown in fig. 3, the residual convolutional neural network with the frequency-domain attention mechanism converts the multi-channel feature map output by each Block of the ResNet into a corresponding frequency-domain map, selects the value at a certain position in the frequency-domain map, and learns this value through a feed-forward neural network to finally obtain the weight of the channel. As shown in fig. 3, the spatial-domain map is converted to a frequency-domain map using a 2-dimensional DCT (discrete cosine transform); the DCT formula is as follows:
f^2d_{h,w} = Σ_{i=0}^{H−1} Σ_{j=0}^{W−1} x^2d_{i,j} · cos(πh(i + 1/2)/H) · cos(πw(j + 1/2)/W),
s.t. h ∈ {0, 1, …, H−1}, w ∈ {0, 1, …, W−1},
where H and W denote the height and width of the image respectively, x^2d_{i,j} denotes the original image value at coordinates (i, j) in the 2-dimensional image, and f^2d_{h,w} is the 2-dimensional DCT frequency-domain value at position (h, w). The channel weight is multiplied by the feature map of that channel to play the role of attention, i.e., good feature maps are strengthened and poor feature maps are weakened. The role of the 2-dimensional time-domain convolution in the invention is to capture the continuously changing features of the lips; compared with 3-dimensional convolution, the causal convolution in the 2-dimensional time-domain convolution can capture longer sequence information, while the dilated convolution in it has a larger receptive field and can capture more features.
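The following PyTorch sketch illustrates this kind of frequency-domain channel attention under stated assumptions: a single DCT frequency position per attention module, a two-layer feed-forward network with a sigmoid output as the "learning" step, and illustrative channel/reduction sizes; the patent does not specify these choices.

```python
import math
import torch
import torch.nn as nn

def dct_basis(h_idx, w_idx, H, W):
    """2-D DCT basis for frequency position (h_idx, w_idx), matching the formula above."""
    i = torch.arange(H).float().view(H, 1)
    j = torch.arange(W).float().view(1, W)
    return (torch.cos(math.pi * h_idx * (i + 0.5) / H) *
            torch.cos(math.pi * w_idx * (j + 0.5) / W))          # (H, W)

class FreqChannelAttention(nn.Module):
    """Each channel's feature map is projected onto one DCT frequency; the resulting
    per-channel scalars are passed through a small feed-forward net whose sigmoid
    outputs re-weight (strengthen or weaken) the channels."""
    def __init__(self, channels, H, W, freq=(0, 0), reduction=16):
        super().__init__()
        self.register_buffer("basis", dct_basis(freq[0], freq[1], H, W))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                 # x: (N, C, H, W)
        freq_val = (x * self.basis).sum(dim=(2, 3))       # (N, C) DCT coefficients
        weight = self.fc(freq_val)                        # learned channel weights
        return x * weight.unsqueeze(-1).unsqueeze(-1)     # channel-wise re-weighting
```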
Step 3, the features obtained in step 2 are encoded using a convolution-enhanced Transformer encoder (Convolution-augmented Transformer, Conformer for short).
The structure of the Conformer encoder is shown in FIG. 4; it consists of an Embedding module and a set of Conformer models. The Embedding module consists of a convolution downsampling layer, which reduces the feature dimension to 1/4 of the original, and a linear layer, which then maps the features to D_k dimensions. The structure of the Conformer model is shown in the dashed box in FIG. 4; it is composed of a feed-forward neural network module, a multi-head self-attention (MHSA) module, a convolution module and a feed-forward neural network module stacked in sequence. To prevent gradient explosion and overfitting, each module is followed by a layer normalization (LayerNorm) and then a random-inactivation (Dropout) layer, a residual connection is used inside each module, and the residual data is the input data. In particular, the feed-forward neural network module consists of a plurality of d-dimensional linear layers, with a Swish activation function layer and a Dropout layer between the linear layers. Compared with other activation functions, the Swish activation function handles nonlinearity better; its mathematical formula is: f(x) = x·σ(x),
where f(x) denotes the Swish activation function, x is the input feature, σ(x) = 1/(1 + e^(−x)) is the sigmoid value restricted to between 0 and 1, and e is the natural constant.
The multi-head self-attention module takes a query Q, a key K and a value V as input, where Q ∈ ℝ^(T×d_Q), K ∈ ℝ^(T×d_K) and V ∈ ℝ^(T×d_V) are two-dimensional real matrices, T denotes the length of the feature sequence, and d_Q, d_K and d_V are the dimensions of the query, key and value respectively. In the encoder Q = K = V, and W_i^Q, W_i^K and W_i^V denote the linear transformation weights of Q, K and V for the i-th attention head. The output matrix of the i-th self-attention head, f_i(Q′_i, K′_i, V′_i), is calculated as:
f_i(Q′_i, K′_i, V′_i) = softmax(Q′_i K′_i^T / √d_K) V′_i,
where Q′_i = QW_i^Q, K′_i = KW_i^K and V′_i = VW_i^V are obtained by applying the attention projections to Q, K and V, and Q′_i, K′_i and V′_i denote the query vectors, key vectors and value vectors respectively.
The convolution module comprises a pointwise convolution layer with an expansion factor of 2, a GLU activation function, a 1-dimensional depthwise convolution layer, a Swish activation function layer, and finally a pointwise convolution layer and a normalization layer.
The formula of the Conformer module is as follows:
x̃_i = x_i + FFN(x_i)
x′_i = x̃_i + MHSA(x̃_i)
x″_i = x′_i + Conv(x′_i)
y_i = LayerNorm(x″_i + FFN(x″_i))
where FFN(x) denotes the feed-forward neural network, MHSA(x) denotes the multi-head attention mechanism, Conv(x) denotes the 1-dimensional convolutional neural network, and LayerNorm(x) denotes layer normalization. x_i is the output of the preceding frequency-domain-attention residual network, x̃_i is the output of the first feed-forward neural network module with its residual added, and so on; x′_i is the output of the multi-head attention module, x″_i is the output of the 1-dimensional convolution module, and y_i is the output of the final layer normalization.
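A compact PyTorch sketch of one such block, following the stacking and residual scheme just described (FFN → MHSA → convolution module → FFN, each with a residual connection, layer normalization and dropout, and a final LayerNorm), is shown below. The dimensions, head count, dropout rate and depthwise-convolution kernel size are assumptions, and nn.SiLU is used as the Swish activation.

```python
import torch
import torch.nn as nn

class Transpose(nn.Module):
    """Swap (B, T, d) <-> (B, d, T) so Conv1d/BatchNorm1d can sit in nn.Sequential."""
    def forward(self, x):
        return x.transpose(1, 2)

class ConformerBlock(nn.Module):
    """One block following the description above:
    x~ = x + FFN(x); x' = x~ + MHSA(x~); x'' = x' + Conv(x'); y = LayerNorm(x'' + FFN(x''))."""
    def __init__(self, d=256, heads=4, conv_kernel=31, p=0.1):
        super().__init__()
        def ffn():   # d-dimensional linear layers with Swish (SiLU) and Dropout between them
            return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(),
                                 nn.Dropout(p), nn.Linear(4 * d, d), nn.Dropout(p))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.mhsa_norm = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv = nn.Sequential(                  # convolution module of the Conformer
            nn.LayerNorm(d), Transpose(),
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),  # pointwise conv (expansion 2) + GLU
            nn.Conv1d(d, d, conv_kernel, padding=conv_kernel // 2, groups=d),  # 1-D depthwise
            nn.SiLU(),                              # Swish
            nn.Conv1d(d, d, 1), nn.BatchNorm1d(d),  # pointwise conv + normalization
            Transpose(), nn.Dropout(p))
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                           # x: (B, T, d)
        x = x + self.ffn1(x)
        h = self.mhsa_norm(x)
        attn_out, _ = self.mhsa(h, h, h)
        x = x + attn_out
        x = x + self.conv(x)
        return self.final_norm(x + self.ffn2(x))
```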
The Conformer encoder combines the advantages of the CNN and the Transformer: the CNN is efficient at acquiring local features, while the Transformer is more effective at extracting long-range sequence dependencies, so a better prediction result is obtained with a smaller number of model parameters than a pure Transformer model.
Step 4, the two decoders are combined to decode the features obtained in step 3. The first decoder is a Transformer-based end-to-end decoder (Transformer Seq2Seq Decoder), which contains 6 layers of Transformer Decoder basic blocks and is trained with a cross-entropy loss. The second decoder relies on linear layers with connectionist temporal classification (Connectionist temporal classification, CTC for short) for training and decoding; it comprises 4 linear layers with corresponding ReLU activation functions, its output is the CTC posterior probability of each input frame, and the whole stack is trained with the CTC loss.
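The two decoding heads might be sketched in PyTorch as follows; the vocabulary size, hidden widths, omitted positional encoding and the placement of the CTC blank symbol are assumptions made for this example.

```python
import torch
import torch.nn as nn

class HybridHeads(nn.Module):
    """Illustrative hybrid decoding heads: a 6-block Transformer Seq2Seq decoder
    (trained with cross-entropy) plus a linear/ReLU stack producing per-frame
    CTC posteriors."""
    def __init__(self, d_model=512, vocab=1000, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)   # positional encoding omitted for brevity
        layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.attn_decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.attn_out = nn.Linear(d_model, vocab)
        self.ctc_head = nn.Sequential(              # 4 linear layers with ReLU in between
            nn.Linear(d_model, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, vocab + 1))             # +1 for the CTC blank symbol

    def forward(self, enc, tgt_tokens, tgt_mask=None):
        # enc: encoder output (B, T, d_model); tgt_tokens: (B, L) previous label tokens
        dec = self.attn_decoder(self.embed(tgt_tokens), enc, tgt_mask=tgt_mask)
        attn_logits = self.attn_out(dec)                    # (B, L, vocab), for cross-entropy
        ctc_logprobs = self.ctc_head(enc).log_softmax(-1)   # (B, T, vocab+1), CTC posteriors
        return attn_logits, ctc_logprobs
```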
The output size of each model in steps 2-4 above is shown in fig. 5. The original continuous lip images have size (T_f × 120² × 1), where T_f is the number of consecutive lip images in a single video and 120 is the height and width of the lip images. The output feature of the 2-dimensional time-domain convolution has size (T_f × 28² × 64), the feature map after the convolutional neural network with the frequency-domain attention mechanism has size (T_f × 2048), and finally the Conformer Encoder output has size (T_f × 512).
Step 5, the data obtained in step 1 are divided into a training set, a validation set and a test set in the proportions of 80%, 10% and 10%; using the model constructed in steps 2-4 and the mixed CTC/Attention loss function, the training data are used for training and the validation set data are used for model validation.
The result of the hybrid CTC/Attention decoding requires the calculation of a loss value, namely the conditional probabilities below. For the hybrid CTC/Attention loss, assume x = [x_1, …, x_T] is the output sequence of the model constructed in step 4 and y = [y_1, …, y_L] is the target label corresponding to the features, where T and L denote the lengths of the input features and the target label, respectively.
First, the CTC loss function assumes that each output prediction is conditionally independent given the input, and has the form:
p_CTC(y | x) ≈ ∏_{l=1}^{L} p(y_l | x),
where x is the output of the linear layers and y is the ground-truth label corresponding to the features; p denotes a conditional probability, and p_CTC denotes the probability of the label y given x computed by the CTC decoder (i.e., the conditional probability). L denotes the sequence length and l indexes positions in the sequence.
Second, the attention-based loss directly estimates the posterior probability by the chain rule, and the model can be expressed as:
p_CE(y | x′) = ∏_{l=1}^{L} p(y_l | y_1, …, y_{l−1}, x′),
where x′ is the output of the Transformer Encoder and y is the ground-truth label corresponding to the features; p_CE denotes the conditional probability computed from the result of the Attention decoding, and this probability gives the loss value.
The final hybrid CTC/Attention loss L_CTC/Attention is obtained by combining the two loss functions, with the specific formula:
L_CTC/Attention = α·p_CTC + (1−α)·p_CE,
where α is the weight of the CTC loss and (1−α) is the weight of the attention-based loss.
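Assuming decoder heads like those sketched above, the combined loss could be computed with PyTorch's built-in CTCLoss and CrossEntropyLoss as below; the value of α and the padding conventions are illustrative assumptions.

```python
import torch
import torch.nn as nn

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)        # -100 marks padded label positions

def hybrid_loss(attn_logits, ctc_logprobs, ctc_targets, ce_targets,
                feat_lens, target_lens, alpha=0.3):
    """L = alpha * L_CTC + (1 - alpha) * L_CE, matching the combined loss above.
    ctc_targets: (B, L) labels padded with 0; ce_targets: (B, L) labels padded with -100.
    alpha = 0.3 is an illustrative value, not one fixed by the patent."""
    # nn.CTCLoss expects (T, B, vocab) log-probabilities plus per-sample lengths
    l_ctc = ctc_loss_fn(ctc_logprobs.transpose(0, 1), ctc_targets, feat_lens, target_lens)
    # cross-entropy over the attention decoder's per-token predictions
    l_ce = ce_loss_fn(attn_logits.reshape(-1, attn_logits.size(-1)), ce_targets.reshape(-1))
    return alpha * l_ctc + (1 - alpha) * l_ce
```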
the model structure of step 2-5 is shown in fig. 6, and lip data obtained through data processing sequentially passes through a 2-dimensional time domain convolutional neural network, a convolutional neural network of a frequency domain Attention mechanism and a Conformer network to obtain a prediction result, and decoding is performed through two decoders to calculate a Loss value.
Step 6, the model trained in step 5 is tested: the data of the test set are input into the model, and the RNN-based language model is used to further improve the model's result.
Specifically, given a dictionary V with ω_i ∈ V, the language model assigns to an arbitrary sequence (ω_1, ω_2, ω_3, …, ω_n) the probability P(ω_1, ω_2, ω_3, …, ω_n) of being a sentence. This probability can be factored as P(ω_1, …, ω_n) = ∏_{i=1}^{n} P(ω_i | ω_1, …, ω_{i−1}), so a model that computes the values P(ω_i | ω_1, …, ω_{i−1}) is regarded as a language model. This embodiment uses an RNN-based language model, which, compared with N-gram and feed-forward-neural-network language models, breaks the Markov assumption and can depend on all the words before the current position when computing the probability; ω_i denotes a word in the dictionary.
As shown in FIG. 7, the conditional probability of each position in the RNN language model is determined by the outputs of all the RNN units before that position, specifically P(ω_i | ω_1, …, ω_{i−1}) = f_θ(ω_1, ω_2, ω_3, …, ω_{i−1}), where f is understood as the RNN network model and θ denotes the parameters of the neural network model.
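An RNN language model of this kind might be sketched as follows (a minimal example; the use of an LSTM and the vocabulary/embedding/hidden sizes are assumptions — the patent only requires an RNN that conditions on the full history).

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Computes P(w_i | w_1, ..., w_{i-1}) with an LSTM over the whole history."""
    def __init__(self, vocab=1000, emb=256, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        # tokens: (B, i-1) word indices of the history w_1 .. w_{i-1}
        h, state = self.rnn(self.embed(tokens), state)
        logprobs = self.out(h).log_softmax(-1)        # (B, i-1, vocab)
        return logprobs, state                         # last step gives log P(w_i | history)
```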
The results of the decoder decoding and the weighting of the language model are combined by shallow fusion as described in the formula:
ŷ = argmax_{y ∈ Ŷ} { λ·log p_CTC(y | x) + (1−λ)·log p_CE(y | x) + β·log p_LM(y) },
where Ŷ is the set of candidate predictions of the target label; λ is the relative CTC weight in the decoding stage and β is the relative weight of the language model, set to 0.1 and 0.6 respectively; ŷ denotes the final prediction of the model for the frame pictures, and p_LM denotes the probability assigned by the language model to the current y (the ground-truth label corresponding to the features).
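A sketch of the shallow-fusion scoring of one beam-search candidate, using the λ = 0.1 and β = 0.6 weights above; the function names and the tuple layout of the candidate list are assumptions made for the example.

```python
def shallow_fusion_score(log_p_ctc, log_p_att, log_p_lm, lam=0.1, beta=0.6):
    """Score of one candidate sequence y:
    lambda * log p_CTC(y|x) + (1 - lambda) * log p_CE(y|x) + beta * log p_LM(y)."""
    return lam * log_p_ctc + (1 - lam) * log_p_att + beta * log_p_lm

def pick_best(candidates):
    """candidates: list of (y, log_p_ctc, log_p_att, log_p_lm) tuples, e.g. from
    beam search; returns the final prediction y-hat with the highest fused score."""
    return max(candidates, key=lambda c: shallow_fusion_score(c[1], c[2], c[3]))[0]
```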
Compared with the 1-dimensional TCN, the embodiment of the invention designs a 2-dimensional TCN to replace the original 3-dimensional convolution. At the same time, the downsampling process in the Conformer encoder would greatly reduce the number of feature frames, so the source code is modified and part of the downsampling is removed, so that the sequence length is only slightly reduced during downsampling. In the embodiment of the invention, the 2D-TCN is mainly used to capture the temporally related features before and after each frame, and the frequency-domain attention mechanism is combined to perform deep feature extraction on the features output by the 2D-TCN, strengthening good features and weakening poor ones.
The ablation experiment of the embodiment of the invention on the LRS2 dataset is shown in Table 1:
Table 1 Error-rate statistics
Method | WER (word error rate)
Baseline | 63.5%
+ hybrid CTC/Attention decoding | 49.0%
+ convolution-enhanced Transformer encoder (Conformer encoder) | 42.4%
+ frequency-domain attention mechanism | 37.7%
+ 2-dimensional time-domain convolution | 37.1%
The Baseline adopts 3D convolution, ResNet-50, a multi-head-attention Transformer and CTC decoding, giving an error rate of 63.5%. Adopting hybrid CTC/Attention decoding reduces the error rate to 49%; adding the convolution-enhanced Transformer (Conformer) further reduces it to 42.4%; adding the frequency-domain attention mechanism further reduces it to 37.7%; and finally adding the 2-dimensional time-domain convolutional neural network reduces the error rate to 37.1%.
The model parameters are shown in Table 2:
Table 2 Model parameter counts
Model before replacement | Replaced model
Transformer Encoder: 20.2M | Conformer Encoder: 19.7M
ResNet-50: 23.5M | ResNet-34 based on the frequency-domain attention mechanism: 21.5M
A small-scale Conformer Encoder is used, and the Conformer Encoder downsampling layer is modified so that its parameter count is reduced. Although the ResNet-34 based on the frequency-domain attention mechanism has a depth of only 34, after adding the frequency-domain attention mechanism its accuracy is slightly higher than that of ResNet-50, and its parameter count is only 21.5M. The model parameters are counted in individual parameters; M denotes millions, so 20.2M means 20.2 million parameters.
The lip language identification method based on the multi-attention mechanism according to the embodiment of the invention can be stored in a computer readable storage medium if the lip language identification method is realized in the form of a software functional module and sold or used as an independent product. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the multi-attentiveness-mechanism-based lip recognition method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A lip language identification method based on a multi-attention mechanism is characterized by comprising the following steps:
step 1, preprocessing a video data set to obtain continuous grayscale images of the speaker's mouth, and performing data enhancement processing at the same time;
step 2, performing preliminary feature extraction on continuous lip images through a time domain convolutional neural network, and then performing deep feature extraction through a residual convolutional neural network based on a frequency domain attention mechanism;
step 3, encoding the extracted features with a convolution-enhanced Transformer encoder;
step 4, performing mixed CTC/Attention decoding on the characteristics;
step 5, dividing a training set according to a proportion, and training through the constructed model and a loss function of the mixed CTC/Attention;
step 6, further improving the output result of the model through the RNN-based language model.
2. The method for lip recognition based on the multi-attention mechanism according to claim 1, wherein the step 1 comprises:
step 1-1, separating image data of each frame in a video data set to obtain continuous image data corresponding to a video;
step 1-2, carrying out gray processing on continuous image data to eliminate the influence caused by lip color, carrying out face detection through a face recognition library, marking face key points, obtaining lip center point coordinates according to the coordinates of the lip key points, and cutting out lip images comprising all lips, part of chin and part of environment by taking the center point as an origin;
step 1-3, carrying out data enhancement processing on the lip images, carrying out random horizontal or vertical overturn on the lip images of each frame, and randomly covering partial areas;
and step 1-4, matching the lip continuous images obtained in the step 1-3 with corresponding text contents, and directly taking the corresponding text contents as labels of the continuous images.
3. The method for recognizing lip language based on a multi-attention mechanism according to claim 1, wherein in step 2, a 2-dimensional time-domain convolutional neural network is adopted and feature extraction is performed through causal convolution and dilated convolution; the feature values of a lower causal-convolution layer come from several adjacent feature maps of the layer above, so as to extract temporal features; the elements of the two-dimensional dilated convolution kernel acting on the lip image are not adjacent but spaced apart, so as to extract global information to the greatest extent.
4. The method for recognizing lip language based on multi-attention mechanism according to claim 1, wherein in the step 2, a residual convolution neural network based on a frequency domain attention mechanism converts a multi-channel feature map output by each Block of Res-Net into a corresponding frequency domain map, and a value of a certain position in the frequency domain map is selected, and the value is learned by a feedforward neural network to finally obtain the weight of the channel; the channel weight is multiplied by the feature map on the channel to play a role in attention.
5. The multi-attention-mechanism-based lip recognition method of claim 1, wherein in step 3, the convolution-enhanced Transformer encoder comprises an Embedding module and a set of Conformer models; the Embedding module comprises a convolution downsampling layer and a linear layer, wherein the convolution downsampling layer reduces the feature dimension and the linear layer then maps the features to D_k dimensions; the Conformer model is formed by sequentially stacking a feed-forward neural network module, a multi-head self-attention module, a convolution module and a feed-forward neural network module, wherein each module is followed by a layer normalization and then a random-inactivation (dropout) layer, a residual connection is used inside each module, and the residual data is the input data, so as to prevent gradient explosion and overfitting.
6. The multi-attention-mechanism-based lip recognition method of claim 1, wherein the feed-forward neural network module is composed of a plurality of d-dimensional linear layers, with a Swish activation function layer and a Dropout layer between the linear layers, the Swish activation function being f(x) = x·σ(x),
where x is the input feature, σ(x) = 1/(1 + e^(−x)) is the sigmoid intermediate value, and e is the natural constant;
the multi-head self-attention module takes a query Q, a key K and a value V as input, where Q ∈ ℝ^(T×d_Q), K ∈ ℝ^(T×d_K), V ∈ ℝ^(T×d_V), T denotes the length of the feature sequence, and d_Q, d_K and d_V are the dimensions of the query, key and value respectively; in the encoder Q = K = V, and W_i^Q, W_i^K and W_i^V denote the linear transformation weights of Q, K and V for the i-th attention head; the output matrix of the i-th self-attention head, f_i(Q′_i, K′_i, V′_i), is calculated as:
f_i(Q′_i, K′_i, V′_i) = softmax(Q′_i K′_i^T / √d_K) V′_i,
where Q′_i = QW_i^Q, K′_i = KW_i^K, V′_i = VW_i^V, and Q′_i, K′_i, V′_i denote the query vector, the key vector and the value vector, respectively;
the convolution module comprises a point-by-point convolution layer, a GLU activation function, a one-dimensional depth convolution layer, a Swish activation function layer, and finally a point-by-point convolution layer and a normalization layer.
7. The multi-Attention-mechanism-based lip recognition method of claim 1, wherein in step 4, the hybrid CTC/Attention decoding includes two decoders: one decoder is a Transformer-based end-to-end decoder, which comprises six layers of Transformer Decoder basic blocks and is trained with a cross-entropy loss; the other decoder relies on linear layers with connectionist temporal classification (CTC) for training and decoding, comprises four linear layers with corresponding ReLU activation functions, outputs the CTC posterior probability of each input frame, and the whole stack is trained with the CTC loss.
8. The multi-Attention-mechanism-based lip recognition method of claim 1, wherein in step 4, the loss function of the mixed CTC/Attention is as follows:
L_CTC/Attention = α·p_CTC + (1−α)·p_CE,
where L_CTC/Attention denotes the mixed CTC/Attention loss; p_CTC denotes the probability of the label y given x obtained through the CTC decoder, i.e. the conditional probability; x is the output of the linear layers and y is the ground-truth label corresponding to the features; p_CE denotes the loss value based on the attention mechanism; α is the weight of the CTC loss and (1−α) is the weight of the attention-based loss.
9. An electronic device, characterized in that the lip recognition is implemented by the method according to any of claims 1-8.
10. A computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement the multi-attention mechanism based lip language identification method of any one of claims 1-8.
CN202310562028.3A 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism Pending CN116580278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310562028.3A CN116580278A (en) 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310562028.3A CN116580278A (en) 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism

Publications (1)

Publication Number Publication Date
CN116580278A true CN116580278A (en) 2023-08-11

Family

ID=87540910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310562028.3A Pending CN116580278A (en) 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism

Country Status (1)

Country Link
CN (1) CN116580278A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152542A (en) * 2023-10-30 2023-12-01 武昌理工学院 Image classification method and system based on lightweight network
CN117152542B (en) * 2023-10-30 2024-01-30 武昌理工学院 Image classification method and system based on lightweight network
CN117958813A (en) * 2024-03-28 2024-05-03 北京科技大学 ECG (ECG) identity recognition method, system and equipment based on attention depth residual error network

Similar Documents

Publication Publication Date Title
Han et al. A survey on vision transformer
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
CN110633683B (en) Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN116580278A (en) Lip language identification method, equipment and storage medium based on multi-attention mechanism
Wu et al. GINet: Graph interaction network for scene parsing
CN113806587A (en) Multi-mode feature fusion video description text generation method
Bi et al. Iemask r-cnn: Information-enhanced mask r-cnn
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
Zhu et al. Image-text matching with fine-grained relational dependency and bidirectional attention-based generative networks
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN112804558A (en) Video splitting method, device and equipment
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
Wen et al. Fast LiDAR R-CNN: Residual relation-aware region proposal networks for multiclass 3-D object detection
CN113747168A (en) Training method of multimedia data description model and generation method of description information
CN114283181B (en) Dynamic texture migration method and system based on sample
Rui et al. Data Reconstruction based on supervised deep auto-encoder
CN117036368A (en) Image data processing method, device, computer equipment and storage medium
CN114821802A (en) Continuous sign language identification method based on multi-thread mutual distillation and self-distillation
Gao et al. FSOD4RSI: Few-Shot Object Detection for Remote Sensing Images Via Features Aggregation and Scale Attention
Ni et al. Background and foreground disentangled generative adversarial network for scene image synthesis
CN110188706B (en) Neural network training method and detection method based on character expression in video for generating confrontation network
Xu et al. Video Object Segmentation: Tasks, Datasets, and Methods
Chen et al. Lung Segmentation Network Based on Lightweight and Attention Fusion with U-Net
Zhu et al. Lip-Reading Based on Deep Learning Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination