CN116580278A - Lip language identification method, equipment and storage medium based on multi-attention mechanism - Google Patents

Lip language identification method, equipment and storage medium based on multi-attention mechanism

Info

Publication number
CN116580278A
CN116580278A (application CN202310562028.3A)
Authority
CN
China
Prior art keywords
lip
convolution
attention
layer
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310562028.3A
Other languages
Chinese (zh)
Inventor
张晖
杨胜
宝音都古楞
飞龙
巩政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN202310562028.3A priority Critical patent/CN116580278A/en
Publication of CN116580278A publication Critical patent/CN116580278A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language identification method, equipment and storage medium based on a multi-attention mechanism. The lip language identification method comprises the following steps: preprocessing a video data set to obtain continuous grayscale images of the speaker's mouth while performing data enhancement; performing preliminary feature extraction on the continuous lip images through a time-domain convolutional neural network, and performing deep feature extraction through a residual convolutional neural network based on a frequency-domain attention mechanism; encoding the extracted features with a convolution-enhanced Transformer encoder; performing mixed CTC/Attention decoding on the features; training the constructed model with the mixed CTC/Attention loss function; and further improving the output of the model with an RNN-based language model. The invention improves both the accuracy and the speed of lip language identification.

Description

Lip language identification method, equipment and storage medium based on multi-attention mechanism
Technical Field
The invention belongs to the technical field of image recognition, and relates to a lip language recognition method, equipment and a storage medium based on a multi-attention mechanism.
Background
Lip recognition is a technique for recognizing what is being said based on the continuous changes of the lip region. It plays an important role in many fields: for example, it can improve speech recognition in noisy environments, help communication between post-lingually deaf people and hearing people, and add speech content to silent surveillance footage.
In lip recognition, traditional machine learning models perform poorly. With the continuous development of deep learning, early researchers designed models that combine 3D convolution with a deeper ResNet (residual neural network) and an LSTM (long short-term memory network) that captures temporal features to perform lip recognition. Although this greatly improved on traditional machine learning, the results were still unsatisfactory.
Later, researchers designed a temporal convolutional network (Temporal Convolutional Network, abbreviated as TCN) for prediction and improved it with suitable training strategies; densely-connected temporal convolutional networks (Densely-Connected Temporal Convolutional Networks, abbreviated as DC-TCN) were then designed for prediction, further improving the lip recognition effect.
Current researchers have added a multi-head attention mechanism to improve the recognition of homonyms in lip recognition, and have designed encoders based on the Visual Transformer Pooling (VTP) structure, bringing the accuracy of lip recognition to a new level. However, the parameter count of such models is large, making them unsuitable for real-time lip recognition scenarios.
Disclosure of Invention
In order to solve the problems, the invention provides a lip language identification method based on a multi-attention mechanism, which improves the accuracy of lip language identification, improves the speed of lip language identification and solves the problems in the prior art.
A second object of the present invention is to provide an electronic device.
A third object of the present invention is to provide a computer storage medium.
The technical scheme adopted by the invention is that the lip language identification method based on the multi-attention mechanism comprises the following steps:
step 1, preprocessing a video data set to obtain continuous grayscale images of the speaker's mouth, and performing data enhancement processing at the same time;
step 2, performing preliminary feature extraction on continuous lip images through a time domain convolutional neural network, and then performing deep feature extraction through a residual convolutional neural network based on a frequency domain attention mechanism;
step 3, encoding the extracted features with a convolution-enhanced Transformer encoder;
step 4, performing mixed CTC/Attention decoding on the characteristics;
step 5, dividing the data obtained in the step 1 into training sets according to the proportion, and training the training data through the model constructed in the step 2-4 and the loss function of the mixed CTC/Attention;
step 6, further improving the output result of the model through the RNN-based language model.
Further, the step 1 includes:
step 1-1, separating image data of each frame in a video data set to obtain continuous image data corresponding to a video;
step 1-2, carrying out gray processing on continuous image data to eliminate the influence caused by lip color, carrying out face detection through a face recognition library, marking face key points, obtaining lip center point coordinates according to the coordinates of the lip key points, and cutting out lip images comprising all lips, part of chin and part of environment by taking the center point as an origin;
step 1-3, carrying out data enhancement processing on the lip images, carrying out random horizontal or vertical overturn on the lip images of each frame, and randomly covering partial areas;
and step 1-4, matching the lip continuous images obtained in the step 1-3 with corresponding text contents, and directly taking the corresponding text contents as labels of the continuous images.
Further, in step 2, a 2-dimensional time-domain convolutional neural network is adopted and feature extraction is performed through causal convolution and dilated convolution; the feature values of a lower causal-convolution layer come from several adjacent feature maps of the layer above, so as to extract temporal features; the elements of the two-dimensional dilated convolution kernel acting on the lip image are not adjacent but spaced apart, so as to extract global information to the greatest extent.
Further, in the step 2, a residual convolution neural network based on a frequency domain attention mechanism converts a multi-channel feature map output by each Block of the Res-Net into a corresponding frequency domain map, and a value at a certain position in the frequency domain map is selected, and the value is learned through a feedforward neural network to finally obtain the weight of the channel; the channel weight is multiplied by the feature map on the channel to play a role in attention.
Further, in step 3, the convolution-enhanced Transformer encoder includes an Embedding module and a set of Conformer models; the Embedding module comprises a convolution downsampling layer and a linear layer, wherein the convolution downsampling layer reduces the feature dimension and the linear layer then maps the features to D_k dimensions; the Conformer model is formed by sequentially stacking a feed-forward neural network module, a multi-head self-attention module, a convolution module and a feed-forward neural network module, wherein each module is followed by a layer normalization and then a random-inactivation (dropout) layer, a residual connection is used inside each module, and the residual data is the input data, so as to prevent gradient explosion and overfitting.
Further, the feed-forward neural network module is composed of a plurality of d-dimensional linear layers, with a Swish activation function layer and a Dropout layer between the linear layers; the Swish activation function is f(x) = x·σ(x),
where x is the input feature, σ(x) = 1/(1 + e^(−x)) is the sigmoid intermediate value, and e is the natural constant.
The multi-head self-attention module takes a query Q, a key K and a value V as input, where Q ∈ ℝ^(T×d_Q), K ∈ ℝ^(T×d_K), V ∈ ℝ^(T×d_V), T denotes the length of the feature sequence, and d_Q, d_K and d_V are the dimensions of the query, key and value, respectively; in the encoder Q = K = V, and W_i^Q, W_i^K and W_i^V denote the linear transformation weights of Q, K and V for the i-th attention head; the output matrix of the i-th self-attention head, f_i(Q′_i, K′_i, V′_i), is calculated as:
f_i(Q′_i, K′_i, V′_i) = softmax(Q′_i K′_i^T / √d_K) V′_i,
where Q′_i = QW_i^Q, K′_i = KW_i^K, V′_i = VW_i^V, and Q′_i, K′_i, V′_i denote the query vector, the key vector and the value vector, respectively;
the convolution module comprises a point-by-point convolution layer, a GLU activation function, a one-dimensional depth convolution layer, a Swish activation function layer, and finally a point-by-point convolution layer and a normalization layer.
Further, in step 4, the hybrid CTC/Attention decoding includes two decoders: one is a Transformer-based end-to-end decoder, which contains six layers of Transformer Decoder basic blocks and is trained with a cross-entropy loss; the other decoder relies on linear layers with connectionist temporal classification (CTC) for training and decoding, contains four linear layers with corresponding ReLU activation functions, outputs the CTC posterior probability of each input frame, and the whole stack is trained with the CTC loss.
Further, in step 4, the loss function of the mixed CTC/Attention is as follows:
L_CTC/Attention = α·p_CTC + (1−α)·p_CE,
where L_CTC/Attention denotes the mixed CTC/Attention loss; p_CTC denotes the probability of the label y given x obtained through the CTC decoder, i.e. the conditional probability; x is the output of the linear layers and y is the ground-truth label corresponding to the features; p_CE denotes the loss value based on the attention mechanism; α is the weight of the CTC loss and (1−α) is the weight of the attention-based loss.
An electronic device adopts the method to realize lip language identification.
A computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement the multi-attention mechanism based lip language identification method described above.
The beneficial effects of the invention are as follows:
according to the embodiment of the invention, through using 2-dimensional time domain convolution and adding a frequency domain attention mechanism and a multi-head attention mechanism, the lip language recognition effect is further improved, the error rate is reduced, and meanwhile, a lightweight residual convolution neural network (ResNet-18) and a small-scale convolution enhancement transducer model (small Conformer) are used, so that the overall parameter number of the model is reduced; the accuracy rate of lip language identification is improved, and meanwhile, the speed of lip language identification is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an overall flow chart of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a 2-dimensional time domain convolutional neural network in an embodiment of the present invention.
Fig. 3 is a schematic diagram of a frequency domain attention mechanism according to an embodiment of the present invention.
Fig. 4 is a configuration diagram of a Conformer in an embodiment of the present invention.
FIG. 5 is a schematic diagram of exemplary output feature sizes in accordance with an embodiment of the present invention.
FIG. 6 is a diagram of an overall model structure in an embodiment of the invention.
FIG. 7 is a diagram of an RNN-based language model according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
A lip language identification method based on a multi-attention mechanism, as shown in fig. 1, comprises three parts: data processing, model construction and result improvement. In terms of data processing, grayscale images of the mouth region are cropped out and a data enhancement method is used to prevent model overfitting. In terms of model construction, feature extraction and feature encoding are performed on the continuous lip images through a 2-dimensional time-domain convolutional neural network (2D Temporal Convolutional Network, abbreviated as 2D-TCN), a convolutional neural network with a frequency-domain attention mechanism, and a convolution-enhanced Transformer encoder (Convolution-augmented Transformer, abbreviated as Conformer) with a multi-head attention mechanism; the features are decoded by two decoders, and the values obtained from the corresponding loss functions are weighted. For result improvement, an RNN-based language model is used to improve the final result. To increase the recognition speed of the model, a lightweight residual convolutional neural network (ResNet-18) is used as the base model for the frequency-domain attention mechanism, and a small-scale convolution-enhanced Transformer model (small Conformer) is used.
The method specifically comprises the following steps:
Step 1, preprocessing the video data set to obtain continuous grayscale images of the speaker's mouth, while performing data enhancement to improve the generalization capability of the model.
In step 1-1, the image data of each frame in the video data set is separated using the FFmpeg (Fast Forward MPEG) program to obtain the continuous image data corresponding to each video.
Step 1-2, grayscale processing is performed on the continuous image data from step 1-1 to eliminate the influence of lip color; face detection is performed using Dlib, a cross-platform face recognition library developed in C++, and 68 facial key points are marked; the coordinates of the lip center point are obtained from the coordinates of the lip key points, and a lip image containing the entire lips, part of the chin and part of the surrounding environment is cropped with the center point as the origin, yielding lip images of size 120px × 120px (a code sketch of this cropping and of the augmentation in step 1-3 is given after step 1-4).
Step 1-3, data enhancement is performed on the lip images: each frame's lip image is randomly flipped horizontally or vertically, and a partial area (10%) is randomly covered, to prevent overfitting during training.
Step 1-4, matching the lip continuous images obtained in step 1-3 with corresponding text contents, wherein the corresponding text sentence contents are directly used as labels of the continuous images due to sentence-level lip recognition.
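For illustration only, the following Python sketch shows how steps 1-2 and 1-3 could be implemented with OpenCV and Dlib. The landmark-predictor file name, the use of mouth landmarks 48-67, and the occlusion-patch geometry are assumptions made for this example; the patent only specifies Dlib's 68-point landmarks, a 120px × 120px crop and roughly 10% random occlusion.

```python
import random
import cv2
import dlib
import numpy as np

# Hypothetical paths/parameters: the patent only states that Dlib's 68 facial
# landmarks are used and that 120px x 120px grayscale lip crops are produced.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
CROP = 120

def crop_lip(frame_bgr):
    """Grayscale a frame, locate the lip centre from the 68 landmarks
    (points 48-67 cover the mouth region) and cut a CROP x CROP patch."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)            # lip centre point
    half = CROP // 2
    patch = gray[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(patch, (CROP, CROP))

def augment(lip_img, occlude_ratio=0.10):
    """Random horizontal/vertical flip and random occlusion of ~10% of the area."""
    if random.random() < 0.5:
        lip_img = cv2.flip(lip_img, 1 if random.random() < 0.5 else 0)
    h, w = lip_img.shape
    ph, pw = int(h * occlude_ratio ** 0.5), int(w * occlude_ratio ** 0.5)
    y, x = random.randint(0, h - ph), random.randint(0, w - pw)
    lip_img = lip_img.copy()
    lip_img[y:y + ph, x:x + pw] = 0                     # cover a random partial area
    return lip_img
```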
Step 2, preliminary feature extraction is performed on the continuous lip images through a 2-dimensional time-domain convolutional neural network (2D Temporal Convolutional Network, 2D-TCN for short), and deep feature extraction is then performed with a residual convolutional neural network based on a frequency-domain attention mechanism.
The 2D-TCN performs feature extraction through causal convolution and dilated convolution. In the causal convolution, shown in fig. 2, the feature values of a lower layer come from several adjacent feature maps of the layer above, so that temporal features can be extracted; the elements of the two-dimensional dilated convolution kernel acting on the lip image are not adjacent but spaced apart, as shown in fig. 2, so that global information can be extracted to the greatest extent.
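A minimal PyTorch sketch of such a 2-D temporal block is given below, assuming the frame axis is treated as one convolution dimension so that causality is obtained by padding only the past side, while dilation provides the spaced-apart kernel; the channel counts, kernel sizes and flattened spatial axis are illustrative assumptions, not the patent's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Causal2DTemporalBlock(nn.Module):
    """Illustrative 2-D temporal convolution block: causal along the frame axis
    (dim 2) and dilated along a flattened spatial axis (dim 3)."""
    def __init__(self, in_ch, out_ch, k_t=3, k_s=3, dilation=2):
        super().__init__()
        self.k_t = k_t
        self.conv = nn.Conv2d(in_ch, out_ch,
                              kernel_size=(k_t, k_s),
                              dilation=(1, dilation),
                              padding=(0, dilation * (k_s - 1) // 2))
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, channels, T, spatial)
        # pad only on the "past" side of the time axis, so each output frame
        # depends on the current and preceding frames only (causal convolution)
        x = F.pad(x, (0, 0, self.k_t - 1, 0))
        return self.relu(self.conv(x))

# toy usage: 2 clips, 1 channel, 75 frames, 120 spatial columns
feats = Causal2DTemporalBlock(1, 64)(torch.randn(2, 1, 75, 120))
```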
As shown in fig. 3, the residual convolutional neural network with the frequency-domain attention mechanism converts the multi-channel feature map output by each Block of the ResNet into a corresponding frequency-domain map, selects the value at a certain position in the frequency-domain map, and learns this value through a feed-forward neural network to finally obtain the weight of the channel. As shown in fig. 3, the spatial-domain map is converted to a frequency-domain map using a 2-dimensional DCT (discrete cosine transform); the DCT formula is as follows:
f^2d_{h,w} = Σ_{i=0}^{H−1} Σ_{j=0}^{W−1} x^2d_{i,j} · cos(πh(i + 1/2)/H) · cos(πw(j + 1/2)/W),
s.t. h ∈ {0, 1, …, H−1}, w ∈ {0, 1, …, W−1},
where H and W denote the height and width of the image respectively, x^2d_{i,j} denotes the original image value at coordinates (i, j) in the 2-dimensional image, and f^2d_{h,w} is the 2-dimensional DCT frequency-domain value at position (h, w). The channel weight is multiplied by the feature map of that channel to play the role of attention, i.e., good feature maps are strengthened and poor feature maps are weakened. The role of the 2-dimensional time-domain convolution in the invention is to capture the continuously changing features of the lips; compared with 3-dimensional convolution, the causal convolution in the 2-dimensional time-domain convolution can capture longer sequence information, while the dilated convolution in it has a larger receptive field and can capture more features.
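The following PyTorch sketch illustrates this kind of frequency-domain channel attention under stated assumptions: a single DCT frequency position per attention module, a two-layer feed-forward network with a sigmoid output as the "learning" step, and illustrative channel/reduction sizes; the patent does not specify these choices.

```python
import math
import torch
import torch.nn as nn

def dct_basis(h_idx, w_idx, H, W):
    """2-D DCT basis for frequency position (h_idx, w_idx), matching the formula above."""
    i = torch.arange(H).float().view(H, 1)
    j = torch.arange(W).float().view(1, W)
    return (torch.cos(math.pi * h_idx * (i + 0.5) / H) *
            torch.cos(math.pi * w_idx * (j + 0.5) / W))          # (H, W)

class FreqChannelAttention(nn.Module):
    """Each channel's feature map is projected onto one DCT frequency; the resulting
    per-channel scalars are passed through a small feed-forward net whose sigmoid
    outputs re-weight (strengthen or weaken) the channels."""
    def __init__(self, channels, H, W, freq=(0, 0), reduction=16):
        super().__init__()
        self.register_buffer("basis", dct_basis(freq[0], freq[1], H, W))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                 # x: (N, C, H, W)
        freq_val = (x * self.basis).sum(dim=(2, 3))       # (N, C) DCT coefficients
        weight = self.fc(freq_val)                        # learned channel weights
        return x * weight.unsqueeze(-1).unsqueeze(-1)     # channel-wise re-weighting
```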
Step 3, the features obtained in step 2 are encoded using a convolution-enhanced Transformer encoder (Convolution-augmented Transformer, Conformer for short).
The structure of the Conformer encoder is shown in FIG. 4; it consists of an Embedding module and a set of Conformer models. The Embedding module consists of a convolution downsampling layer, which reduces the feature dimension to 1/4 of the original, and a linear layer, which then maps the features to D_k dimensions. The structure of the Conformer model is shown in the dashed box in FIG. 4; it is composed of a feed-forward neural network module, a multi-head self-attention (MHSA) module, a convolution module and a feed-forward neural network module stacked in sequence. To prevent gradient explosion and overfitting, each module is followed by a layer normalization (LayerNorm) and then a random-inactivation (Dropout) layer, a residual connection is used inside each module, and the residual data is the input data. In particular, the feed-forward neural network module consists of a plurality of d-dimensional linear layers, with a Swish activation function layer and a Dropout layer between the linear layers. Compared with other activation functions, the Swish activation function handles nonlinearity better; its mathematical formula is: f(x) = x·σ(x),
where f(x) denotes the Swish activation function, x is the input feature, σ(x) = 1/(1 + e^(−x)) is the sigmoid value restricted to between 0 and 1, and e is the natural constant.
The multi-head self-attention module takes a query Q, a key K and a value V as input, where Q ∈ ℝ^(T×d_Q), K ∈ ℝ^(T×d_K) and V ∈ ℝ^(T×d_V) are two-dimensional real matrices, T denotes the length of the feature sequence, and d_Q, d_K and d_V are the dimensions of the query, key and value respectively. In the encoder Q = K = V, and W_i^Q, W_i^K and W_i^V denote the linear transformation weights of Q, K and V for the i-th attention head. The output matrix of the i-th self-attention head, f_i(Q′_i, K′_i, V′_i), is calculated as:
f_i(Q′_i, K′_i, V′_i) = softmax(Q′_i K′_i^T / √d_K) V′_i,
where Q′_i = QW_i^Q, K′_i = KW_i^K and V′_i = VW_i^V are obtained by applying the attention projections to Q, K and V, and Q′_i, K′_i and V′_i denote the query vectors, key vectors and value vectors respectively.
The convolution module comprises a pointwise convolution layer with an expansion factor of 2, a GLU activation function, a 1-dimensional depthwise convolution layer, a Swish activation function layer, and finally a pointwise convolution layer and a normalization layer.
The formula of the Conformer module is as follows:
x̃_i = x_i + FFN(x_i)
x′_i = x̃_i + MHSA(x̃_i)
x″_i = x′_i + Conv(x′_i)
y_i = LayerNorm(x″_i + FFN(x″_i))
where FFN(x) denotes the feed-forward neural network, MHSA(x) denotes the multi-head attention mechanism, Conv(x) denotes the 1-dimensional convolutional neural network, and LayerNorm(x) denotes layer normalization. x_i is the output of the preceding frequency-domain-attention residual network, x̃_i is the output of the first feed-forward neural network module with its residual added, and so on; x′_i is the output of the multi-head attention module, x″_i is the output of the 1-dimensional convolution module, and y_i is the output of the final layer normalization.
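A compact PyTorch sketch of one such block, following the stacking and residual scheme just described (FFN → MHSA → convolution module → FFN, each with a residual connection, layer normalization and dropout, and a final LayerNorm), is shown below. The dimensions, head count, dropout rate and depthwise-convolution kernel size are assumptions, and nn.SiLU is used as the Swish activation.

```python
import torch
import torch.nn as nn

class Transpose(nn.Module):
    """Swap (B, T, d) <-> (B, d, T) so Conv1d/BatchNorm1d can sit in nn.Sequential."""
    def forward(self, x):
        return x.transpose(1, 2)

class ConformerBlock(nn.Module):
    """One block following the description above:
    x~ = x + FFN(x); x' = x~ + MHSA(x~); x'' = x' + Conv(x'); y = LayerNorm(x'' + FFN(x''))."""
    def __init__(self, d=256, heads=4, conv_kernel=31, p=0.1):
        super().__init__()
        def ffn():   # d-dimensional linear layers with Swish (SiLU) and Dropout between them
            return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(),
                                 nn.Dropout(p), nn.Linear(4 * d, d), nn.Dropout(p))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.mhsa_norm = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv = nn.Sequential(                  # convolution module of the Conformer
            nn.LayerNorm(d), Transpose(),
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),  # pointwise conv (expansion 2) + GLU
            nn.Conv1d(d, d, conv_kernel, padding=conv_kernel // 2, groups=d),  # 1-D depthwise
            nn.SiLU(),                              # Swish
            nn.Conv1d(d, d, 1), nn.BatchNorm1d(d),  # pointwise conv + normalization
            Transpose(), nn.Dropout(p))
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                           # x: (B, T, d)
        x = x + self.ffn1(x)
        h = self.mhsa_norm(x)
        attn_out, _ = self.mhsa(h, h, h)
        x = x + attn_out
        x = x + self.conv(x)
        return self.final_norm(x + self.ffn2(x))
```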
The Conformer encoder combines the advantages of the CNN and the Transformer: the CNN is efficient at acquiring local features, while the Transformer is more effective at extracting long-range sequence dependencies, so a better prediction result is obtained with a smaller number of model parameters than a pure Transformer model.
Step 4, the two decoders are combined to decode the features obtained in step 3. The first decoder is a Transformer-based end-to-end decoder (Transformer Seq2Seq Decoder), which contains 6 layers of Transformer Decoder basic blocks and is trained with a cross-entropy loss. The second decoder relies on linear layers with connectionist temporal classification (Connectionist temporal classification, CTC for short) for training and decoding; it comprises 4 linear layers with corresponding ReLU activation functions, its output is the CTC posterior probability of each input frame, and the whole stack is trained with the CTC loss.
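The two decoding heads might be sketched in PyTorch as follows; the vocabulary size, hidden widths, omitted positional encoding and the placement of the CTC blank symbol are assumptions made for this example.

```python
import torch
import torch.nn as nn

class HybridHeads(nn.Module):
    """Illustrative hybrid decoding heads: a 6-block Transformer Seq2Seq decoder
    (trained with cross-entropy) plus a linear/ReLU stack producing per-frame
    CTC posteriors."""
    def __init__(self, d_model=512, vocab=1000, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)   # positional encoding omitted for brevity
        layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.attn_decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.attn_out = nn.Linear(d_model, vocab)
        self.ctc_head = nn.Sequential(              # 4 linear layers with ReLU in between
            nn.Linear(d_model, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, vocab + 1))             # +1 for the CTC blank symbol

    def forward(self, enc, tgt_tokens, tgt_mask=None):
        # enc: encoder output (B, T, d_model); tgt_tokens: (B, L) previous label tokens
        dec = self.attn_decoder(self.embed(tgt_tokens), enc, tgt_mask=tgt_mask)
        attn_logits = self.attn_out(dec)                    # (B, L, vocab), for cross-entropy
        ctc_logprobs = self.ctc_head(enc).log_softmax(-1)   # (B, T, vocab+1), CTC posteriors
        return attn_logits, ctc_logprobs
```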
The output size of each model in steps 2-4 above is shown in fig. 5. The original continuous lip images have size (T_f × 120² × 1), where T_f is the number of consecutive lip images in a single video and 120 is the height and width of the lip images. The output feature of the 2-dimensional time-domain convolution has size (T_f × 28² × 64), the feature map after the convolutional neural network with the frequency-domain attention mechanism has size (T_f × 2048), and finally the Conformer Encoder output has size (T_f × 512).
Step 5, the data obtained in step 1 are divided into a training set, a validation set and a test set in the proportions of 80%, 10% and 10%; using the model constructed in steps 2-4 and the mixed CTC/Attention loss function, the training data are used for training and the validation set data are used for model validation.
The result of the hybrid CTC/Attention decoding requires the calculation of a loss value, namely the conditional probabilities below. For the hybrid CTC/Attention loss, assume x = [x_1, …, x_T] is the output sequence of the model constructed in step 4 and y = [y_1, …, y_L] is the target label corresponding to the features, where T and L denote the lengths of the input features and the target label, respectively.
First, the CTC loss function assumes that each output prediction is conditionally independent given the input, and has the form:
p_CTC(y | x) ≈ ∏_{l=1}^{L} p(y_l | x),
where x is the output of the linear layers and y is the ground-truth label corresponding to the features; p denotes a conditional probability, and p_CTC denotes the probability of the label y given x computed by the CTC decoder (i.e., the conditional probability). L denotes the sequence length and l indexes positions in the sequence.
Second, the attention-based loss directly estimates the posterior probability by the chain rule, and the model can be expressed as:
p_CE(y | x′) = ∏_{l=1}^{L} p(y_l | y_1, …, y_{l−1}, x′),
where x′ is the output of the Transformer Encoder and y is the ground-truth label corresponding to the features; p_CE denotes the conditional probability computed from the result of the Attention decoding, and this probability gives the loss value.
The final hybrid CTC/Attention loss L_CTC/Attention is obtained by combining the two loss functions, with the specific formula:
L_CTC/Attention = α·p_CTC + (1−α)·p_CE,
where α is the weight of the CTC loss and (1−α) is the weight of the attention-based loss.
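Assuming decoder heads like those sketched above, the combined loss could be computed with PyTorch's built-in CTCLoss and CrossEntropyLoss as below; the value of α and the padding conventions are illustrative assumptions.

```python
import torch
import torch.nn as nn

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)        # -100 marks padded label positions

def hybrid_loss(attn_logits, ctc_logprobs, ctc_targets, ce_targets,
                feat_lens, target_lens, alpha=0.3):
    """L = alpha * L_CTC + (1 - alpha) * L_CE, matching the combined loss above.
    ctc_targets: (B, L) labels padded with 0; ce_targets: (B, L) labels padded with -100.
    alpha = 0.3 is an illustrative value, not one fixed by the patent."""
    # nn.CTCLoss expects (T, B, vocab) log-probabilities plus per-sample lengths
    l_ctc = ctc_loss_fn(ctc_logprobs.transpose(0, 1), ctc_targets, feat_lens, target_lens)
    # cross-entropy over the attention decoder's per-token predictions
    l_ce = ce_loss_fn(attn_logits.reshape(-1, attn_logits.size(-1)), ce_targets.reshape(-1))
    return alpha * l_ctc + (1 - alpha) * l_ce
```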
the model structure of step 2-5 is shown in fig. 6, and lip data obtained through data processing sequentially passes through a 2-dimensional time domain convolutional neural network, a convolutional neural network of a frequency domain Attention mechanism and a Conformer network to obtain a prediction result, and decoding is performed through two decoders to calculate a Loss value.
Step 6, the model trained in step 5 is tested: the data of the test set are input into the model, and the RNN-based language model is used to further improve the model's result.
Specifically, given a dictionary V with ω_i ∈ V, the language model assigns to an arbitrary sequence (ω_1, ω_2, ω_3, …, ω_n) the probability P(ω_1, ω_2, ω_3, …, ω_n) of being a sentence. This probability can be factored as P(ω_1, …, ω_n) = ∏_{i=1}^{n} P(ω_i | ω_1, …, ω_{i−1}), so a model that computes the values P(ω_i | ω_1, …, ω_{i−1}) is regarded as a language model. This embodiment uses an RNN-based language model, which, compared with N-gram and feed-forward-neural-network language models, breaks the Markov assumption and can depend on all the words before the current position when computing the probability; ω_i denotes a word in the dictionary.
As shown in FIG. 7, the conditional probability of each position in the RNN language model is determined by the outputs of all the RNN units before that position, specifically P(ω_i | ω_1, …, ω_{i−1}) = f_θ(ω_1, ω_2, ω_3, …, ω_{i−1}), where f is understood as the RNN network model and θ denotes the parameters of the neural network model.
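An RNN language model of this kind might be sketched as follows (a minimal example; the use of an LSTM and the vocabulary/embedding/hidden sizes are assumptions — the patent only requires an RNN that conditions on the full history).

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Computes P(w_i | w_1, ..., w_{i-1}) with an LSTM over the whole history."""
    def __init__(self, vocab=1000, emb=256, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        # tokens: (B, i-1) word indices of the history w_1 .. w_{i-1}
        h, state = self.rnn(self.embed(tokens), state)
        logprobs = self.out(h).log_softmax(-1)        # (B, i-1, vocab)
        return logprobs, state                         # last step gives log P(w_i | history)
```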
The results of the decoder decoding and the weighting of the language model are combined by shallow fusion as described in the formula:
ŷ = argmax_{y ∈ Ŷ} { λ·log p_CTC(y | x) + (1−λ)·log p_CE(y | x) + β·log p_LM(y) },
where Ŷ is the set of candidate predictions of the target label; λ is the relative CTC weight in the decoding stage and β is the relative weight of the language model, set to 0.1 and 0.6 respectively; ŷ denotes the final prediction of the model for the frame pictures, and p_LM denotes the probability assigned by the language model to the current y (the ground-truth label corresponding to the features).
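A sketch of the shallow-fusion scoring of one beam-search candidate, using the λ = 0.1 and β = 0.6 weights above; the function names and the tuple layout of the candidate list are assumptions made for the example.

```python
def shallow_fusion_score(log_p_ctc, log_p_att, log_p_lm, lam=0.1, beta=0.6):
    """Score of one candidate sequence y:
    lambda * log p_CTC(y|x) + (1 - lambda) * log p_CE(y|x) + beta * log p_LM(y)."""
    return lam * log_p_ctc + (1 - lam) * log_p_att + beta * log_p_lm

def pick_best(candidates):
    """candidates: list of (y, log_p_ctc, log_p_att, log_p_lm) tuples, e.g. from
    beam search; returns the final prediction y-hat with the highest fused score."""
    return max(candidates, key=lambda c: shallow_fusion_score(c[1], c[2], c[3]))[0]
```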
Compared with the 1-dimensional TCN, the embodiment of the invention designs a 2-dimensional TCN to replace the original 3-dimensional convolution. At the same time, the downsampling process in the Conformer encoder would greatly reduce the number of feature frames, so the source code is modified and part of the downsampling is removed, so that the sequence length is only slightly reduced during downsampling. In the embodiment of the invention, the 2D-TCN is mainly used to capture the temporally related features before and after each frame, and the frequency-domain attention mechanism is combined to perform deep feature extraction on the features output by the 2D-TCN, strengthening good features and weakening poor ones.
The ablation experiment of the embodiment of the invention on the LRS2 dataset is shown in Table 1:
Table 1 Error-rate statistics
Method | WER (word error rate)
Baseline | 63.5%
+ hybrid CTC/Attention decoding | 49.0%
+ convolution-enhanced Transformer encoder (Conformer encoder) | 42.4%
+ frequency-domain attention mechanism | 37.7%
+ 2-dimensional time-domain convolution | 37.1%
The Baseline adopts 3D convolution, ResNet-50, a multi-head-attention Transformer and CTC decoding, giving an error rate of 63.5%. Adopting hybrid CTC/Attention decoding reduces the error rate to 49%; adding the convolution-enhanced Transformer (Conformer) further reduces it to 42.4%; adding the frequency-domain attention mechanism further reduces it to 37.7%; and finally adding the 2-dimensional time-domain convolutional neural network reduces the error rate to 37.1%.
The model parameters are shown in Table 2:
Table 2 Model parameter counts
Model before replacement | Replaced model
Transformer Encoder: 20.2M | Conformer Encoder: 19.7M
ResNet-50: 23.5M | ResNet-34 based on the frequency-domain attention mechanism: 21.5M
A small-scale Conformer Encoder is used, and the Conformer Encoder downsampling layer is modified so that its parameter count is reduced. Although the ResNet-34 based on the frequency-domain attention mechanism has a depth of only 34, after adding the frequency-domain attention mechanism its accuracy is slightly higher than that of ResNet-50, and its parameter count is only 21.5M. The model parameters are counted in individual parameters; M denotes millions, so 20.2M means 20.2 million parameters.
The lip language identification method based on the multi-attention mechanism according to the embodiment of the invention can be stored in a computer readable storage medium if the lip language identification method is realized in the form of a software functional module and sold or used as an independent product. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the multi-attentiveness-mechanism-based lip recognition method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A lip language identification method based on a multi-attention mechanism is characterized by comprising the following steps:
step 1, preprocessing a video data set to obtain continuous grayscale images of the speaker's mouth, and performing data enhancement processing at the same time;
step 2, performing preliminary feature extraction on continuous lip images through a time domain convolutional neural network, and then performing deep feature extraction through a residual convolutional neural network based on a frequency domain attention mechanism;
step 3, encoding the extracted features with a convolution-enhanced Transformer encoder;
step 4, performing mixed CTC/Attention decoding on the characteristics;
step 5, dividing a training set according to a proportion, and training through the constructed model and a loss function of the mixed CTC/Attention;
step 6, further improving the output result of the model through the RNN-based language model.
2. The method for lip recognition based on the multi-attention mechanism according to claim 1, wherein the step 1 comprises:
step 1-1, separating image data of each frame in a video data set to obtain continuous image data corresponding to a video;
step 1-2, carrying out gray processing on continuous image data to eliminate the influence caused by lip color, carrying out face detection through a face recognition library, marking face key points, obtaining lip center point coordinates according to the coordinates of the lip key points, and cutting out lip images comprising all lips, part of chin and part of environment by taking the center point as an origin;
step 1-3, carrying out data enhancement processing on the lip images, carrying out random horizontal or vertical overturn on the lip images of each frame, and randomly covering partial areas;
and step 1-4, matching the lip continuous images obtained in the step 1-3 with corresponding text contents, and directly taking the corresponding text contents as labels of the continuous images.
3. The method for recognizing lip language based on a multi-attention mechanism according to claim 1, wherein in step 2, a 2-dimensional time-domain convolutional neural network is adopted and feature extraction is performed through causal convolution and dilated convolution; the feature values of a lower causal-convolution layer come from several adjacent feature maps of the layer above, so as to extract temporal features; the elements of the two-dimensional dilated convolution kernel acting on the lip image are not adjacent but spaced apart, so as to extract global information to the greatest extent.
4. The method for recognizing lip language based on multi-attention mechanism according to claim 1, wherein in the step 2, a residual convolution neural network based on a frequency domain attention mechanism converts a multi-channel feature map output by each Block of Res-Net into a corresponding frequency domain map, and a value of a certain position in the frequency domain map is selected, and the value is learned by a feedforward neural network to finally obtain the weight of the channel; the channel weight is multiplied by the feature map on the channel to play a role in attention.
5. The multi-attention-mechanism-based lip recognition method of claim 1, wherein in step 3, the convolution-enhanced Transformer encoder comprises an Embedding module and a set of Conformer models; the Embedding module comprises a convolution downsampling layer and a linear layer, wherein the convolution downsampling layer reduces the feature dimension and the linear layer then maps the features to D_k dimensions; the Conformer model is formed by sequentially stacking a feed-forward neural network module, a multi-head self-attention module, a convolution module and a feed-forward neural network module, wherein each module is followed by a layer normalization and then a random-inactivation (dropout) layer, a residual connection is used inside each module, and the residual data is the input data, so as to prevent gradient explosion and overfitting.
6. The multi-attention-mechanism-based lip recognition method of claim 1, wherein the feed-forward neural network module is composed of a plurality of d-dimensional linear layers, with a Swish activation function layer and a Dropout layer between the linear layers, the Swish activation function being f(x) = x·σ(x),
where x is the input feature, σ(x) = 1/(1 + e^(−x)) is the sigmoid intermediate value, and e is the natural constant;
the multi-head self-attention module takes a query Q, a key K and a value V as input, where Q ∈ ℝ^(T×d_Q), K ∈ ℝ^(T×d_K), V ∈ ℝ^(T×d_V), T denotes the length of the feature sequence, and d_Q, d_K and d_V are the dimensions of the query, key and value respectively; in the encoder Q = K = V, and W_i^Q, W_i^K and W_i^V denote the linear transformation weights of Q, K and V for the i-th attention head; the output matrix of the i-th self-attention head, f_i(Q′_i, K′_i, V′_i), is calculated as:
f_i(Q′_i, K′_i, V′_i) = softmax(Q′_i K′_i^T / √d_K) V′_i,
where Q′_i = QW_i^Q, K′_i = KW_i^K, V′_i = VW_i^V, and Q′_i, K′_i, V′_i denote the query vector, the key vector and the value vector, respectively;
the convolution module comprises a point-by-point convolution layer, a GLU activation function, a one-dimensional depth convolution layer, a Swish activation function layer, and finally a point-by-point convolution layer and a normalization layer.
7. The multi-Attention-mechanism-based lip recognition method of claim 1, wherein in step 4, the hybrid CTC/Attention decoding includes two decoders: one decoder is a Transformer-based end-to-end decoder, which comprises six layers of Transformer Decoder basic blocks and is trained with a cross-entropy loss; the other decoder relies on linear layers with connectionist temporal classification (CTC) for training and decoding, comprises four linear layers with corresponding ReLU activation functions, outputs the CTC posterior probability of each input frame, and the whole stack is trained with the CTC loss.
8. The multi-Attention-mechanism-based lip recognition method of claim 1, wherein in step 4, the loss function of the mixed CTC/Attention is as follows:
L_CTC/Attention = α·p_CTC + (1−α)·p_CE,
where L_CTC/Attention denotes the mixed CTC/Attention loss; p_CTC denotes the probability of the label y given x obtained through the CTC decoder, i.e. the conditional probability; x is the output of the linear layers and y is the ground-truth label corresponding to the features; p_CE denotes the loss value based on the attention mechanism; α is the weight of the CTC loss and (1−α) is the weight of the attention-based loss.
9. An electronic device, characterized in that the lip recognition is implemented by the method according to any of claims 1-8.
10. A computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement the multi-attention mechanism based lip language identification method of any one of claims 1-8.
CN202310562028.3A 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism Pending CN116580278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310562028.3A CN116580278A (en) 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310562028.3A CN116580278A (en) 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism

Publications (1)

Publication Number Publication Date
CN116580278A true CN116580278A (en) 2023-08-11

Family

ID=87540910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310562028.3A Pending CN116580278A (en) 2023-05-18 2023-05-18 Lip language identification method, equipment and storage medium based on multi-attention mechanism

Country Status (1)

Country Link
CN (1) CN116580278A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152542A (en) * 2023-10-30 2023-12-01 武昌理工学院 Image classification method and system based on lightweight network
CN117152542B (en) * 2023-10-30 2024-01-30 武昌理工学院 Image classification method and system based on lightweight network
CN117958813A (en) * 2024-03-28 2024-05-03 北京科技大学 ECG (ECG) identity recognition method, system and equipment based on attention depth residual error network

Similar Documents

Publication Publication Date Title
Han et al. A survey on vision transformer
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
CN110633683B (en) Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN116580278A (en) Lip language identification method, equipment and storage medium based on multi-attention mechanism
Wu et al. GINet: Graph interaction network for scene parsing
CN113806587A (en) Multi-mode feature fusion video description text generation method
Bi et al. Iemask r-cnn: Information-enhanced mask r-cnn
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
Zhu et al. Image-text matching with fine-grained relational dependency and bidirectional attention-based generative networks
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN112804558A (en) Video splitting method, device and equipment
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
Wen et al. Fast LiDAR R-CNN: Residual relation-aware region proposal networks for multiclass 3-D object detection
CN113747168A (en) Training method of multimedia data description model and generation method of description information
CN114283181B (en) Dynamic texture migration method and system based on sample
Rui et al. Data Reconstruction based on supervised deep auto-encoder
CN117036368A (en) Image data processing method, device, computer equipment and storage medium
CN114821802A (en) Continuous sign language identification method based on multi-thread mutual distillation and self-distillation
Gao et al. FSOD4RSI: Few-Shot Object Detection for Remote Sensing Images Via Features Aggregation and Scale Attention
Ni et al. Background and foreground disentangled generative adversarial network for scene image synthesis
CN110188706B (en) Neural network training method and detection method based on character expression in video for generating confrontation network
Xu et al. Video Object Segmentation: Tasks, Datasets, and Methods
Chen et al. Lung Segmentation Network Based on Lightweight and Attention Fusion with U-Net
Zhu et al. Lip-Reading Based on Deep Learning Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination