CN115910065A - Lip language identification method, system and medium based on subspace sparse attention mechanism


Info

Publication number
CN115910065A
CN115910065A (application CN202211518304.8A)
Authority
CN
China
Prior art keywords: sentence, sequence, attention mechanism, vector, lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211518304.8A
Other languages
Chinese (zh)
Inventor
陈亚雄
赵怡晨
路雄博
邓梦涵
熊盛武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Research Institute Of Wuhan University Of Technology
Original Assignee
Chongqing Research Institute Of Wuhan University Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Research Institute Of Wuhan University Of Technology filed Critical Chongqing Research Institute Of Wuhan University Of Technology
Priority to CN202211518304.8A
Publication of CN115910065A
Legal status: Pending (current)

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a lip language identification method, system and medium based on a subspace sparse attention mechanism. The method comprises the following steps: obtaining a lip region image sequence, and extracting a lip feature sequence based on the lip region image sequence; inputting the lip feature sequence into a preset, fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence; and inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence. The invention enhances context information by constructing a special attention mechanism and predicts a long sentence sequence in a single forward operation, greatly improving inference speed and accuracy.

Description

Lip language identification method, system and medium based on subspace sparse attention mechanism
Technical Field
The invention relates to the technical fields of computer vision and natural language processing, in particular to a lip language identification method and system based on a subspace sparse attention mechanism and a computer readable storage medium.
Background
Lip language identification (also called lip reading) refers to the process of obtaining a speaker's spoken content by interpreting and analyzing the movement of the speaker's lips. Its aim is to supplement auditory information with visual information so that what people express can be obtained accurately even when hearing is impaired. Lip language recognition usually takes silent video as input and outputs speech or text content. Lip language identification technology plays an extremely important role in fields such as assistance for deaf and mute people, daily-life services, and public security.
The development of deep learning has laid a solid foundation for lip language recognition technology. However, as the length of the input video increases, deep models become difficult to train: a specific feature learning method is usually needed, the inference process is complex and computationally time-consuming, and a large number of model parameters must be optimized. Therefore, as demand for lip language identification grows rapidly, improving recognition speed and accuracy as the recognized video length increases has become a critical task facing lip language identification.
Disclosure of Invention
In view of the above, it is desirable to provide a lip language identification method, system and medium based on a subspace sparse attention mechanism, so as to solve the problem that long sentence sequences make the training process difficult to converge quickly.
In one aspect, the invention provides a lip language identification method based on a subspace sparse attention mechanism, which comprises the following steps:
obtaining a lip region image sequence, and extracting a lip feature sequence based on the lip region image sequence;
inputting the lip feature sequence into a preset, fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence;
and inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence.
In some possible implementations, determining the trained phoneme sequence extraction model includes:
initializing an LSTM model, taking a lip feature sequence corresponding to a lip region image sample as a training sample, and inputting the training sample into the LSTM model to obtain a prediction result of a pronunciation phoneme sequence;
obtaining the value of an LSTM model loss function according to the training sample and the prediction result;
and obtaining the fully trained phoneme sequence extraction model according to the value of the LSTM model loss function.
In some possible implementation manners, the sentence inference model with the subspace sparse self-attention mechanism comprises a sentence inference network module, a language model judgment module and an inference sentence sequence module; the inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence comprises:
inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences;
inputting all the transition sentence subsequences into the language model judgment module, and calculating confusion values of all the transition sentence subsequences according to the confusion degree;
and selecting the transition sentence subsequence with the minimum confusion value based on the inference sentence sequence module to obtain a predicted target sentence sequence.
In some possible implementations, the sentence inference network module includes a plurality of sentence subsequence inference sub-modules, each of which includes a multi-head self-attention mechanism module with a mask and a feed-forward network module; inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences includes:
and converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on a multi-head self-attention mechanism module with a mask and a feedforward network module in each sentence subsequence inference submodule.
In some possible implementations, the converting the vector corresponding to the pronunciation phoneme sequence into the corresponding sentence subsequence based on the multi-headed self-attention mechanism module with mask and the feedforward network module in each sentence subsequence inference sub-module includes:
processing the phoneme sequence by using a multi-head self-attention mechanism module with a mask to obtain a first vector;
multiplying the first vector by the vector corresponding to the pronunciation phoneme sequence to obtain a second vector;
performing layer normalization on the second vector and inputting it into a feedforward neural network module to perform a dimensionality reduction operation, obtaining a third vector;
and multiplying the third vector and the second vector, and performing layer normalization to obtain all transition sentence subsequences.
In some possible implementations, the module execution process of the masked multi-head self-attention mechanism module includes:
linearly transforming the input vector to obtain three matrices of query Q, key K and value V;
convolving the three matrices to respectively obtain corresponding word vectors;
performing dimensionality reduction operation on the word vector of the key K and the word vector of the value V;
and calculating the word vector of the query Q, the word vector of the key K and the word vector of the value V to obtain an output vector.
In some possible implementations, inputting the entire subsequence of transition sentences into the language model decision module includes:
determining the words corresponding to the sentence subsequence according to the number of subsets contained in the pronunciation phoneme sequence corresponding to the sentence subsequence:
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset matches a word, outputting the word;
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset matches a plurality of words, outputting the word with the maximum expected value;
if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, calculating a confusion value of the sentence subsequence based on the confusion degree.
In some possible implementations, calculating a confusion value for the sentence subsequence in accordance with the confusion includes:
matching words corresponding to a first subset of the sentence subsequence with words corresponding to a second subset to obtain a new subset, and calculating a first confusion value between the words based on the confusion degree to obtain a preset number of word combinations with the lowest first confusion value;
and matching the words corresponding to the new subset with the words corresponding to the next subset to obtain a further new subset, calculating a second confusion value to obtain a preset number of word combinations with the lowest second confusion value, and repeating until all the subsets are matched, thereby obtaining all target sentence sequences.
In another aspect, the present invention further provides a lip language identification system based on a subspace sparse attention mechanism, which includes a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the steps in the lip language identification method based on the subspace sparse attention mechanism according to any one of the above implementations.
In another aspect, the present invention further provides a computer-readable storage medium for storing a computer-readable program or instructions, where the program or instructions, when executed by a processor, can implement the steps in the lip language identification method based on the subspace sparse attention mechanism described in any one of the above implementation manners.
The beneficial effects of adopting the above embodiments are as follows: the lip language identification method based on the subspace sparse attention mechanism first obtains a lip region image sequence and extracts a lip feature sequence from it, then inputs the lip feature sequence into a fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence, and finally inputs the pronunciation phoneme sequence into a sentence inference model built with the subspace sparse attention mechanism to obtain a target sentence sequence. The invention uses a special attention mechanism to enhance context information and predicts a long sentence sequence in a single forward operation, thereby greatly improving inference speed and accuracy and providing technical support for related applications.
Drawings
FIG. 1 is a flowchart of an embodiment of the lip language identification method based on a subspace sparse attention mechanism according to the present invention;
FIG. 2 is a block diagram of an embodiment of the multi-head self-attention mechanism module with a mask according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the schematic drawings are not necessarily to scale. The flowcharts used in this invention illustrate operations performed in accordance with some embodiments of the present invention. It should be understood that the operations of the flowcharts may be performed out of order, and steps without logical dependency may be reversed in order or performed concurrently. One skilled in the art, under the direction of this summary, may add one or more other operations to, or remove one or more operations from, the flowcharts.
In the description of the embodiments of the present invention, "and/or" describes an association relationship between associated objects, meaning that three relationships may exist. For example, A and/or B may represent: A alone, both A and B, or B alone.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor systems and/or microcontroller systems.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
Fig. 1 is a schematic flowchart of an embodiment of the lip language identification method based on a subspace sparse attention mechanism provided by the present invention. As shown in Fig. 1, the lip language identification method based on the subspace sparse attention mechanism includes:
s101, a lip region image sequence is obtained, and a lip feature sequence is extracted and obtained on the basis of the lip region image sequence;
s102, inputting the lip feature sequence into a preset phoneme sequence extraction model with complete training to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence;
s103, inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence.
Compared with the prior art, the lip language identification method based on the subspace sparse attention mechanism provided by the embodiment of the invention first obtains a lip region image sequence and extracts a lip feature sequence from it, then inputs the lip feature sequence into a fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence, and finally inputs the pronunciation phoneme sequence into a sentence inference model built with the subspace sparse attention mechanism to obtain a target sentence sequence. Context information is enhanced by a special attention mechanism, and a long sentence sequence is predicted in a single forward operation, so that inference speed and accuracy are greatly improved and technical support is provided for related applications.
It can be understood that after the target sentence sequence is obtained, the purpose of converting the vocal features of the lip region into characters is achieved, and therefore lip language recognition is achieved through character recognition.
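For orientation, the overall flow of steps S101 to S103 can be sketched as follows; the three callables stand in for the components detailed in the remainder of this description, and their interfaces are illustrative assumptions rather than part of the patent.

```python
# End-to-end sketch of steps S101-S103 (interfaces assumed for illustration).
def recognize_lip_language(video_frames, extract_lip_features,
                           phoneme_model, sentence_model):
    lip_seq = extract_lip_features(video_frames)   # S101: lip feature sequence
    phonemes = phoneme_model(lip_seq)              # S102: pronunciation phoneme sequence
    return sentence_model(phonemes)                # S103: target sentence sequence
```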
In step S101, obtaining the lip region image sequence and extracting the lip feature sequence based on the lip region image sequence includes (a sketch follows the steps below):
cutting the collected video sequence data to a preset length, and resampling it to a preset number of frames;
performing face detection on each frame of image in turn, then performing key point detection on the detected face image, determining the mouth corner position in each face image according to the obtained face key point annotations, and cropping to obtain the corresponding lip region image, where the lip region images together form the lip region image sequence;
and calculating the offset and rotation factor of each lip region image after alignment with a preset standard image to obtain the lip feature vector corresponding to that lip region image, and concatenating these vectors in sequence to obtain the corresponding lip feature sequence.
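As an illustration only, the following Python sketch shows one plausible realization of this preprocessing using dlib's 68-point landmark model and OpenCV; the mouth landmark indices (48 to 67), the template points, the crop size, and the model path are assumptions, not details fixed by the patent.

```python
# A minimal sketch of the S101 preprocessing under the assumptions stated above.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_feature_sequence(frames, template, size=(88, 88)):
    """Crop the lip region per frame; encode offset/rotation vs. a standard template."""
    feats = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            continue                      # skip frames with no detected face
        shape = predictor(gray, faces[0])
        mouth = np.array([(shape.part(i).x, shape.part(i).y)
                          for i in range(48, 68)], dtype=np.float32)
        x, y, w, h = cv2.boundingRect(mouth)
        lip = cv2.resize(frame[y:y + h, x:x + w], size)
        # A similarity transform against the standard template yields the
        # offset (translation) and rotation factor described above.
        M, _ = cv2.estimateAffinePartial2D(mouth, template)
        offset = M[:, 2]
        rotation = np.arctan2(M[1, 0], M[0, 0])
        feats.append(np.concatenate([lip.ravel(), offset, [rotation]]))
    return np.stack(feats)               # (T, D): the lip feature sequence
```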
It should be noted that the long-short term memory network, i.e., the LSTM model, is a time-recursive neural network, and is specifically designed to solve the long-term dependence problem of the general Recurrent Neural Network (RNN). In some embodiments of the present invention, determining the trained phoneme sequence extraction model comprises:
initializing an LSTM model, taking a lip feature sequence corresponding to a lip region image sample as a training sample, and inputting the training sample into the LSTM model to obtain a prediction result of a pronunciation phoneme sequence;
obtaining a value of an LSTM model loss function according to the training sample and the prediction result;
and obtaining the fully trained phoneme sequence extraction model according to the value of the LSTM model loss function, as sketched below.
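A minimal PyTorch sketch of this training loop follows. The patent does not name the loss function or the network sizes; the CTC loss and the dimensions below are illustrative assumptions.

```python
# Sketch of the phoneme-extraction training step (loss and sizes assumed).
import torch
import torch.nn as nn

class PhonemeLSTM(nn.Module):
    def __init__(self, input_dim=512, hidden=256, n_phonemes=40):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for CTC blank

    def forward(self, lip_seq):                  # (B, T, input_dim)
        out, _ = self.lstm(lip_seq)
        return self.head(out).log_softmax(-1)    # (B, T, n_phonemes + 1)

model = PhonemeLSTM()
ctc = nn.CTCLoss(blank=0)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(lip_seq, targets, input_lengths, target_lengths):
    log_probs = model(lip_seq).transpose(0, 1)   # CTC expects (T, B, C)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()  # the loss value indicates when training is complete
```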
In some embodiments of the present invention, the sentence inference model built with the subspace sparse self-attention mechanism includes a sentence inference network module, a language model judgment module and an inference sentence sequence module; inputting the pronunciation phoneme sequence into the sentence inference model with the subspace sparse self-attention mechanism to obtain the target sentence sequence comprises:
inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences;
inputting all the transition sentence subsequences into the language model judgment module, and calculating confusion values of all the transition sentence subsequences according to the confusion degree;
and selecting the transition sentence subsequence with the minimum confusion value based on the inference sentence sequence module to obtain a predicted target sentence sequence.
For long sentence inference, the network must be structured to achieve a good inference effect. In some embodiments of the present invention, the sentence inference network module comprises a plurality of sentence subsequence inference submodules, each of which comprises a multi-head self-attention mechanism module with a mask and a feed-forward network module; inputting the pronunciation phoneme sequence into the sentence inference network module and performing sentence inference to obtain all transition sentence subsequences includes:
and converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on a multi-head self-attention mechanism module with a mask and a feedforward network module in each sentence subsequence inference submodule.
It should be noted that the masked multi-head self-attention mechanism module includes a plurality of self-attention heads; the masked multi-head self-attention mechanism hides future information by using a mask layer, and the feedforward neural network is implemented by two one-dimensional convolutions and layer normalization. In some embodiments of the present invention, using the multi-head self-attention mechanism module with a mask and the feedforward network module in each sentence subsequence inference submodule to convert the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence includes:
processing the phoneme sequence by using a multi-head self-attention mechanism module with a mask to obtain a first vector;
multiplying the first vector by the vector corresponding to the pronunciation phoneme sequence to obtain a second vector;
performing layer normalization on the second vector and inputting it into a feedforward neural network module to perform a dimensionality reduction operation, obtaining a third vector;
and multiplying the third vector by the second vector and performing layer normalization to obtain all transition sentence subsequences (assembled in the sketch below).
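The following PyTorch sketch assembles these four steps into one sentence subsequence inference sub-module. The multiplicative residual connections follow the wording above; the dimensions and the two-convolution feed-forward design (which reduces and then restores the dimension) are assumptions, not patent specifications.

```python
# Sketch of one sentence-subsequence inference sub-module (sizes assumed).
import torch
import torch.nn as nn

class SubsequenceInferenceBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(            # two 1-D convolutions, per the text
            nn.Conv1d(dim, dim // 2, 1), nn.ELU(), nn.Conv1d(dim // 2, dim, 1))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (B, T, dim) phoneme vectors
        T = x.size(1)
        # Mask layer hides future information (upper triangle is masked).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        first, _ = self.mhsa(x, x, x, attn_mask=mask)  # first vector
        second = self.norm1(first * x)        # multiply by input, layer norm
        third = self.ffn(second.transpose(1, 2)).transpose(1, 2)  # third vector
        return self.norm2(third * second)     # multiply and layer norm
```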
In some embodiments of the present invention, the module execution process of the masked multi-head self-attention mechanism module comprises:
linearly transforming the input vector to obtain three matrices of query Q, key K and value V;
convolving the three matrices to respectively obtain corresponding word vectors;
performing dimensionality reduction operation on the word vector of the key K and the word vector of the value V;
and calculating the word vector of the query Q, the word vector of the key K and the word vector of the value V to obtain an output vector.
In an embodiment of the present invention, fig. 2 is a block diagram of an embodiment of a multi-head self-attention mechanism module with a mask provided in the present invention, where a calculation flow in each head self-attention block is as follows:
Step one, the input vector is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels and $H$ and $W$ represent the height and width of the space, respectively. Three one-dimensional convolutions $W_\phi$, $W_\theta$, $W_\gamma$, corresponding to the three matrices Q, K and V, directly generate three different embedded representations $\phi, \theta, \gamma \in \mathbb{R}^{\hat{C} \times H \times W}$, where $\hat{C}$ is the dimension of the embedded representation. The generation process formula is as follows:
$\phi = W_\phi(X), \quad \theta = W_\theta(X), \quad \gamma = W_\gamma(X)$
Step two, the three obtained embedded representations $\phi$, $\theta$ and $\gamma$ are folded into size $\hat{C} \times N$, where $N = H \cdot W$.
Step three, a dimensionality reduction operation is performed on the embedded representations $\theta$ and $\gamma$, reducing their dimension from $N$ to $S$. Rather than inputting all points, representative points are sampled from $\theta$ and $\gamma$, so that the output size is reduced to half of the original. The expression formula is as follows:
$X_S = \mathrm{Maxpool}(\mathrm{ELU}(\mathrm{Conv1d}(X_N)))$
where $\mathrm{Conv1d}(\cdot)$ is a 1-dimensional convolution, $\mathrm{ELU}(\cdot)$ is an activation function, and $\mathrm{Maxpool}(\cdot)$ is a max-pooling layer with stride 2; a few key features are selected to further reduce the dimension and extract the priority features that dominate the result.
Step four, the embedded representation $\phi$ and the dimension-reduced embedded representations $\theta$ and $\gamma$ are combined to obtain the final output representation vector $O$. The steps are as follows: first, the embedded representations $\phi$ and $\theta$ generate an attention matrix $A$, normalized with a softmax:
$A = \mathrm{softmax}(\phi^{\top} \theta) \in \mathbb{R}^{N \times S}$
Next, the attention matrix $A$ is multiplied by the embedded representation $\gamma$, resulting in the final output representation vector $O$:
$O = A\,\gamma^{\top} \in \mathbb{R}^{N \times \hat{C}}$
According to this calculation process, the dimension $S$ of the reduced embedded representations $\theta$ and $\gamma$ is set far smaller than the dimension $N$. When $N \gg S$, the output dimension of the sparse attention computation is guaranteed to remain unchanged. The internal process can be described by the following formula:
$O = \mathrm{softmax}\big(\phi(X)^{\top}\, \theta_S(X)\big)\, \gamma_S(X)^{\top}$
where $\theta_S$ and $\gamma_S$ denote the dimension-reduced key and value embeddings.
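A self-contained PyTorch sketch of one such self-attention head follows, assuming $\hat{C} = C/2$, a kernel size of 3 for the reduction convolution, and a single stride-2 pooling step, none of which are fixed by the text.

```python
# Sketch of one subspace sparse self-attention head per the four steps above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAttentionHead(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c_hat = channels // 2                   # assumed embedded dimension
        # Step 1: three 1-D convolutions generate phi (Q), theta (K), gamma (V).
        self.w_phi = nn.Conv1d(channels, c_hat, 1)
        self.w_theta = nn.Conv1d(channels, c_hat, 1)
        self.w_gamma = nn.Conv1d(channels, c_hat, 1)
        # Step 3: Conv1d + ELU + stride-2 max-pool reduces N -> S on K and V.
        self.reduce = nn.Sequential(nn.Conv1d(c_hat, c_hat, 3, padding=1),
                                    nn.ELU(), nn.MaxPool1d(2))

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x.view(b, c, h * w)                 # Step 2: fold to N = H * W
        phi = self.w_phi(x)                     # (B, c_hat, N)
        theta = self.reduce(self.w_theta(x))    # (B, c_hat, S), S << N
        gamma = self.reduce(self.w_gamma(x))    # (B, c_hat, S)
        # Step 4: A = softmax(phi^T theta) is (B, N, S); O = A gamma^T keeps
        # the N x c_hat output shape regardless of S.
        attn = F.softmax(phi.transpose(1, 2) @ theta, dim=-1)
        return attn @ gamma.transpose(1, 2)     # (B, N, c_hat)
```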
It should be noted that the confusion (perplexity) is used to evaluate the quality of a language model; the confusion is the value obtained by applying an exponential operation to the cross entropy loss function. In some embodiments of the invention, inputting all the transition sentence subsequences into the language model judgment module comprises:
determining the words corresponding to the sentence subsequence according to the number of subsets contained in the pronunciation phoneme sequence corresponding to the sentence subsequence:
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a word, outputting the word;
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a plurality of words, outputting the word with the maximum expected value;
if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, calculating a confusion value of the sentence subsequence based on the confusion degree.
In a specific embodiment of the present invention, if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, then in the first iteration the words matching the first two subsets are combined in every possible way based on the following rules (see the sketch after this list):
1) the 50 combinations with the lowest confusion value are retained;
2) these combinations are then matched against the next subset of phonemes;
3) the 50 combinations with the lowest confusion value are again retained, and the iteration continues over the remaining subsets until the end of the sequence.
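As an illustration, this iteration can be sketched as follows; `lexicon` (mapping a phoneme subset to candidate words) and `perplexity` (scoring a word combination under the language model) are hypothetical interfaces, not named by the patent.

```python
# Sketch of the iterative matching rules above: extend combinations subset
# by subset, keeping only the `beam` (here 50) lowest-perplexity combinations.
import heapq

def infer_sentence(subsets, lexicon, perplexity, beam=50):
    combos = [[w] for w in lexicon(subsets[0])]           # words for the first subset
    for subset in subsets[1:]:
        combos = [c + [w] for c in combos for w in lexicon(subset)]
        combos = heapq.nsmallest(beam, combos, key=perplexity)  # rules 1) and 3)
    return min(combos, key=perplexity)                    # predicted target sentence
```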
In some embodiments of the invention, calculating a confusion value for the sentence subsequence in accordance with the confusion comprises:
matching words corresponding to a first subset of the sentence subsequence with words corresponding to a second subset to obtain a new subset, and calculating a first confusion value between the words based on the confusion degree to obtain a preset number of word combinations with the lowest first confusion value;
and matching the words corresponding to the new subset with the words corresponding to the next subset to obtain a further new subset, calculating a second confusion value to obtain a preset number of word combinations with the lowest second confusion value, and repeating until all the subsets are matched, thereby obtaining all target sentence sequences.
In a specific embodiment of the present invention, the confusion value calculation steps are as follows:
Firstly, the probability relation between the pronunciation phoneme sequence and the corresponding sentence is obtained, and then the word combination with the maximum probability is selected to obtain the inferred sentence. The probability relation between the pronunciation phoneme sequence and the corresponding sentence is as follows:
$P = \{p_1, p_2, \ldots, p_M\}$
$P(W_C \mid P) = \prod_{i=1}^{M} P(w_i \mid p_i)$
where $P$ is the phoneme sequence, $p_i$ corresponds to the $i$-th subset, $W_C$ represents any given combination of words, and $w_i$ corresponds to each matched word in the word string.
Then, the process of selecting the word combination with the highest probability to obtain the reasoning sentence is as follows:
Figure BDA0003972642340000121
Figure BDA0003972642340000122
Figure BDA0003972642340000123
is the combination with the highest probability that takes into account the subset of phonemes of each combination C belonging to the set of combinations C *
Entropy of information and P (w) for each word 1 ,w 2 ,…,w N ) In which (w) 1 ,w 2 ,…,w N ) Belonging to a word set W, and summing possible word sequences, i.e. transition sentence subsequences
Figure BDA0003972642340000124
The method comprises the following specific steps:
Figure BDA0003972642340000125
Figure BDA0003972642340000126
in the formula, N represents the number of words, and PP represents a confusion value.
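A minimal sketch of this computation follows, assuming a language model that exposes per-word conditional log2-probabilities (a hypothetical interface).

```python
# Confusion (perplexity) value: PP = 2^H with
# H = -(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}); the chain rule expands
# P(w_1..w_N) into the conditional probabilities queried below.
def perplexity(words, log2_prob):
    n = len(words)
    entropy = -sum(log2_prob(words[:i], words[i]) for i in range(n)) / n
    return 2 ** entropy   # a lower PP indicates a more plausible sentence
```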
Accordingly, the embodiment of the present application further provides a lip language recognition system based on a subspace sparse attention mechanism, which includes a processor and a memory connected to each other, where the processor is programmed or configured to execute the steps or functions in the lip language recognition method based on a subspace sparse attention mechanism provided in the foregoing method embodiments.
In summary, the lip language identification method, system and computer-readable storage medium based on the subspace sparse attention mechanism provided by the invention first obtain a lip region image sequence and extract a lip feature sequence from it, then input the lip feature sequence into a fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence, and finally input the pronunciation phoneme sequence into a sentence inference model built with the subspace sparse attention mechanism to obtain a target sentence sequence. The invention uses a special attention mechanism to enhance context information and predicts a long sentence sequence in a single forward operation, thereby greatly improving inference speed and accuracy and providing technical support for related applications.
Accordingly, the present application further provides a computer-readable storage medium, which is used for storing a computer-readable program or instruction, and when the program or instruction is executed by a processor, the step or the function in the lip language identification method based on the subspace sparse attention mechanism provided by the above method embodiments can be implemented.
Those skilled in the art will appreciate that all or part of the processes for implementing the methods of the embodiments described above can be implemented by a computer program stored in a computer-readable storage medium which, when executed, instructs the relevant hardware. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims (10)

1. A lip language identification method based on a subspace sparse attention mechanism is characterized by comprising the following steps:
obtaining a lip region image sequence, and extracting a lip feature sequence based on the lip region image sequence;
inputting the lip feature sequence into a preset, fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence;
and inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence.
2. The method for identifying lip language based on subspace sparse attention mechanism according to claim 1, wherein determining the trained phoneme sequence extraction model comprises:
initializing an LSTM model, taking a lip feature sequence corresponding to a lip region image sample as a training sample, and inputting the training sample into the LSTM model to obtain a prediction result of a pronunciation phoneme sequence;
obtaining the value of an LSTM model loss function according to the training sample and the prediction result;
and obtaining the fully trained phoneme sequence extraction model according to the value of the LSTM model loss function.
3. The lip language identification method based on the subspace sparse attention mechanism as claimed in claim 1, wherein the sentence inference model with the subspace sparse attention mechanism comprises a sentence inference network module, a language model decision module and an inference sentence sequence module; the step of inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence comprises the following steps:
inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences;
inputting all the transition sentence subsequences into the language model judgment module, and calculating confusion values of all the transition sentence subsequences according to the confusion degree;
and selecting the transition sentence subsequence with the minimum confusion value based on the inference sentence sequence module to obtain a predicted target sentence sequence.
4. The lip language identification method based on the subspace sparse attention mechanism as claimed in claim 3, wherein the sentence inference network module comprises a plurality of sentence subsequence inference sub-modules, each of which comprises a multi-head self-attention mechanism module with a mask and a feed-forward network module; and inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences comprises:
and converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on a multi-head self-attention mechanism module with a mask and a feedforward network module in each sentence subsequence inference submodule.
5. The method for identifying lip language based on the subspace sparse attention mechanism as claimed in claim 4, wherein converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on the multi-head self-attention mechanism module with a mask and the feedforward network module in each sentence subsequence inference submodule comprises:
processing the phoneme sequence by using a multi-head self-attention mechanism module with a mask to obtain a first vector;
multiplying the first vector by the vector corresponding to the pronunciation phoneme sequence to obtain a second vector;
performing layer normalization on the second vector and inputting it into a feedforward neural network module to perform a dimensionality reduction operation, obtaining a third vector;
and multiplying the third vector and the second vector, and performing layer normalization to obtain all transition sentence subsequences.
6. The lip language identification method based on the subspace sparse attention mechanism as claimed in claim 5, wherein the module execution process of the multi-head self-attention mechanism module with the mask comprises:
linearly transforming the input vector to obtain three matrices of query Q, key K and value V;
convolving the three matrices to respectively obtain corresponding word vectors;
performing dimensionality reduction operation on the word vector of the key K and the word vector of the value V;
and calculating the word vector of the query Q, the word vector of the key K and the word vector of the value V to obtain an output vector.
7. The method according to claim 3, wherein the inputting the entire subsequence of transition sentences into the language model determination module comprises:
determining the words corresponding to the sentence subsequence according to the number of subsets contained in the pronunciation phoneme sequence corresponding to the sentence subsequence:
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a word, outputting the word;
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a plurality of words, outputting the word with the maximum expected value;
if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, calculating a confusion value of the sentence subsequence based on the confusion degree.
8. The method for lip language recognition based on subspace sparse attention mechanism according to claim 1, wherein calculating a confusion value of the sentence subsequence according to a confusion degree comprises:
matching words corresponding to the first subset of the sentence subsequence with words corresponding to the second subset to obtain a new subset, and calculating a first confusion value between the words based on the confusion degree to obtain a preset number of word combinations with the lowest first confusion value;
and matching the words corresponding to the new subset with the words corresponding to the next subset to obtain a further new subset, calculating a second confusion value to obtain a preset number of word combinations with the lowest second confusion value, and repeating until all the subsets are matched, thereby obtaining all target sentence sequences.
9. A lip language identification system based on a subspace sparse attention mechanism, comprising a processor and a memory connected to each other, characterized in that the processor is programmed or configured to perform the steps of the lip language identification method based on a subspace sparse attention mechanism as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium for storing a computer-readable program or instructions, which when executed by a processor, is capable of implementing the steps in the method for identifying lip language based on a subspace sparse attention mechanism as claimed in any one of claims 1 to 8.
CN202211518304.8A 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism Pending CN115910065A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211518304.8A CN115910065A (en) 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211518304.8A CN115910065A (en) 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism

Publications (1)

Publication Number Publication Date
CN115910065A true CN115910065A (en) 2023-04-04

Family

ID=86480372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211518304.8A Pending CN115910065A (en) 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism

Country Status (1)

Country Link
CN (1) CN115910065A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132682A (en) * 2024-05-07 2024-06-04 山东浪潮科学研究院有限公司 Large language model reasoning acceleration method and device based on sparse sliding window



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination