CN115910065A - Lip language identification method, system and medium based on subspace sparse attention mechanism


Info

Publication number
CN115910065A
CN115910065A (application CN202211518304.8A)
Authority
CN
China
Prior art keywords: sentence, sequence, attention mechanism, vector, lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211518304.8A
Other languages
Chinese (zh)
Inventor
陈亚雄
赵怡晨
路雄博
邓梦涵
熊盛武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Research Institute Of Wuhan University Of Technology
Original Assignee
Chongqing Research Institute Of Wuhan University Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Research Institute Of Wuhan University Of Technology filed Critical Chongqing Research Institute Of Wuhan University Of Technology
Priority to CN202211518304.8A
Publication of CN115910065A
Legal status: Pending (current)

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a lip language identification method, system and medium based on a subspace sparse attention mechanism. The method comprises the following steps: obtaining a lip region image sequence, and extracting a lip feature sequence based on the lip region image sequence; inputting the lip feature sequence into a preset, fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence; and inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence. The invention enhances context information by constructing a special attention mechanism and predicts a long sentence sequence in a single forward operation, greatly improving inference speed and accuracy.

Description

Lip language identification method, system and medium based on subspace sparse attention mechanism
Technical Field
The invention relates to the technical fields of computer vision and natural language processing, in particular to a lip language identification method and system based on a subspace sparse attention mechanism and a computer readable storage medium.
Background
Lip language identification (also called lip reading) refers to the process of obtaining a speaker's spoken content by interpreting and analyzing the movement of the speaker's lips. Its aim is to supplement auditory information with visual information so that what people express can be obtained accurately even when hearing is impaired. Lip language recognition usually takes silent video as input and outputs speech or text content. Lip language identification technology plays an extremely important role in fields such as assistance for deaf and mute people, daily-life services, and public security.
The development of deep learning has laid a solid foundation for lip language recognition technology. However, as the length of the input video increases, deep models become difficult to train: a specific feature learning method is usually needed, the inference process is complex and computationally time-consuming, and a large number of model parameters must be optimized. Therefore, as demand for lip language identification grows rapidly, improving recognition speed and accuracy as the recognized video length increases has become a critical task facing lip language identification.
Disclosure of Invention
In view of the above, it is desirable to provide a lip language identification method, system and medium based on a subspace sparse attention mechanism, so as to solve the problem that long sentence sequences make the training process difficult to converge quickly.
In one aspect, the invention provides a lip language identification method based on a subspace sparse attention mechanism, which comprises the following steps:
obtaining a lip region image sequence, and extracting a lip feature sequence based on the lip region image sequence;
inputting the lip feature sequence into a preset, fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence;
and inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence.
In some possible implementations, determining the trained phoneme sequence extraction model includes:
initializing an LSTM model, taking a lip feature sequence corresponding to a lip region image sample as a training sample, and inputting the training sample into the LSTM model to obtain a prediction result of a pronunciation phoneme sequence;
obtaining the value of an LSTM model loss function according to the training sample and the prediction result;
and obtaining the fully trained phoneme sequence extraction model according to the value of the LSTM model loss function.
In some possible implementation manners, the sentence inference model with the subspace sparse self-attention mechanism comprises a sentence inference network module, a language model judgment module and an inference sentence sequence module; the inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence comprises:
inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences;
inputting all the transition sentence subsequences into the language model judgment module, and calculating confusion values of all the transition sentence subsequences according to the confusion degree;
and selecting the transition sentence subsequence with the minimum confusion value based on the inference sentence sequence module to obtain a predicted target sentence sequence.
In some possible implementations, the sentence inference network module includes a plurality of sentence subsequence inference sub-modules, each of which includes a multi-head self-attention mechanism module with a mask and a feed-forward network module; inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences includes:
and converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on a multi-head self-attention mechanism module with a mask and a feedforward network module in each sentence subsequence inference submodule.
In some possible implementations, the converting the vector corresponding to the pronunciation phoneme sequence into the corresponding sentence subsequence based on the multi-headed self-attention mechanism module with mask and the feedforward network module in each sentence subsequence inference sub-module includes:
processing the phoneme sequence by using a multi-head self-attention mechanism module with a mask to obtain a first vector;
multiplying the first vector by the vector corresponding to the pronunciation phoneme sequence to obtain a second vector;
performing layer normalization on the second vector and inputting it into a feedforward neural network module to perform a dimensionality reduction operation, obtaining a third vector;
and multiplying the third vector and the second vector, and performing layer normalization to obtain all transition sentence subsequences.
In some possible implementations, the module execution process of the masked multi-head self-attention mechanism module includes:
linearly transforming the input vector to obtain three matrices of query Q, key K and value V;
convolving the three matrices to respectively obtain corresponding word vectors;
performing dimensionality reduction operation on the word vector of the key K and the word vector of the value V;
and calculating the word vector of the query Q, the word vector of the key K and the word vector of the value V to obtain an output vector.
In some possible implementations, inputting the entire subsequence of transition sentences into the language model decision module includes:
determining the words corresponding to the sentence subsequence according to the number of subsets contained in the pronunciation phoneme sequence corresponding to the sentence subsequence:
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset matches a word, outputting the word;
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset matches a plurality of words, outputting the word with the maximum expected value;
if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, calculating a confusion value of the sentence subsequence based on the confusion degree.
In some possible implementations, calculating a confusion value for the sentence subsequence in accordance with the confusion includes:
matching words corresponding to a first subset of the sentence subsequence with words corresponding to a second subset to obtain a new subset, and calculating a first confusion value between the words based on the confusion degree to obtain a preset number of word combinations with the lowest first confusion value;
and matching the words corresponding to the new subset with the words corresponding to the next subset to obtain a further new subset, calculating a second confusion value to obtain a preset number of word combinations with the lowest second confusion value, and repeating until all the subsets are matched, thereby obtaining all target sentence sequences.
In another aspect, the present invention further provides a lip language identification system based on a subspace sparse attention mechanism, which includes a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the steps in the lip language identification method based on the subspace sparse attention mechanism according to any one of the above implementations.
In another aspect, the present invention further provides a computer-readable storage medium for storing a computer-readable program or instructions, where the program or instructions, when executed by a processor, can implement the steps in the lip language identification method based on the subspace sparse attention mechanism described in any one of the above implementation manners.
The beneficial effects of adopting the above embodiments are as follows: the lip language identification method based on the subspace sparse attention mechanism first obtains a lip region image sequence and extracts a lip feature sequence from it, then inputs the lip feature sequence into a fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence, and finally inputs the pronunciation phoneme sequence into a sentence inference model built with the subspace sparse attention mechanism to obtain a target sentence sequence. The invention uses a special attention mechanism to enhance context information and predicts a long sentence sequence in a single forward operation, thereby greatly improving inference speed and accuracy and providing technical support for related applications.
Drawings
FIG. 1 is a flowchart of an embodiment of the lip language identification method based on a subspace sparse attention mechanism according to the present invention;
FIG. 2 is a block diagram of an embodiment of the multi-head self-attention mechanism module with a mask according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the schematic drawings are not necessarily to scale. The flowcharts used in this invention illustrate operations performed in accordance with some embodiments of the present invention. It should be understood that the operations of the flowcharts may be performed out of order, and steps without logical dependency may be reversed in order or performed concurrently. One skilled in the art, under the direction of this summary, may add one or more other operations to, or remove one or more operations from, the flowcharts.
In the description of the embodiments of the present invention, "and/or" describes an association relationship between associated objects, meaning that three relationships may exist. For example, A and/or B may represent: A alone, both A and B, or B alone.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor systems and/or microcontroller systems.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
Fig. 1 is a schematic flowchart of an embodiment of the lip language identification method based on a subspace sparse attention mechanism provided by the present invention. As shown in Fig. 1, the lip language identification method based on the subspace sparse attention mechanism includes:
s101, a lip region image sequence is obtained, and a lip feature sequence is extracted and obtained on the basis of the lip region image sequence;
s102, inputting the lip feature sequence into a preset phoneme sequence extraction model with complete training to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence;
s103, inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence.
Compared with the prior art, the lip language identification method based on the subspace sparse attention mechanism provided by the embodiment of the invention first obtains a lip region image sequence and extracts a lip feature sequence from it, then inputs the lip feature sequence into a fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence, and finally inputs the pronunciation phoneme sequence into a sentence inference model built with the subspace sparse attention mechanism to obtain a target sentence sequence. Context information is enhanced by a special attention mechanism, and a long sentence sequence is predicted in a single forward operation, so that inference speed and accuracy are greatly improved and technical support is provided for related applications.
It can be understood that after the target sentence sequence is obtained, the purpose of converting the vocal features of the lip region into characters is achieved, and therefore lip language recognition is achieved through character recognition.
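For orientation, the overall flow of steps S101 to S103 can be sketched as follows; the three callables stand in for the components detailed in the remainder of this description, and their interfaces are illustrative assumptions rather than part of the patent.

```python
# End-to-end sketch of steps S101-S103 (interfaces assumed for illustration).
def recognize_lip_language(video_frames, extract_lip_features,
                           phoneme_model, sentence_model):
    lip_seq = extract_lip_features(video_frames)   # S101: lip feature sequence
    phonemes = phoneme_model(lip_seq)              # S102: pronunciation phoneme sequence
    return sentence_model(phonemes)                # S103: target sentence sequence
```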
In step S101, obtaining the lip region image sequence and extracting the lip feature sequence based on the lip region image sequence includes (a sketch follows the steps below):
cutting the collected video sequence data to a preset length, and resampling it to a preset number of frames;
performing face detection on each frame of image in turn, then performing key point detection on the detected face image, determining the mouth corner position in each face image according to the obtained face key point annotations, and cropping to obtain the corresponding lip region image, where the lip region images together form the lip region image sequence;
and calculating the offset and rotation factor of each lip region image after alignment with a preset standard image to obtain the lip feature vector corresponding to that lip region image, and concatenating these vectors in sequence to obtain the corresponding lip feature sequence.
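As an illustration only, the following Python sketch shows one plausible realization of this preprocessing using dlib's 68-point landmark model and OpenCV; the mouth landmark indices (48 to 67), the template points, the crop size, and the model path are assumptions, not details fixed by the patent.

```python
# A minimal sketch of the S101 preprocessing under the assumptions stated above.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_feature_sequence(frames, template, size=(88, 88)):
    """Crop the lip region per frame; encode offset/rotation vs. a standard template."""
    feats = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            continue                      # skip frames with no detected face
        shape = predictor(gray, faces[0])
        mouth = np.array([(shape.part(i).x, shape.part(i).y)
                          for i in range(48, 68)], dtype=np.float32)
        x, y, w, h = cv2.boundingRect(mouth)
        lip = cv2.resize(frame[y:y + h, x:x + w], size)
        # A similarity transform against the standard template yields the
        # offset (translation) and rotation factor described above.
        M, _ = cv2.estimateAffinePartial2D(mouth, template)
        offset = M[:, 2]
        rotation = np.arctan2(M[1, 0], M[0, 0])
        feats.append(np.concatenate([lip.ravel(), offset, [rotation]]))
    return np.stack(feats)               # (T, D): the lip feature sequence
```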
It should be noted that the long-short term memory network, i.e., the LSTM model, is a time-recursive neural network, and is specifically designed to solve the long-term dependence problem of the general Recurrent Neural Network (RNN). In some embodiments of the present invention, determining the trained phoneme sequence extraction model comprises:
initializing an LSTM model, taking a lip feature sequence corresponding to a lip region image sample as a training sample, and inputting the training sample into the LSTM model to obtain a prediction result of a pronunciation phoneme sequence;
obtaining a value of an LSTM model loss function according to the training sample and the prediction result;
and obtaining the fully trained phoneme sequence extraction model according to the value of the LSTM model loss function, as sketched below.
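A minimal PyTorch sketch of this training loop follows. The patent does not name the loss function or the network sizes; the CTC loss and the dimensions below are illustrative assumptions.

```python
# Sketch of the phoneme-extraction training step (loss and sizes assumed).
import torch
import torch.nn as nn

class PhonemeLSTM(nn.Module):
    def __init__(self, input_dim=512, hidden=256, n_phonemes=40):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for CTC blank

    def forward(self, lip_seq):                  # (B, T, input_dim)
        out, _ = self.lstm(lip_seq)
        return self.head(out).log_softmax(-1)    # (B, T, n_phonemes + 1)

model = PhonemeLSTM()
ctc = nn.CTCLoss(blank=0)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(lip_seq, targets, input_lengths, target_lengths):
    log_probs = model(lip_seq).transpose(0, 1)   # CTC expects (T, B, C)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()  # the loss value indicates when training is complete
```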
In some embodiments of the present invention, the sentence inference model built with the subspace sparse self-attention mechanism includes a sentence inference network module, a language model judgment module and an inference sentence sequence module; inputting the pronunciation phoneme sequence into the sentence inference model with the subspace sparse self-attention mechanism to obtain the target sentence sequence comprises:
inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences;
inputting all the transition sentence subsequences into the language model judgment module, and calculating confusion values of all the transition sentence subsequences according to the confusion degree;
and selecting the transition sentence subsequence with the minimum confusion value based on the inference sentence sequence module to obtain a predicted target sentence sequence.
For long sentence inference, the network must be structured to achieve a good inference effect. In some embodiments of the present invention, the sentence inference network module comprises a plurality of sentence subsequence inference submodules, each of which comprises a multi-head self-attention mechanism module with a mask and a feed-forward network module; inputting the pronunciation phoneme sequence into the sentence inference network module and performing sentence inference to obtain all transition sentence subsequences includes:
and converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on a multi-head self-attention mechanism module with a mask and a feedforward network module in each sentence subsequence inference submodule.
It should be noted that the masked multi-head self-attention mechanism module includes a plurality of self-attention heads; the masked multi-head self-attention mechanism hides future information by using a mask layer, and the feedforward neural network is implemented by two one-dimensional convolutions and layer normalization. In some embodiments of the present invention, using the multi-head self-attention mechanism module with a mask and the feedforward network module in each sentence subsequence inference submodule to convert the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence includes:
processing the phoneme sequence by using a multi-head self-attention mechanism module with a mask to obtain a first vector;
multiplying the first vector by the vector corresponding to the pronunciation phoneme sequence to obtain a second vector;
performing layer normalization on the second vector and inputting it into a feedforward neural network module to perform a dimensionality reduction operation, obtaining a third vector;
and multiplying the third vector by the second vector and performing layer normalization to obtain all transition sentence subsequences (assembled in the sketch below).
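The following PyTorch sketch assembles these four steps into one sentence subsequence inference sub-module. The multiplicative residual connections follow the wording above; the dimensions and the two-convolution feed-forward design (which reduces and then restores the dimension) are assumptions, not patent specifications.

```python
# Sketch of one sentence-subsequence inference sub-module (sizes assumed).
import torch
import torch.nn as nn

class SubsequenceInferenceBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(            # two 1-D convolutions, per the text
            nn.Conv1d(dim, dim // 2, 1), nn.ELU(), nn.Conv1d(dim // 2, dim, 1))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (B, T, dim) phoneme vectors
        T = x.size(1)
        # Mask layer hides future information (upper triangle is masked).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        first, _ = self.mhsa(x, x, x, attn_mask=mask)  # first vector
        second = self.norm1(first * x)        # multiply by input, layer norm
        third = self.ffn(second.transpose(1, 2)).transpose(1, 2)  # third vector
        return self.norm2(third * second)     # multiply and layer norm
```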
In some embodiments of the present invention, the module execution process of the masked multi-head self-attention mechanism module comprises:
linearly transforming the input vector to obtain three matrices of query Q, key K and value V;
convolving the three matrices to respectively obtain corresponding word vectors;
performing dimensionality reduction operation on the word vector of the key K and the word vector of the value V;
and calculating the word vector of the query Q, the word vector of the key K and the word vector of the value V to obtain an output vector.
In an embodiment of the present invention, fig. 2 is a block diagram of an embodiment of a multi-head self-attention mechanism module with a mask provided in the present invention, where a calculation flow in each head self-attention block is as follows:
Step one, the input vector is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels and $H$ and $W$ represent the height and width of the space, respectively. Three one-dimensional convolutions $W_\phi$, $W_\theta$, $W_\gamma$, corresponding to the three matrices Q, K and V, directly generate three different embedded representations $\phi, \theta, \gamma \in \mathbb{R}^{\hat{C} \times H \times W}$, where $\hat{C}$ is the dimension of the embedded representation. The generation process formula is as follows:
$\phi = W_\phi(X), \quad \theta = W_\theta(X), \quad \gamma = W_\gamma(X)$
Step two, the three obtained embedded representations $\phi$, $\theta$ and $\gamma$ are folded into size $\hat{C} \times N$, where $N = H \cdot W$.
Step three, a dimensionality reduction operation is performed on the embedded representations $\theta$ and $\gamma$, reducing their dimension from $N$ to $S$. Rather than inputting all points, representative points are sampled from $\theta$ and $\gamma$, so that the output size is reduced to half of the original. The expression formula is as follows:
$X_S = \mathrm{Maxpool}(\mathrm{ELU}(\mathrm{Conv1d}(X_N)))$
where $\mathrm{Conv1d}(\cdot)$ is a 1-dimensional convolution, $\mathrm{ELU}(\cdot)$ is an activation function, and $\mathrm{Maxpool}(\cdot)$ is a max-pooling layer with stride 2; a few key features are selected to further reduce the dimension and extract the priority features that dominate the result.
Step four, the embedded representation $\phi$ and the dimension-reduced embedded representations $\theta$ and $\gamma$ are combined to obtain the final output representation vector $O$. The steps are as follows: first, the embedded representations $\phi$ and $\theta$ generate an attention matrix $A$, normalized with a softmax:
$A = \mathrm{softmax}(\phi^{\top} \theta) \in \mathbb{R}^{N \times S}$
Next, the attention matrix $A$ is multiplied by the embedded representation $\gamma$, resulting in the final output representation vector $O$:
$O = A\,\gamma^{\top} \in \mathbb{R}^{N \times \hat{C}}$
According to this calculation process, the dimension $S$ of the reduced embedded representations $\theta$ and $\gamma$ is set far smaller than the dimension $N$. When $N \gg S$, the output dimension of the sparse attention computation is guaranteed to remain unchanged. The internal process can be described by the following formula:
$O = \mathrm{softmax}\big(\phi(X)^{\top}\, \theta_S(X)\big)\, \gamma_S(X)^{\top}$
where $\theta_S$ and $\gamma_S$ denote the dimension-reduced key and value embeddings.
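A self-contained PyTorch sketch of one such self-attention head follows, assuming $\hat{C} = C/2$, a kernel size of 3 for the reduction convolution, and a single stride-2 pooling step, none of which are fixed by the text.

```python
# Sketch of one subspace sparse self-attention head per the four steps above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAttentionHead(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c_hat = channels // 2                   # assumed embedded dimension
        # Step 1: three 1-D convolutions generate phi (Q), theta (K), gamma (V).
        self.w_phi = nn.Conv1d(channels, c_hat, 1)
        self.w_theta = nn.Conv1d(channels, c_hat, 1)
        self.w_gamma = nn.Conv1d(channels, c_hat, 1)
        # Step 3: Conv1d + ELU + stride-2 max-pool reduces N -> S on K and V.
        self.reduce = nn.Sequential(nn.Conv1d(c_hat, c_hat, 3, padding=1),
                                    nn.ELU(), nn.MaxPool1d(2))

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x.view(b, c, h * w)                 # Step 2: fold to N = H * W
        phi = self.w_phi(x)                     # (B, c_hat, N)
        theta = self.reduce(self.w_theta(x))    # (B, c_hat, S), S << N
        gamma = self.reduce(self.w_gamma(x))    # (B, c_hat, S)
        # Step 4: A = softmax(phi^T theta) is (B, N, S); O = A gamma^T keeps
        # the N x c_hat output shape regardless of S.
        attn = F.softmax(phi.transpose(1, 2) @ theta, dim=-1)
        return attn @ gamma.transpose(1, 2)     # (B, N, c_hat)
```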
It should be noted that the confusion (perplexity) is used to evaluate the quality of a language model; the confusion is the value obtained by applying an exponential operation to the cross entropy loss function. In some embodiments of the invention, inputting all the transition sentence subsequences into the language model judgment module comprises:
determining the words corresponding to the sentence subsequence according to the number of subsets contained in the pronunciation phoneme sequence corresponding to the sentence subsequence:
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a word, outputting the word;
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a plurality of words, outputting the word with the maximum expected value;
if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, calculating a confusion value of the sentence subsequence based on the confusion degree.
In a specific embodiment of the present invention, if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, then in the first iteration the words matching the first two subsets are combined in every possible way based on the following rules (see the sketch after this list):
1) the 50 combinations with the lowest confusion value are retained;
2) these combinations are then matched against the next subset of phonemes;
3) the 50 combinations with the lowest confusion value are again retained, and the iteration continues over the remaining subsets until the end of the sequence.
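As an illustration, this iteration can be sketched as follows; `lexicon` (mapping a phoneme subset to candidate words) and `perplexity` (scoring a word combination under the language model) are hypothetical interfaces, not named by the patent.

```python
# Sketch of the iterative matching rules above: extend combinations subset
# by subset, keeping only the `beam` (here 50) lowest-perplexity combinations.
import heapq

def infer_sentence(subsets, lexicon, perplexity, beam=50):
    combos = [[w] for w in lexicon(subsets[0])]           # words for the first subset
    for subset in subsets[1:]:
        combos = [c + [w] for c in combos for w in lexicon(subset)]
        combos = heapq.nsmallest(beam, combos, key=perplexity)  # rules 1) and 3)
    return min(combos, key=perplexity)                    # predicted target sentence
```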
In some embodiments of the invention, calculating a confusion value for the sentence subsequence in accordance with the confusion comprises:
matching words corresponding to a first subset of the sentence subsequence with words corresponding to a second subset to obtain a new subset, and calculating a first confusion value between the words based on the confusion degree to obtain a preset number of word combinations with the lowest first confusion value;
and matching the words corresponding to the new subset with the words corresponding to the next subset to obtain a further new subset, calculating a second confusion value to obtain a preset number of word combinations with the lowest second confusion value, and repeating until all the subsets are matched, thereby obtaining all target sentence sequences.
In a specific embodiment of the present invention, the confusion value calculation steps are as follows:
Firstly, the probability relation between the pronunciation phoneme sequence and the corresponding sentence is obtained, and then the word combination with the maximum probability is selected to obtain the inferred sentence. The probability relation between the pronunciation phoneme sequence and the corresponding sentence is as follows:
$P = \{p_1, p_2, \ldots, p_M\}$
$P(W_C \mid P) = \prod_{i=1}^{M} P(w_i \mid p_i)$
where $P$ is the phoneme sequence, $p_i$ corresponds to the $i$-th subset, $W_C$ represents any given combination of words, and $w_i$ corresponds to each matched word in the word string.
Then, the process of selecting the word combination with the highest probability to obtain the reasoning sentence is as follows:
Figure BDA0003972642340000121
Figure BDA0003972642340000122
Figure BDA0003972642340000123
is the combination with the highest probability that takes into account the subset of phonemes of each combination C belonging to the set of combinations C *
Entropy of information and P (w) for each word 1 ,w 2 ,…,w N ) In which (w) 1 ,w 2 ,…,w N ) Belonging to a word set W, and summing possible word sequences, i.e. transition sentence subsequences
Figure BDA0003972642340000124
The method comprises the following specific steps:
Figure BDA0003972642340000125
Figure BDA0003972642340000126
in the formula, N represents the number of words, and PP represents a confusion value.
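A minimal sketch of this computation follows, assuming a language model that exposes per-word conditional log2-probabilities (a hypothetical interface).

```python
# Confusion (perplexity) value: PP = 2^H with
# H = -(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}); the chain rule expands
# P(w_1..w_N) into the conditional probabilities queried below.
def perplexity(words, log2_prob):
    n = len(words)
    entropy = -sum(log2_prob(words[:i], words[i]) for i in range(n)) / n
    return 2 ** entropy   # a lower PP indicates a more plausible sentence
```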
Accordingly, the embodiment of the present application further provides a lip language recognition system based on a subspace sparse attention mechanism, which includes a processor and a memory connected to each other, where the processor is programmed or configured to execute the steps or functions in the lip language recognition method based on a subspace sparse attention mechanism provided in the foregoing method embodiments.
In summary, the lip language identification method, system and computer-readable storage medium based on the subspace sparse attention mechanism provided by the invention first obtain a lip region image sequence and extract a lip feature sequence from it, then input the lip feature sequence into a fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence, and finally input the pronunciation phoneme sequence into a sentence inference model built with the subspace sparse attention mechanism to obtain a target sentence sequence. The invention uses a special attention mechanism to enhance context information and predicts a long sentence sequence in a single forward operation, thereby greatly improving inference speed and accuracy and providing technical support for related applications.
Accordingly, the present application further provides a computer-readable storage medium, which is used for storing a computer-readable program or instruction, and when the program or instruction is executed by a processor, the step or the function in the lip language identification method based on the subspace sparse attention mechanism provided by the above method embodiments can be implemented.
Those skilled in the art will appreciate that all or part of the processes for implementing the methods of the embodiments described above can be implemented by a computer program stored in a computer-readable storage medium which, when executed, instructs the relevant hardware. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims (10)

1. A lip language identification method based on a subspace sparse attention mechanism is characterized by comprising the following steps:
obtaining a lip region image sequence, and extracting a lip feature sequence based on the lip region image sequence;
inputting the lip feature sequence into a preset, fully trained phoneme sequence extraction model to obtain a pronunciation phoneme sequence corresponding to the lip feature sequence;
and inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence.
2. The method for identifying lip language based on subspace sparse attention mechanism according to claim 1, wherein determining the trained phoneme sequence extraction model comprises:
initializing an LSTM model, taking a lip feature sequence corresponding to a lip region image sample as a training sample, and inputting the training sample into the LSTM model to obtain a prediction result of a pronunciation phoneme sequence;
obtaining the value of an LSTM model loss function according to the training sample and the prediction result;
and obtaining the fully trained phoneme sequence extraction model according to the value of the LSTM model loss function.
3. The lip language identification method based on the subspace sparse attention mechanism as claimed in claim 1, wherein the sentence inference model with the subspace sparse attention mechanism comprises a sentence inference network module, a language model decision module and an inference sentence sequence module; the step of inputting the pronunciation phoneme sequence into a sentence inference model with a subspace sparse self-attention mechanism to obtain a target sentence sequence comprises the following steps:
inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences;
inputting all the transition sentence subsequences into the language model judgment module, and calculating confusion values of all the transition sentence subsequences according to the confusion degree;
and selecting the transition sentence subsequence with the minimum confusion value based on the inference sentence sequence module to obtain a predicted target sentence sequence.
4. The lip language identification method based on the subspace sparse attention mechanism as claimed in claim 3, wherein the sentence inference network module comprises a plurality of sentence subsequence inference sub-modules, each of which comprises a multi-head self-attention mechanism module with a mask and a feed-forward network module; and inputting the pronunciation phoneme sequence into the sentence inference network module for sentence inference to obtain all transition sentence subsequences comprises:
and converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on a multi-head self-attention mechanism module with a mask and a feedforward network module in each sentence subsequence inference submodule.
5. The method for identifying lip language based on the subspace sparse attention mechanism as claimed in claim 4, wherein converting the vector corresponding to the pronunciation phoneme sequence into a corresponding sentence subsequence based on the multi-head self-attention mechanism module with a mask and the feedforward network module in each sentence subsequence inference submodule comprises:
processing the phoneme sequence by using a multi-head self-attention mechanism module with a mask to obtain a first vector;
multiplying the first vector by the vector corresponding to the pronunciation phoneme sequence to obtain a second vector;
performing layer normalization on the second vector and inputting it into a feedforward neural network module to perform a dimensionality reduction operation, obtaining a third vector;
and multiplying the third vector and the second vector, and performing layer normalization to obtain all transition sentence subsequences.
6. The lip language identification method based on the subspace sparse attention mechanism as claimed in claim 5, wherein the module execution process of the multi-head self-attention mechanism module with the mask comprises:
linearly transforming the input vector to obtain three matrices of query Q, key K and value V;
convolving the three matrices to respectively obtain corresponding word vectors;
performing dimensionality reduction operation on the word vector of the key K and the word vector of the value V;
and calculating the word vector of the query Q, the word vector of the key K and the word vector of the value V to obtain an output vector.
7. The method according to claim 3, wherein the inputting the entire subsequence of transition sentences into the language model determination module comprises:
determining the words corresponding to the sentence subsequence according to the number of subsets contained in the pronunciation phoneme sequence corresponding to the sentence subsequence:
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a word, outputting the word;
if the pronunciation phoneme sequence corresponding to the sentence subsequence only contains a subset and the subset is matched with a plurality of words, outputting the word with the maximum expected value;
if the pronunciation phoneme sequence corresponding to the sentence subsequence contains a plurality of subsets, calculating a confusion value of the sentence subsequence based on the confusion degree.
8. The method for lip language recognition based on subspace sparse attention mechanism according to claim 1, wherein calculating a confusion value of the sentence subsequence according to a confusion degree comprises:
matching words corresponding to the first subset of the sentence subsequence with words corresponding to the second subset to obtain a new subset, and calculating a first confusion value between the words based on the confusion degree to obtain a preset number of word combinations with the lowest first confusion value;
and matching the words corresponding to the new subset with the words corresponding to the next subset to obtain a further new subset, calculating a second confusion value to obtain a preset number of word combinations with the lowest second confusion value, and repeating until all the subsets are matched, thereby obtaining all target sentence sequences.
9. A lip language identification system based on a subspace sparse attention mechanism, comprising a processor and a memory connected to each other, characterized in that the processor is programmed or configured to perform the steps of the lip language identification method based on a subspace sparse attention mechanism as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium for storing a computer-readable program or instructions, which when executed by a processor, is capable of implementing the steps in the method for identifying lip language based on a subspace sparse attention mechanism as claimed in any one of claims 1 to 8.
CN202211518304.8A 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism Pending CN115910065A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211518304.8A CN115910065A (en) 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211518304.8A CN115910065A (en) 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism

Publications (1)

Publication Number Publication Date
CN115910065A true CN115910065A (en) 2023-04-04

Family

ID=86480372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211518304.8A Pending CN115910065A (en) 2022-11-30 2022-11-30 Lip language identification method, system and medium based on subspace sparse attention mechanism

Country Status (1)

Country Link
CN (1) CN115910065A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132682A (en) * 2024-05-07 2024-06-04 山东浪潮科学研究院有限公司 Large language model reasoning acceleration method and device based on sparse sliding window



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination