CN113642420B - Method, device and equipment for recognizing lip language - Google Patents
- Publication number
- CN113642420B (application CN202110843573.0A / CN202110843573A)
- Authority
- CN
- China
- Prior art keywords
- lip
- sequence
- image sequence
- coding
- features
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a method, a device and equipment for recognizing lip language, comprising the steps of: obtaining video data and processing the video data to obtain a lip image sequence; performing bidirectional time-sequence feature extraction on the image sequence to generate apparent features of the lip image sequence; invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features, generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result; and extracting the coding features of the hidden layer and obtaining a word prediction sequence according to the coding features. The algorithm complexity and time complexity of existing lip language recognition algorithms are reduced while high accuracy is maintained.
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a lip language recognition method, device and equipment.
Background
With the rapid development of computer technology, internet industry and the like, the development of artificial intelligence has entered a new stage. With the achievement of deep learning in the fields of computer vision, natural language processing and the like, a lip language recognition technology based on deep learning becomes a research hotspot.
Lip recognition refers to the process of understanding information expressed by visual information (including facial muscles, lip movements, tongue, etc.), and has very important application value in real life.
Existing lip language recognition targets short video data lasting 1-3 seconds with no shot transitions, so the recognition is relatively simple; but most video in real life is long, and there are contextual relationships between the different shots in the video. For long video data, the existing recognition methods consume a great deal of time and hardware resources to perform lip language recognition.
In view of this, the present application is presented.
Disclosure of Invention
The invention discloses a method, a device and equipment for recognizing a lip language, which aim to reduce the complexity and time complexity of the existing lip language recognition algorithm and simultaneously maintain higher accuracy.
The first embodiment of the invention provides a method for identifying a lip language, which comprises the following steps:
acquiring video data and processing the video data to obtain a lip image sequence;
performing bidirectional time sequence feature extraction on the image sequence to generate apparent features of the lip image sequence;
invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features, generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result;
and extracting the coding characteristics of the hidden layer, and acquiring a word prediction sequence according to the coding characteristics.
Preferably, the acquiring video data and processing the video data to obtain the lip image sequence specifically includes:
performing data frame cutting on the video data to generate an image sequence;
a face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information;
and cutting the image sequence according to the target point information to obtain a lip image sequence.
Preferably, the performing bidirectional time-sequence feature extraction on the image sequence to generate the apparent features of the lip image sequence specifically includes:
calling a three-dimensional convolution network, and taking a lip image sequence in a preset format as an input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
Preferably, the invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features and generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result specifically includes:
the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
Preferably, the extracting the coding feature of the hidden layer and obtaining the word prediction sequence according to the coding feature specifically includes:
extracting forward coding features and reverse coding features of the hidden layer;
invoking an attention mechanism to the forward coding feature and the backward coding feature to obtain a feature vector of the apparent feature;
and calling the feature vector of the loss function to perform sequence decoding so as to obtain a predicted sequence of the word.
A second embodiment of the present invention provides a device for recognizing a lip language, including:
the lip image sequence acquisition unit is used for acquiring video data and processing the video data to acquire a lip image sequence;
the apparent feature generating unit is used for carrying out bidirectional time sequence feature extraction on the image sequence to generate apparent features of the lip image sequence;
the initialization unit is used for calling an LSTM model, performing boundary detection of shot transformation on the change of the apparent characteristics, generating a detection result, and initializing an implicit layer and a memory layer of the LSTM model according to the detection result;
and the word prediction sequence acquisition unit is used for extracting the coding characteristics of the hidden layer and acquiring the word prediction sequence according to the coding characteristics.
Preferably, the lip image sequence acquiring unit is specifically configured to:
performing data frame cutting on the video data to generate an image sequence;
a face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information;
and cutting the image sequence according to the target point information to obtain a lip image sequence.
Preferably, the apparent feature generation unit is specifically configured to:
calling a three-dimensional convolution network, and taking a lip image sequence in a preset format as an input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
Preferably, the initialization unit is specifically configured to:
the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
A third embodiment of the present invention provides a device for identifying a lip language, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor executing the computer program to implement a method for identifying a lip language according to any one of the above.
According to the lip language recognition method, device and equipment provided by the invention, the lip image sequence is obtained by processing long video data, and forward-sequence and reverse-sequence feature extraction is performed on the lip image sequence, so that the bidirectional lip sequence can better mine the time-sequence information in the sequence: the forward time sequence can capture global information, and the reverse time sequence can attend to very important local information, thereby extracting key information. In addition, after the forward and reverse feature codes are obtained, the features carrying key information are further selected by combining an attention mechanism. The algorithm complexity and time complexity are reduced while a high accuracy is maintained.
Drawings
Fig. 1 is a schematic flow chart of a method for recognizing lip language according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a processing procedure for recognizing lip language according to the present invention;
fig. 3 is a schematic block diagram of a device for recognizing lip language according to a second embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if determined" or "if (stated condition or event) is detected" may be interpreted as "when determined" or "in response to determination" or "when (stated condition or event) is detected" or "in response to detection of (stated condition or event)", depending on the context.
References to "first/second" in the embodiments are merely to distinguish similar objects and do not represent a particular ordering of the objects; it should be understood that "first/second" may be interchanged in a particular order or precedence where allowed, so that the embodiments described herein can be implemented in sequences other than those illustrated or described herein.
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention discloses a method, a device and equipment for recognizing a lip language, which aim to reduce the complexity and time complexity of the existing lip language recognition algorithm and simultaneously maintain higher accuracy.
Referring to fig. 1 and 2, a first embodiment of the present invention provides a method for recognizing a lip language, which may be performed by a device for recognizing a lip language (hereinafter referred to as a recognition device), and in particular, one or more processors in the recognition device, so as to implement the following steps:
s101, acquiring video data, and processing the video data to obtain a lip image sequence;
In this embodiment, the recognition device may be a terminal device (such as a smart phone, a smart printer or another smart device) configured with an image capturing apparatus for capturing video data. In particular, the recognition device may store data for performing the lip language recognition method, so as to recognize lip language from the video data.
In this embodiment, performing data frame slicing on the video data to generate an image sequence; wherein the video data may be long video data.
A face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information; the face detection model may be MTCNN.
And cutting the image sequence according to the target point information to obtain a lip image sequence.
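To illustrate the cropping step of S101, a minimal sketch is given below, assuming the face detection model supplies two mouth-corner key points per frame; the landmark values, the 64-pixel crop size and the dummy frames are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def crop_lip_region(frame, left_mouth, right_mouth, size=64):
    """Crop a square lip patch centered between the two mouth-corner
    landmarks returned by a face detector (e.g. MTCNN)."""
    cx = int((left_mouth[0] + right_mouth[0]) / 2)
    cy = int((left_mouth[1] + right_mouth[1]) / 2)
    half = size // 2
    h, w = frame.shape[:2]
    # clamp the crop window to the frame boundaries
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    return frame[y0:y1, x0:x1]

# the lip image sequence is then the per-frame crops stacked in time
frames = np.zeros((5, 120, 160, 3), dtype=np.uint8)   # dummy video frames
seq = np.stack([crop_lip_region(f, (70, 80), (90, 80)) for f in frames])
print(seq.shape)  # (5, 64, 64, 3)
```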
S102, carrying out bidirectional time sequence feature extraction on the image sequence to generate apparent features of the lip image sequence;
in the embodiment, a three-dimensional convolution network is called, and a lip image sequence in a preset format is used as input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
The feature coding network for the forward and reverse lip sequences may be a three-dimensional convolution network whose input is the lip image sequence in the preset format and whose output is a feature tensor (Tensor). The three-dimensional convolution network may be structured as a convolution layer of 64 three-dimensional convolution kernels, followed by batch normalization (Batch Normalization, BN) and rectified linear units (Rectified Linear Units, ReLU). The tensor output in this way is subjected to adaptive average pooling in the spatial dimension to generate the 1024-dimensional apparent features.
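The spatial adaptive average pooling step reduces each time step of the three-dimensional convolution output to a single feature vector. A numpy sketch, with the tensor sizes assumed for illustration:

```python
import numpy as np

def adaptive_avg_pool_spatial(x):
    """Average-pool a (T, C, H, W) feature tensor over its spatial
    dimensions, leaving one C-dimensional apparent feature per time step."""
    return x.mean(axis=(2, 3))            # -> (T, C)

# e.g. a 3D-CNN output with 1024 channels per frame (sizes assumed)
feat = np.random.rand(75, 1024, 4, 8)
apparent = adaptive_avg_pool_spatial(feat)
print(apparent.shape)  # (75, 1024)
```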
It should be noted that, the feature extraction process of the lip image sequence is divided into forward sequence feature extraction and reverse sequence feature extraction, and the main purposes are two, namely, firstly, the time sequence information in the sequence can be better mined through the bidirectional lip sequence, secondly, the forward time sequence information can capture global information, and the reverse time sequence can pay attention to very important local information, so as to play a role in extracting key information.
S103, invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features, generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result;
specifically: the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
Note that, in the LSTM model, the memory cell c_t keeps the history information up to the current time; the input gate i_t controls whether the current input is added into the memory cell; the forget gate f_t controls whether the memory cell c_{t-1} of the last moment is forgotten; and the output gate o_t controls whether the information of the current memory cell is output.
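The gate behaviour described above can be illustrated with a plain numpy LSTM step. This is a generic textbook cell, not the patent's exact parameterization; the sizes and the stacked-weight layout are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The memory cell c keeps the history up to the
    current time; the input gate i controls whether the current input is
    written to the cell, the forget gate f controls whether the previous
    cell c_prev is kept, and the output gate o controls what is emitted."""
    dh = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four gates in one affine map
    i = sigmoid(z[0 * dh:1 * dh])         # input gate
    f = sigmoid(z[1 * dh:2 * dh])         # forget gate
    o = sigmoid(z[2 * dh:3 * dh])         # output gate
    g = np.tanh(z[3 * dh:4 * dh])         # candidate cell content
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 1024, 256                     # sizes assumed for illustration
W = rng.standard_normal((4 * d_h, d_in)) * 0.01
U = rng.standard_normal((4 * d_h, d_h)) * 0.01
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # five frames of apparent features
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (256,) (256,)
```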
At each time step, a decision is made as to whether to pass the contents of the hidden layer and memory unit on to the next time instant or to initialize a new unit, depending on a time boundary detection unit that allows the encoder to independently process variable-length blocks of the input video. The boundary of each block is given by a learnable function that depends on the input, rather than being preset.
The value of the boundary detection function s_t is obtained by linearly combining the current input x_t, namely the lip sequence features, with the hidden layer h_{t-1} of the previous moment, followed by activation using the sigmoid function:

s_t = \sigma( v^T ( W_x x_t + W_h h_{t-1} + b ) )

where v is a learnable row vector, W_x and W_h are learnable weights, and b is the bias value.

If the value of s_t is 1, a shot transition is detected, and the hidden layer h_{t-1} and the memory cell c_{t-1} are not carried over. If the value of s_t is 0, the current frame and the previous and next frames are in the same video segment, and the hidden layer h_{t-1} and the memory cell c_{t-1} are updated as usual. According to the output value of the boundary detector s_t, before the memory cell is updated, i.e. when a new video clip starts, the network hidden layer and memory unit are transferred or reinitialized using the following equations:

h_{t-1} \leftarrow (1 - s_t) \cdot h_{t-1}
c_{t-1} \leftarrow (1 - s_t) \cdot c_{t-1}

where h_{t-1} is the hidden layer unit of the last moment and c_{t-1} is the memory cell of the last moment.
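The boundary-gated transfer can be sketched in numpy as follows. Hardening the sigmoid output to {0, 1} by rounding is an assumption made for illustration, and the weights here are random, carrying no trained meaning:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_gate(x_t, h_prev, c_prev, v, Wx, Wh, b):
    """Boundary detector s = sigmoid(v . (Wx x_t + Wh h_prev + b)).
    When s rounds to 1 (shot transition) the previous hidden state and
    memory cell are reinitialized to zero; otherwise they are passed on:
        h <- (1 - s) * h_prev,  c <- (1 - s) * c_prev."""
    s = sigmoid(v @ (Wx @ x_t + Wh @ h_prev + b))
    s = np.round(s)                       # harden to {0, 1} (step approximation)
    return (1 - s) * h_prev, (1 - s) * c_prev, s

rng = np.random.default_rng(0)
d_in, d_h = 1024, 256
x = rng.standard_normal(d_in)
h = rng.standard_normal(d_h)
c = rng.standard_normal(d_h)
v = rng.standard_normal(d_h)
Wx = rng.standard_normal((d_h, d_in)) * 0.01
Wh = rng.standard_normal((d_h, d_h)) * 0.01
b = np.zeros(d_h)

h2, c2, s = boundary_gate(x, h, c, v, Wx, Wh, b)
# s == 1 -> states reset to zero; s == 0 -> states passed through unchanged
assert (s == 1 and not h2.any()) or (s == 0 and np.allclose(h2, h))
```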
S104, extracting the coding features of the hidden layer, and acquiring word prediction sequences according to the coding features.
Specifically: extracting forward coding features and reverse coding features of the hidden layer;
invoking an attention mechanism to the forward coding feature and the backward coding feature to obtain a feature vector of the apparent feature;
and calling the feature vector of the loss function to perform sequence decoding so as to obtain a predicted sequence of the word.
After boundary detection has been performed on the features of the forward and reverse lip sequences, each LSTM unit maintains the current coding information, and the hidden layer units are extracted as the forward coding features h^f and the reverse coding features h^b. An attention mechanism is then employed to pick out the features carrying key information.
The forward-encoded features h^f_t pass through the attention mechanism as follows:

e_t = w^T tanh( W_1 h^f_t + b_1 )
\alpha_t = exp(e_t) / \sum_k exp(e_k)
c^f = \sum_t \alpha_t h^f_t

where w, W_1 and b_1 are all learnable parameters, e_t is the relative score, and \alpha_t is the attention weight.
Likewise, the reverse-encoded features h^b_t pass through the attention mechanism:

e_t = w^T tanh( W_1 h^b_t + b_1 )
\alpha_t = exp(e_t) / \sum_k exp(e_k)
c^b = \sum_t \alpha_t h^b_t

where w, W_1 and b_1 are all learnable parameters, e_t is the relative score, and \alpha_t is the attention weight.
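A minimal numpy sketch of such an attention pooling step, assuming the common additive form; the parameter names and sizes are illustrative:

```python
import numpy as np

def attention_pool(H, w, W1, b1):
    """Additive attention over a (T, d) sequence of coded features:
        e_t = w . tanh(W1 h_t + b1)      (relative score)
        a_t = softmax(e)_t               (attention weight)
        c   = sum_t a_t * h_t            (pooled feature vector)"""
    e = np.tanh(H @ W1.T + b1) @ w        # (T,) scores
    e = e - e.max()                       # numerical stability
    a = np.exp(e) / np.exp(e).sum()
    return a @ H, a

rng = np.random.default_rng(1)
T, d, da = 10, 256, 64                    # sequence length, feature dims (assumed)
H = rng.standard_normal((T, d))
w = rng.standard_normal(da)
W1 = rng.standard_normal((da, d))
b1 = rng.standard_normal(da)

ctx, att = attention_pool(H, w, W1, b1)
print(ctx.shape, round(att.sum(), 6))  # (256,) 1.0
```

The same pooling would be applied separately to the forward and reverse coded features, and the two pooled vectors combined before decoding.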
After the feature vector c is obtained, the result at each frame needs to be classified. In this embodiment the classification is performed using a full connection layer and SoftMax. In the training phase, the CTC loss is used for training; the CTC loss function is calculated as:

L_{CTC} = - ln \sum_{\pi \in B^{-1}(l)} \prod_{t=1}^{T} y^t_{\pi_t}

where T is the length of the input sequence, y^t_{\pi_t} is the SoftMax probability of outputting the label \pi_t at time t; \pi is a CTC path sequence; l is the ground truth (i.e. the label); and B^{-1}(l) denotes the set of CTC paths that can be mapped to the label l.
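The CTC loss can be checked on a toy example by brute-force path enumeration, which follows the formula directly: sum the probabilities of every length-T path that collapses to the label, then take the negative log. Enumerating paths is exponential and for illustration only; real implementations use the CTC forward-backward recursion. The probability table below is made up:

```python
import numpy as np
from itertools import product

def collapse(path, blank=0):
    """CTC mapping B: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return tuple(out)

def ctc_loss_bruteforce(probs, label, blank=0):
    """-log sum over all length-T paths pi with B(pi) == label of
    prod_t probs[t, pi_t]; probs is a (T, V) SoftMax output."""
    T, V = probs.shape
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path, blank) == tuple(label):
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return -np.log(total)

# tiny example: T=3 frames, vocab = {blank, 'a', 'b'}
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8]])
loss = ctc_loss_bruteforce(probs, label=[1, 2])   # target "ab"
print(round(float(loss), 4))  # 0.7154
```

Five paths collapse to "ab" here — (a,b,b), (a,a,b), (a,b,–), (–,a,b) and (a,–,b) — and their probabilities sum to 0.489, giving -ln(0.489).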
In the test phase, the decoding of the sequence is performed using CTC prefix beam search, resulting in the word prediction sequence.
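For illustration, a best-path (greedy) decode is sketched below; the embodiment itself uses CTC prefix beam search, which explores multiple prefixes instead of a single argmax path. The frame-wise probabilities are made up:

```python
import numpy as np

def ctc_greedy_decode(probs, blank=0):
    """Best-path approximation to CTC decoding: take the argmax label at
    each frame, merge repeated labels, then drop blanks."""
    best = probs.argmax(axis=1)
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(int(k))
        prev = k
    return out

probs = np.array([[0.1, 0.8, 0.1],    # frame-wise SoftMax outputs
                  [0.2, 0.7, 0.1],
                  [0.9, 0.05, 0.05],
                  [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(probs))  # [1, 2]
```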
Compared with the prior art, the embodiment has a plurality of advantages and beneficial effects, and is specifically embodied in the following aspects:
the embodiment is used for researching the problem of lip language identification in long videos, and is more close to practical application.
A boundary detector is designed to recognize shot transitions in long video, so that the long video can be segmented and lip language recognition performed on the segmented video clips.
This embodiment builds a bidirectional time-sequence model combined with an attention mechanism, and extracts key video frames through the forward and reverse video-frame time sequences together with the attention mechanism, thereby removing redundancy in the video and extracting discriminative features, which facilitates lip language recognition.
Referring to fig. 3, a second embodiment of the present invention provides a device for identifying a lip language, including:
a lip image sequence acquisition unit 201, configured to acquire video data, and process the video data to obtain a lip image sequence;
an apparent feature generating unit 202, configured to perform feature extraction of bidirectional time sequence on the image sequence, and generate apparent features of the lip image sequence;
an initialization unit 203, configured to invoke an LSTM model, perform boundary detection of shot transformation on the change of the apparent feature, generate a detection result, and initialize an implicit layer and a memory layer of the LSTM model according to the detection result;
a word prediction sequence obtaining unit 204, configured to extract coding features of the hidden layer, and obtain a word prediction sequence according to the coding features.
Preferably, the lip image sequence acquiring unit is specifically configured to:
performing data frame cutting on the video data to generate an image sequence;
a face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information;
and cutting the image sequence according to the target point information to obtain a lip image sequence.
Preferably, the apparent feature generation unit is specifically configured to:
calling a three-dimensional convolution network, and taking a lip image sequence in a preset format as an input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
Preferably, the initialization unit is specifically configured to:
the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
A third embodiment of the present invention provides a device for identifying a lip language, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor executing the computer program to implement a method for identifying a lip language according to any one of the above.
According to the lip language recognition method, device and equipment provided by the invention, the lip image sequence is obtained by processing long video data, and forward-sequence and reverse-sequence feature extraction is performed on the lip image sequence, so that the bidirectional lip sequence can better mine the time-sequence information in the sequence: the forward time sequence can capture global information, and the reverse time sequence can attend to very important local information, thereby extracting key information. In addition, after the forward and reverse feature codes are obtained, the features carrying key information are further selected by combining an attention mechanism. The algorithm complexity and time complexity are reduced while a high accuracy is maintained.
Illustratively, the computer programs described in the third and fourth embodiments of the present invention may be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program in the identification device implementing a lip language. For example, the device described in the second embodiment of the present invention.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general processor may be a microprocessor or the processor may be any conventional processor, etc., and the processor is a control center of the method for identifying a lip language, and uses various interfaces and lines to connect various parts of the entire method for implementing the method for identifying a lip language.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the lip language recognition method by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function, a text conversion function, etc.); the data storage area may store data created according to the use of the cellular phone (such as audio data, text message data, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The modules, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiment by instructing the relevant hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program can implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described apparatus embodiments are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationships between modules indicate that they have communication connections with each other, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
The present invention is not limited to the above-described embodiments; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A method for recognizing a lip language, comprising:
acquiring video data and processing the video data to obtain a lip image sequence;
performing bidirectional temporal feature extraction on the image sequence to generate apparent features of the lip image sequence, specifically: invoking a three-dimensional convolutional network and taking the lip image sequence in a preset format as the input of the three-dimensional convolutional network; acquiring the tensor output by the three-dimensional convolutional network and performing adaptive average pooling on the tensor over the spatial dimensions to generate the apparent features;
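As an illustration of the pooling step above, here is a minimal NumPy sketch; the (C, T, H, W) tensor layout, the shapes, and the function name are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def spatial_adaptive_avg_pool(tensor: np.ndarray) -> np.ndarray:
    """Collapse the spatial dimensions (H, W) of a 3D-CNN output
    tensor of assumed shape (C, T, H, W) by averaging, leaving one
    C-dimensional apparent-feature vector per frame: shape (T, C)."""
    pooled = tensor.mean(axis=(2, 3))  # average over H and W -> (C, T)
    return pooled.T                    # transpose to (T, C)

# Example: 64 channels, 25 frames, 12x24 lip crops (illustrative sizes)
features = spatial_adaptive_avg_pool(np.zeros((64, 25, 12, 24)))
print(features.shape)  # (25, 64)
```

In a deep-learning framework this corresponds to an adaptive average pooling layer with output size 1 over the spatial axes, followed by flattening.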
invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features to generate a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result, specifically comprising: linearly combining the apparent features with the hidden layer at the previous moment and inputting the result into a boundary detection function; invoking a sigmoid function to activate the boundary detection function so as to generate the detection result; judging, according to the detection result, whether the current frame and its preceding and following frames belong to the same video segment; and if so, initializing the hidden layer and memory layer of the LSTM model;
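The boundary-detection step can be sketched as a sigmoid gate over a linear combination of the current apparent feature and the previous hidden state; the weight names, the threshold value, and the reset-to-zeros policy below are illustrative assumptions, not details stated by the patent:

```python
import numpy as np

def boundary_gate(x_t, h_prev, w_x, w_h, b):
    """Linearly combine the current apparent feature x_t with the
    previous hidden state h_prev, then squash through a sigmoid to
    get a shot-boundary score in (0, 1)."""
    z = w_x @ x_t + w_h @ h_prev + b
    return 1.0 / (1.0 + np.exp(-z))

def maybe_reset_state(score, h, c, threshold=0.5):
    """Reinitialize the LSTM hidden (h) and memory/cell (c) state
    when the score indicates a shot boundary; threshold is assumed."""
    if score > threshold:
        return np.zeros_like(h), np.zeros_like(c)
    return h, c

rng = np.random.default_rng(0)
x_t, h_prev = rng.standard_normal(8), rng.standard_normal(4)
score = boundary_gate(x_t, h_prev, rng.standard_normal(8),
                      rng.standard_normal(4), 0.0)
h, c = maybe_reset_state(1.0, np.ones(4), np.ones(4))  # forced boundary
```

With a forced score of 1.0 the state is zeroed, so the LSTM starts fresh at the new video segment.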
extracting coding features of the hidden layer and obtaining a word prediction sequence according to the coding features, specifically:
extracting the forward coding features and the reverse coding features of the hidden layer;
applying an attention mechanism to the forward coding features and the reverse coding features to obtain a feature vector of the apparent features;
invoking a loss function to perform sequence decoding on the feature vector so as to obtain a predicted word sequence;
after the features of the forward lip sequence and the reverse lip sequence are subjected to boundary detection, each LSTM unit maintains the current coding information, and the hidden layer units are extracted as the forward coding feature $h^{f}_{t}$ and the reverse coding feature $h^{b}_{t}$; an attention mechanism is then employed to pick out the features carrying key information;
the forward-encoded features $h^{f}_{t}$ are passed through the attention mechanism as follows:
$e_{t}=w^{\top}\tanh\bigl(W h_{t-1}+V h^{f}_{t}+b\bigr),\qquad \alpha_{t}=\operatorname{softmax}(e_{t})$
wherein $W$, $V$, $b$ and $w$ are all learnable parameters, $e_{t}$ is the relative score, $\alpha_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment;
likewise, the reverse-encoded features $h^{b}_{t}$ are passed through the attention mechanism as follows:
$e'_{t}=w'^{\top}\tanh\bigl(W' h_{t-1}+V' h^{b}_{t}+b'\bigr),\qquad \alpha'_{t}=\operatorname{softmax}(e'_{t})$
wherein $W'$, $V'$, $b'$ and $w'$ are all learnable parameters, $e'_{t}$ is the relative score, $\alpha'_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment.
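The claim's attention formulas were published as inline images and did not survive extraction, so the additive-attention sketch below is an assumption consistent with the quantities the text names (learnable parameters $W$, $V$, $b$, $w$; a relative score; a softmax attention weight; the hidden unit at the previous moment), not the patent's verbatim equations:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def additive_attention(h_enc, h_prev, W, V, b, w):
    """Score each encoded feature h_enc[t] against the previous
    hidden state h_prev (relative score e_t), normalize the scores
    with softmax into attention weights alpha_t, and return the
    weighted sum of the encoded features as the feature vector."""
    scores = np.array([w @ np.tanh(W @ h_prev + V @ h_t + b)
                       for h_t in h_enc])
    alpha = softmax(scores)                        # attention weights
    context = (alpha[:, None] * h_enc).sum(axis=0) # feature vector
    return context, alpha

rng = np.random.default_rng(1)
T, d = 5, 4                                   # illustrative sizes
h_enc = rng.standard_normal((T, d))
ctx, alpha = additive_attention(h_enc, rng.standard_normal(d),
                                rng.standard_normal((d, d)),
                                rng.standard_normal((d, d)),
                                rng.standard_normal(d),
                                rng.standard_normal(d))
```

The same routine would be applied once to the forward coding features and once to the reverse coding features, with separate parameter sets.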
2. The method for recognizing a lip language according to claim 1, wherein the acquiring video data and processing the video data to obtain the lip image sequence specifically comprises:
splitting the video data into frames to generate an image sequence;
invoking a face detection model to perform face detection on the image sequence and obtain target point information, wherein the target point information comprises face calibration information and key point information;
and cropping the image sequence according to the target point information to obtain the lip image sequence.
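The cropping step of claim 2 can be sketched as slicing a bounding box around the mouth keypoints returned by a face-detection model; the coordinate layout, margin size, and function name are illustrative assumptions (a real system would use the keypoint format of its detector, e.g. a 68-point landmark layout):

```python
import numpy as np

def crop_lip_region(frame, mouth_points, margin=8):
    """Crop a lip patch from one frame (H x W array) using mouth
    keypoint (x, y) coordinates from a face-detection model.
    The margin padding is an illustrative choice, clipped to the
    frame borders."""
    xs = [p[0] for p in mouth_points]
    ys = [p[1] for p in mouth_points]
    h, w = frame.shape[:2]
    x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, w)
    y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, h)
    return frame[y0:y1, x0:x1]

frame = np.zeros((120, 160), dtype=np.uint8)   # dummy grayscale frame
lip = crop_lip_region(frame, [(70, 80), (90, 80), (80, 95)])
print(lip.shape)  # (31, 36)
```

Applying this per frame yields the lip image sequence that feeds the three-dimensional convolutional network.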
3. A lip language recognition apparatus, comprising:
the lip image sequence acquisition unit is used for acquiring video data and processing the video data to acquire a lip image sequence;
the apparent feature generating unit is used for performing bidirectional temporal feature extraction on the image sequence to generate apparent features of the lip image sequence, and is specifically used for: invoking a three-dimensional convolutional network and taking the lip image sequence in a preset format as the input of the three-dimensional convolutional network; acquiring the tensor output by the three-dimensional convolutional network and performing adaptive average pooling on the tensor over the spatial dimensions to generate the apparent features;
the initialization unit is used for invoking the LSTM model, performing boundary detection of shot transitions on the change of the apparent features to generate a detection result, and initializing the hidden layer and memory layer of the LSTM model according to the detection result, and is specifically used for: linearly combining the apparent features with the hidden layer at the previous moment and inputting the result into a boundary detection function; invoking a sigmoid function to activate the boundary detection function so as to generate the detection result; judging, according to the detection result, whether the current frame and its preceding and following frames belong to the same video segment; and if so, initializing the hidden layer and memory layer of the LSTM model;
the word prediction sequence obtaining unit is used for extracting the coding features of the hidden layer and obtaining a word prediction sequence according to the coding features, specifically:
extracting the forward coding features and the reverse coding features of the hidden layer;
applying an attention mechanism to the forward coding features and the reverse coding features to obtain a feature vector of the apparent features;
invoking a loss function to perform sequence decoding on the feature vector so as to obtain a predicted word sequence;
it should be noted that, after the features of the forward lip sequence and the reverse lip sequence are subjected to boundary detection, each LSTM unit keeps the current coding information, and the hidden layer units are extracted as the forward coding feature $h^{f}_{t}$ and the reverse coding feature $h^{b}_{t}$; an attention mechanism is then employed to pick out the features carrying key information;
the forward-encoded features $h^{f}_{t}$ are passed through the attention mechanism as follows:
$e_{t}=w^{\top}\tanh\bigl(W h_{t-1}+V h^{f}_{t}+b\bigr),\qquad \alpha_{t}=\operatorname{softmax}(e_{t})$
wherein $W$, $V$, $b$ and $w$ are all learnable parameters, $e_{t}$ is the relative score, $\alpha_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment;
likewise, the reverse-encoded features $h^{b}_{t}$ are passed through the attention mechanism as follows:
$e'_{t}=w'^{\top}\tanh\bigl(W' h_{t-1}+V' h^{b}_{t}+b'\bigr),\qquad \alpha'_{t}=\operatorname{softmax}(e'_{t})$
wherein $W'$, $V'$, $b'$ and $w'$ are all learnable parameters, $e'_{t}$ is the relative score, $\alpha'_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment.
4. The lip language recognition apparatus according to claim 3, wherein the lip image sequence acquisition unit is specifically configured to:
split the video data into frames to generate an image sequence;
invoke a face detection model to perform face detection on the image sequence and obtain target point information, wherein the target point information comprises face calibration information and key point information;
and crop the image sequence according to the target point information to obtain the lip image sequence.
5. A device for recognizing a lip language, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for recognizing a lip language according to claim 1 or 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110843573.0A CN113642420B (en) | 2021-07-26 | 2021-07-26 | Method, device and equipment for recognizing lip language |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113642420A CN113642420A (en) | 2021-11-12 |
CN113642420B true CN113642420B (en) | 2024-04-16 |
Family
ID=78418314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110843573.0A Active CN113642420B (en) | 2021-07-26 | 2021-07-26 | Method, device and equipment for recognizing lip language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642420B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409195A (en) * | 2018-08-30 | 2019-03-01 | 华侨大学 | A neural network-based lip reading recognition method and system
CN110443129A (en) * | 2019-06-30 | 2019-11-12 | 厦门知晓物联技术服务有限公司 | Chinese lip reading recognition method based on deep learning
CN110633683A (en) * | 2019-09-19 | 2019-12-31 | 华侨大学 | Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN111178157A (en) * | 2019-12-10 | 2020-05-19 | 浙江大学 | Tone-based cascaded sequence-to-sequence model for Chinese lip language recognition
Non-Patent Citations (1)
Title |
---|
Research on Key Technologies for Social Relationship Extraction of Persons in Videos; Lü Jinna; China Doctoral Dissertations Full-text Database, Information Science and Technology (No. 8); 61-74 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111164601B (en) | Emotion recognition method, intelligent device and computer readable storage medium | |
CN110188829B (en) | Neural network training method, target recognition method and related products | |
CN109117777B (en) | Method and device for generating information | |
CN110119757B (en) | Model training method, video category detection method, device, electronic equipment and computer readable medium | |
CN110602527A (en) | Video processing method, device and storage medium | |
CN109413510B (en) | Video abstract generation method and device, electronic equipment and computer storage medium | |
CN111914076B (en) | User image construction method, system, terminal and storage medium based on man-machine conversation | |
KR20170026222A (en) | Method and device for classifying an object of an image and corresponding computer program product and computer-readable medium | |
Qiao et al. | Hidden markov model based dynamic texture classification | |
CN109389076B (en) | Image segmentation method and device | |
CN111108508B (en) | Face emotion recognition method, intelligent device and computer readable storage medium | |
CN112804558B (en) | Video splitting method, device and equipment | |
CN114239717A (en) | Model training method, image processing method and device, electronic device and medium | |
CN114723646A (en) | Image data generation method with label, device, storage medium and electronic equipment | |
CN111986204A (en) | Polyp segmentation method and device and storage medium | |
CN111652878B (en) | Image detection method, image detection device, computer equipment and storage medium | |
CN113689527B (en) | Training method of face conversion model and face image conversion method | |
CN113642420B (en) | Method, device and equipment for recognizing lip language | |
CN117218346A (en) | Image generation method, device, computer readable storage medium and computer equipment | |
CN116844006A (en) | Target identification method and device, electronic equipment and readable storage medium | |
CN112487903B (en) | Gait data generation method and device based on countermeasure network | |
CN114495230A (en) | Expression recognition method and device, electronic equipment and storage medium | |
CN112989869A (en) | Optimization method, device and equipment of face quality detection model and storage medium | |
CN112115740A (en) | Method and apparatus for processing image | |
CN116049446B (en) | Event extraction method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||