CN113642420B - Method, device and equipment for recognizing lip language - Google Patents
- Publication number
- CN113642420B (application CN202110843573.0A / CN202110843573A)
- Authority
- CN
- China
- Prior art keywords
- lip
- sequence
- image sequence
- coding
- features
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a method, a device and equipment for recognizing lip language, comprising the steps of: obtaining video data and processing the video data to obtain a lip image sequence; performing bidirectional time-sequence feature extraction on the image sequence to generate apparent features of the lip image sequence; invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features, generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result; and extracting the coding features of the hidden layer and obtaining a word prediction sequence according to the coding features. The algorithm complexity and time complexity of existing lip language recognition algorithms are reduced while high accuracy is maintained.
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a lip language recognition method, device and equipment.
Background
With the rapid development of computer technology, internet industry and the like, the development of artificial intelligence has entered a new stage. With the achievement of deep learning in the fields of computer vision, natural language processing and the like, a lip language recognition technology based on deep learning becomes a research hotspot.
Lip recognition refers to the process of understanding information expressed by visual information (including facial muscles, lip movements, tongue, etc.), and has very important application value in real life.
Existing lip language recognition targets short video data lasting 1-3 seconds with no shot transitions, so the recognition is relatively simple; but most video in real life is long, and there are contextual relationships between the different shots in the video. For long video data, the existing recognition methods consume a great deal of time and hardware resources to perform lip language recognition.
In view of this, the present application is presented.
Disclosure of Invention
The invention discloses a method, a device and equipment for recognizing a lip language, which aim to reduce the complexity and time complexity of the existing lip language recognition algorithm and simultaneously maintain higher accuracy.
The first embodiment of the invention provides a method for identifying a lip language, which comprises the following steps:
acquiring video data and processing the video data to obtain a lip image sequence;
performing bidirectional time sequence feature extraction on the image sequence to generate apparent features of the lip image sequence;
invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features, generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result;
and extracting the coding characteristics of the hidden layer, and acquiring a word prediction sequence according to the coding characteristics.
Preferably, the acquiring video data and processing the video data to obtain the lip image sequence specifically includes:
performing data frame cutting on the video data to generate an image sequence;
a face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information;
and cutting the image sequence according to the target point information to obtain a lip image sequence.
Preferably, the performing bidirectional time-sequence feature extraction on the image sequence to generate the apparent features of the lip image sequence specifically includes:
calling a three-dimensional convolution network, and taking a lip image sequence in a preset format as an input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
Preferably, the invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features and generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result specifically includes:
the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
Preferably, the extracting the coding feature of the hidden layer and obtaining the word prediction sequence according to the coding feature specifically includes:
extracting forward coding features and reverse coding features of the hidden layer;
invoking an attention mechanism to the forward coding feature and the backward coding feature to obtain a feature vector of the apparent feature;
and calling the feature vector of the loss function to perform sequence decoding so as to obtain a predicted sequence of the word.
A second embodiment of the present invention provides a device for recognizing a lip language, including:
the lip image sequence acquisition unit is used for acquiring video data and processing the video data to acquire a lip image sequence;
the apparent feature generating unit is used for carrying out bidirectional time sequence feature extraction on the image sequence to generate apparent features of the lip image sequence;
the initialization unit is used for calling an LSTM model, performing boundary detection of shot transformation on the change of the apparent characteristics, generating a detection result, and initializing an implicit layer and a memory layer of the LSTM model according to the detection result;
and the word prediction sequence acquisition unit is used for extracting the coding characteristics of the hidden layer and acquiring the word prediction sequence according to the coding characteristics.
Preferably, the lip image sequence acquiring unit is specifically configured to:
performing data frame cutting on the video data to generate an image sequence;
a face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information;
and cutting the image sequence according to the target point information to obtain a lip image sequence.
Preferably, the apparent feature generation unit is specifically configured to:
calling a three-dimensional convolution network, and taking a lip image sequence in a preset format as an input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
Preferably, the initialization unit is specifically configured to:
the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
A third embodiment of the present invention provides a device for identifying a lip language, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor executing the computer program to implement a method for identifying a lip language according to any one of the above.
According to the lip language recognition method, device and equipment provided by the invention, the lip image sequence is obtained by processing long video data, and forward-sequence and reverse-sequence feature extraction is performed on the lip image sequence, so that the bidirectional lip sequence can better mine the time-sequence information in the sequence: the forward time sequence can capture global information, and the reverse time sequence can attend to very important local information, thereby extracting key information. In addition, after the forward and reverse feature codes are obtained, the features carrying key information are further selected by combining an attention mechanism. The algorithm complexity and time complexity are reduced while a high accuracy is maintained.
Drawings
Fig. 1 is a schematic flow chart of a method for recognizing lip language according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a processing procedure for recognizing lip language according to the present invention;
fig. 3 is a schematic block diagram of a device for recognizing lip language according to a second embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if determined" or "if (stated condition or event) is detected" may be interpreted as "when determined" or "in response to determination" or "when (stated condition or event) is detected" or "in response to detection of (stated condition or event)", depending on the context.
References to "first/second" in the embodiments are merely to distinguish similar objects and do not represent a particular ordering of the objects; it should be understood that "first/second" may be interchanged in a particular order or precedence where allowed, so that the embodiments described herein can be implemented in sequences other than those illustrated or described herein.
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention discloses a method, a device and equipment for recognizing a lip language, which aim to reduce the complexity and time complexity of the existing lip language recognition algorithm and simultaneously maintain higher accuracy.
Referring to fig. 1 and 2, a first embodiment of the present invention provides a method for recognizing a lip language, which may be performed by a device for recognizing a lip language (hereinafter referred to as a recognition device), and in particular, one or more processors in the recognition device, so as to implement the following steps:
s101, acquiring video data, and processing the video data to obtain a lip image sequence;
In this embodiment, the recognition device may be a terminal device (such as a smart phone, a smart printer or another smart device) configured with an image capturing apparatus for capturing video data. In particular, the recognition device may store data for performing the lip language recognition method, so as to recognize lip language from the video data.
In this embodiment, performing data frame slicing on the video data to generate an image sequence; wherein the video data may be long video data.
A face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information; the face detection model may be MTCNN.
And cutting the image sequence according to the target point information to obtain a lip image sequence.
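To illustrate the cropping step of S101, a minimal sketch is given below, assuming the face detection model supplies two mouth-corner key points per frame; the landmark values, the 64-pixel crop size and the dummy frames are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def crop_lip_region(frame, left_mouth, right_mouth, size=64):
    """Crop a square lip patch centered between the two mouth-corner
    landmarks returned by a face detector (e.g. MTCNN)."""
    cx = int((left_mouth[0] + right_mouth[0]) / 2)
    cy = int((left_mouth[1] + right_mouth[1]) / 2)
    half = size // 2
    h, w = frame.shape[:2]
    # clamp the crop window to the frame boundaries
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    return frame[y0:y1, x0:x1]

# the lip image sequence is then the per-frame crops stacked in time
frames = np.zeros((5, 120, 160, 3), dtype=np.uint8)   # dummy video frames
seq = np.stack([crop_lip_region(f, (70, 80), (90, 80)) for f in frames])
print(seq.shape)  # (5, 64, 64, 3)
```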
S102, carrying out bidirectional time sequence feature extraction on the image sequence to generate apparent features of the lip image sequence;
in the embodiment, a three-dimensional convolution network is called, and a lip image sequence in a preset format is used as input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
The feature coding network for the forward and reverse lip sequences may be a three-dimensional convolution network whose input is the lip image sequence in the preset format and whose output is a feature tensor (Tensor). The three-dimensional convolution network may be structured as a convolution layer of 64 three-dimensional convolution kernels, followed by batch normalization (Batch Normalization, BN) and rectified linear units (Rectified Linear Units, ReLU). The tensor output in this way is subjected to adaptive average pooling in the spatial dimension to generate the 1024-dimensional apparent features.
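The spatial adaptive average pooling step reduces each time step of the three-dimensional convolution output to a single feature vector. A numpy sketch, with the tensor sizes assumed for illustration:

```python
import numpy as np

def adaptive_avg_pool_spatial(x):
    """Average-pool a (T, C, H, W) feature tensor over its spatial
    dimensions, leaving one C-dimensional apparent feature per time step."""
    return x.mean(axis=(2, 3))            # -> (T, C)

# e.g. a 3D-CNN output with 1024 channels per frame (sizes assumed)
feat = np.random.rand(75, 1024, 4, 8)
apparent = adaptive_avg_pool_spatial(feat)
print(apparent.shape)  # (75, 1024)
```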
It should be noted that, the feature extraction process of the lip image sequence is divided into forward sequence feature extraction and reverse sequence feature extraction, and the main purposes are two, namely, firstly, the time sequence information in the sequence can be better mined through the bidirectional lip sequence, secondly, the forward time sequence information can capture global information, and the reverse time sequence can pay attention to very important local information, so as to play a role in extracting key information.
S103, invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features, generating a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result;
specifically: the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
Note that, in the LSTM model, the memory cell c_t keeps the history information up to the current time; the input gate i_t controls whether the current input is added into the memory cell; the forget gate f_t controls whether the memory cell c_{t-1} of the last moment is forgotten; and the output gate o_t controls whether the information of the current memory cell is output.
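The gate behaviour described above can be illustrated with a plain numpy LSTM step. This is a generic textbook cell, not the patent's exact parameterization; the sizes and the stacked-weight layout are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The memory cell c keeps the history up to the
    current time; the input gate i controls whether the current input is
    written to the cell, the forget gate f controls whether the previous
    cell c_prev is kept, and the output gate o controls what is emitted."""
    dh = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four gates in one affine map
    i = sigmoid(z[0 * dh:1 * dh])         # input gate
    f = sigmoid(z[1 * dh:2 * dh])         # forget gate
    o = sigmoid(z[2 * dh:3 * dh])         # output gate
    g = np.tanh(z[3 * dh:4 * dh])         # candidate cell content
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 1024, 256                     # sizes assumed for illustration
W = rng.standard_normal((4 * d_h, d_in)) * 0.01
U = rng.standard_normal((4 * d_h, d_h)) * 0.01
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # five frames of apparent features
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (256,) (256,)
```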
At each time step, a decision is made as to whether to pass the contents of the hidden layer and memory unit on to the next time instant or to initialize a new unit, depending on a time boundary detection unit that allows the encoder to independently process variable-length blocks of the input video. The boundary of each block is given by a learnable function that depends on the input, rather than being preset.
The value of the boundary detection function s_t is obtained by linearly combining the current input x_t, namely the lip sequence features, with the hidden layer h_{t-1} of the previous moment, followed by activation using the sigmoid function:

s_t = \sigma( v^T ( W_x x_t + W_h h_{t-1} + b ) )

where v is a learnable row vector, W_x and W_h are learnable weights, and b is the bias value.

If the value of s_t is 1, a shot transition is detected, and the hidden layer h_{t-1} and the memory cell c_{t-1} are not carried over. If the value of s_t is 0, the current frame and the previous and next frames are in the same video segment, and the hidden layer h_{t-1} and the memory cell c_{t-1} are updated as usual. According to the output value of the boundary detector s_t, before the memory cell is updated, i.e. when a new video clip starts, the network hidden layer and memory unit are transferred or reinitialized using the following equations:

h_{t-1} \leftarrow (1 - s_t) \cdot h_{t-1}
c_{t-1} \leftarrow (1 - s_t) \cdot c_{t-1}

where h_{t-1} is the hidden layer unit of the last moment and c_{t-1} is the memory cell of the last moment.
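The boundary-gated transfer can be sketched in numpy as follows. Hardening the sigmoid output to {0, 1} by rounding is an assumption made for illustration, and the weights here are random, carrying no trained meaning:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_gate(x_t, h_prev, c_prev, v, Wx, Wh, b):
    """Boundary detector s = sigmoid(v . (Wx x_t + Wh h_prev + b)).
    When s rounds to 1 (shot transition) the previous hidden state and
    memory cell are reinitialized to zero; otherwise they are passed on:
        h <- (1 - s) * h_prev,  c <- (1 - s) * c_prev."""
    s = sigmoid(v @ (Wx @ x_t + Wh @ h_prev + b))
    s = np.round(s)                       # harden to {0, 1} (step approximation)
    return (1 - s) * h_prev, (1 - s) * c_prev, s

rng = np.random.default_rng(0)
d_in, d_h = 1024, 256
x = rng.standard_normal(d_in)
h = rng.standard_normal(d_h)
c = rng.standard_normal(d_h)
v = rng.standard_normal(d_h)
Wx = rng.standard_normal((d_h, d_in)) * 0.01
Wh = rng.standard_normal((d_h, d_h)) * 0.01
b = np.zeros(d_h)

h2, c2, s = boundary_gate(x, h, c, v, Wx, Wh, b)
# s == 1 -> states reset to zero; s == 0 -> states passed through unchanged
assert (s == 1 and not h2.any()) or (s == 0 and np.allclose(h2, h))
```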
S104, extracting the coding features of the hidden layer, and acquiring word prediction sequences according to the coding features.
Specifically: extracting forward coding features and reverse coding features of the hidden layer;
invoking an attention mechanism to the forward coding feature and the backward coding feature to obtain a feature vector of the apparent feature;
and calling the feature vector of the loss function to perform sequence decoding so as to obtain a predicted sequence of the word.
After boundary detection has been performed on the features of the forward and reverse lip sequences, each LSTM unit maintains the current coding information, and the hidden layer units are extracted as the forward coding features h^f and the reverse coding features h^b. An attention mechanism is then employed to pick out the features carrying key information.
The forward-encoded features h^f_t pass through the attention mechanism as follows:

e_t = w^T tanh( W_1 h^f_t + b_1 )
\alpha_t = exp(e_t) / \sum_k exp(e_k)
c^f = \sum_t \alpha_t h^f_t

where w, W_1 and b_1 are all learnable parameters, e_t is the relative score, and \alpha_t is the attention weight.
Likewise, the reverse-encoded features h^b_t pass through the attention mechanism:

e_t = w^T tanh( W_1 h^b_t + b_1 )
\alpha_t = exp(e_t) / \sum_k exp(e_k)
c^b = \sum_t \alpha_t h^b_t

where w, W_1 and b_1 are all learnable parameters, e_t is the relative score, and \alpha_t is the attention weight.
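A minimal numpy sketch of such an attention pooling step, assuming the common additive form; the parameter names and sizes are illustrative:

```python
import numpy as np

def attention_pool(H, w, W1, b1):
    """Additive attention over a (T, d) sequence of coded features:
        e_t = w . tanh(W1 h_t + b1)      (relative score)
        a_t = softmax(e)_t               (attention weight)
        c   = sum_t a_t * h_t            (pooled feature vector)"""
    e = np.tanh(H @ W1.T + b1) @ w        # (T,) scores
    e = e - e.max()                       # numerical stability
    a = np.exp(e) / np.exp(e).sum()
    return a @ H, a

rng = np.random.default_rng(1)
T, d, da = 10, 256, 64                    # sequence length, feature dims (assumed)
H = rng.standard_normal((T, d))
w = rng.standard_normal(da)
W1 = rng.standard_normal((da, d))
b1 = rng.standard_normal(da)

ctx, att = attention_pool(H, w, W1, b1)
print(ctx.shape, round(att.sum(), 6))  # (256,) 1.0
```

The same pooling would be applied separately to the forward and reverse coded features, and the two pooled vectors combined before decoding.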
After the feature vector c is obtained, the result at each frame needs to be classified. In this embodiment the classification is performed using a full connection layer and SoftMax. In the training phase, the CTC loss is used for training; the CTC loss function is calculated as:

L_{CTC} = - ln \sum_{\pi \in B^{-1}(l)} \prod_{t=1}^{T} y^t_{\pi_t}

where T is the length of the input sequence, y^t_{\pi_t} is the SoftMax probability of outputting the label \pi_t at time t; \pi is a CTC path sequence; l is the ground truth (i.e. the label); and B^{-1}(l) denotes the set of CTC paths that can be mapped to the label l.
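The CTC loss can be checked on a toy example by brute-force path enumeration, which follows the formula directly: sum the probabilities of every length-T path that collapses to the label, then take the negative log. Enumerating paths is exponential and for illustration only; real implementations use the CTC forward-backward recursion. The probability table below is made up:

```python
import numpy as np
from itertools import product

def collapse(path, blank=0):
    """CTC mapping B: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return tuple(out)

def ctc_loss_bruteforce(probs, label, blank=0):
    """-log sum over all length-T paths pi with B(pi) == label of
    prod_t probs[t, pi_t]; probs is a (T, V) SoftMax output."""
    T, V = probs.shape
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path, blank) == tuple(label):
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return -np.log(total)

# tiny example: T=3 frames, vocab = {blank, 'a', 'b'}
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8]])
loss = ctc_loss_bruteforce(probs, label=[1, 2])   # target "ab"
print(round(float(loss), 4))  # 0.7154
```

Five paths collapse to "ab" here — (a,b,b), (a,a,b), (a,b,–), (–,a,b) and (a,–,b) — and their probabilities sum to 0.489, giving -ln(0.489).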
In the test phase, the decoding of the sequence is performed using CTC prefix beam search, resulting in the word prediction sequence.
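For illustration, a best-path (greedy) decode is sketched below; the embodiment itself uses CTC prefix beam search, which explores multiple prefixes instead of a single argmax path. The frame-wise probabilities are made up:

```python
import numpy as np

def ctc_greedy_decode(probs, blank=0):
    """Best-path approximation to CTC decoding: take the argmax label at
    each frame, merge repeated labels, then drop blanks."""
    best = probs.argmax(axis=1)
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(int(k))
        prev = k
    return out

probs = np.array([[0.1, 0.8, 0.1],    # frame-wise SoftMax outputs
                  [0.2, 0.7, 0.1],
                  [0.9, 0.05, 0.05],
                  [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(probs))  # [1, 2]
```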
Compared with the prior art, the embodiment has a plurality of advantages and beneficial effects, and is specifically embodied in the following aspects:
the embodiment is used for researching the problem of lip language identification in long videos, and is more close to practical application.
A boundary detector is designed to recognize shot transitions in long video, so that the long video can be segmented and lip language recognition performed on the segmented video clips.
This embodiment builds a bidirectional time-sequence model combined with an attention mechanism, and extracts key video frames through the forward and reverse video-frame time sequences together with the attention mechanism, thereby removing redundancy in the video and extracting discriminative features, which facilitates lip language recognition.
Referring to fig. 3, a second embodiment of the present invention provides a device for identifying a lip language, including:
a lip image sequence acquisition unit 201, configured to acquire video data, and process the video data to obtain a lip image sequence;
an apparent feature generating unit 202, configured to perform feature extraction of bidirectional time sequence on the image sequence, and generate apparent features of the lip image sequence;
an initialization unit 203, configured to invoke an LSTM model, perform boundary detection of shot transformation on the change of the apparent feature, generate a detection result, and initialize an implicit layer and a memory layer of the LSTM model according to the detection result;
a word prediction sequence obtaining unit 204, configured to extract coding features of the hidden layer, and obtain a word prediction sequence according to the coding features.
Preferably, the lip image sequence acquiring unit is specifically configured to:
performing data frame cutting on the video data to generate an image sequence;
a face detection model is called, face detection is carried out on the image sequence, and target point information is obtained, wherein the target point information comprises face calibration information and key point information;
and cutting the image sequence according to the target point information to obtain a lip image sequence.
Preferably, the apparent feature generation unit is specifically configured to:
calling a three-dimensional convolution network, and taking a lip image sequence in a preset format as an input of the three-dimensional convolution network;
and obtaining tensors output by the three-dimensional convolution network, and carrying out adaptive average pooling on the tensors in the space dimension to generate the apparent characteristics.
Preferably, the initialization unit is specifically configured to:
the apparent characteristics and the hidden layer at the previous moment are linearly combined and then input into a boundary detection function;
invoking a sigmoid function to activate the boundary detection function so as to generate a detection result;
judging, according to the detection result, whether the current frame and the previous and next frames are in the same video segment;
if yes, initializing an implicit layer and a memory layer of the LSTM model.
A third embodiment of the present invention provides a device for identifying a lip language, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor executing the computer program to implement a method for identifying a lip language according to any one of the above.
According to the lip language recognition method, device and equipment provided by the invention, the lip image sequence is obtained by processing long video data, and forward-sequence and reverse-sequence feature extraction is performed on the lip image sequence, so that the bidirectional lip sequence can better mine the time-sequence information in the sequence: the forward time sequence can capture global information, and the reverse time sequence can attend to very important local information, thereby extracting key information. In addition, after the forward and reverse feature codes are obtained, the features carrying key information are further selected by combining an attention mechanism. The algorithm complexity and time complexity are reduced while a high accuracy is maintained.
Illustratively, the computer programs described in the third and fourth embodiments of the present invention may be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program in the identification device implementing a lip language. For example, the device described in the second embodiment of the present invention.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general processor may be a microprocessor or the processor may be any conventional processor, etc., and the processor is a control center of the method for identifying a lip language, and uses various interfaces and lines to connect various parts of the entire method for implementing the method for identifying a lip language.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the lip language recognition method by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function, a text conversion function, etc.); the data storage area may store data created according to the use of the cellular phone (such as audio data, text message data, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The modules, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiment by instructing the relevant hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program can implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described apparatus embodiments are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationships between modules indicate that they have communication connections with each other, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
The present invention is not limited to the above-described embodiments; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A method for recognizing a lip language, comprising:
acquiring video data and processing the video data to obtain a lip image sequence;
performing bidirectional temporal feature extraction on the image sequence to generate apparent features of the lip image sequence, specifically: invoking a three-dimensional convolutional network and taking the lip image sequence in a preset format as the input of the three-dimensional convolutional network; acquiring the tensor output by the three-dimensional convolutional network and performing adaptive average pooling on the tensor over the spatial dimensions to generate the apparent features;
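As an illustration of the pooling step above, here is a minimal NumPy sketch; the (C, T, H, W) tensor layout, the shapes, and the function name are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def spatial_adaptive_avg_pool(tensor: np.ndarray) -> np.ndarray:
    """Collapse the spatial dimensions (H, W) of a 3D-CNN output
    tensor of assumed shape (C, T, H, W) by averaging, leaving one
    C-dimensional apparent-feature vector per frame: shape (T, C)."""
    pooled = tensor.mean(axis=(2, 3))  # average over H and W -> (C, T)
    return pooled.T                    # transpose to (T, C)

# Example: 64 channels, 25 frames, 12x24 lip crops (illustrative sizes)
features = spatial_adaptive_avg_pool(np.zeros((64, 25, 12, 24)))
print(features.shape)  # (25, 64)
```

In a deep-learning framework this corresponds to an adaptive average pooling layer with output size 1 over the spatial axes, followed by flattening.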
invoking an LSTM model, performing boundary detection of shot transitions on the change of the apparent features to generate a detection result, and initializing a hidden layer and a memory layer of the LSTM model according to the detection result, specifically comprising: linearly combining the apparent features with the hidden layer at the previous moment and inputting the result into a boundary detection function; invoking a sigmoid function to activate the boundary detection function so as to generate the detection result; judging, according to the detection result, whether the current frame and its preceding and following frames belong to the same video segment; and if so, initializing the hidden layer and memory layer of the LSTM model;
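The boundary-detection step can be sketched as a sigmoid gate over a linear combination of the current apparent feature and the previous hidden state; the weight names, the threshold value, and the reset-to-zeros policy below are illustrative assumptions, not details stated by the patent:

```python
import numpy as np

def boundary_gate(x_t, h_prev, w_x, w_h, b):
    """Linearly combine the current apparent feature x_t with the
    previous hidden state h_prev, then squash through a sigmoid to
    get a shot-boundary score in (0, 1)."""
    z = w_x @ x_t + w_h @ h_prev + b
    return 1.0 / (1.0 + np.exp(-z))

def maybe_reset_state(score, h, c, threshold=0.5):
    """Reinitialize the LSTM hidden (h) and memory/cell (c) state
    when the score indicates a shot boundary; threshold is assumed."""
    if score > threshold:
        return np.zeros_like(h), np.zeros_like(c)
    return h, c

rng = np.random.default_rng(0)
x_t, h_prev = rng.standard_normal(8), rng.standard_normal(4)
score = boundary_gate(x_t, h_prev, rng.standard_normal(8),
                      rng.standard_normal(4), 0.0)
h, c = maybe_reset_state(1.0, np.ones(4), np.ones(4))  # forced boundary
```

With a forced score of 1.0 the state is zeroed, so the LSTM starts fresh at the new video segment.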
extracting coding features of the hidden layer and obtaining a word prediction sequence according to the coding features, specifically:
extracting the forward coding features and the reverse coding features of the hidden layer;
applying an attention mechanism to the forward coding features and the reverse coding features to obtain a feature vector of the apparent features;
invoking a loss function to perform sequence decoding on the feature vector so as to obtain a predicted word sequence;
after the features of the forward lip sequence and the reverse lip sequence are subjected to boundary detection, each LSTM unit maintains the current coding information, and the hidden layer units are extracted as the forward coding feature $h^{f}_{t}$ and the reverse coding feature $h^{b}_{t}$; an attention mechanism is then employed to pick out the features carrying key information;
the forward-encoded features $h^{f}_{t}$ are passed through the attention mechanism as follows:
$e_{t}=w^{\top}\tanh\bigl(W h_{t-1}+V h^{f}_{t}+b\bigr),\qquad \alpha_{t}=\operatorname{softmax}(e_{t})$
wherein $W$, $V$, $b$ and $w$ are all learnable parameters, $e_{t}$ is the relative score, $\alpha_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment;
likewise, the reverse-encoded features $h^{b}_{t}$ are passed through the attention mechanism as follows:
$e'_{t}=w'^{\top}\tanh\bigl(W' h_{t-1}+V' h^{b}_{t}+b'\bigr),\qquad \alpha'_{t}=\operatorname{softmax}(e'_{t})$
wherein $W'$, $V'$, $b'$ and $w'$ are all learnable parameters, $e'_{t}$ is the relative score, $\alpha'_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment.
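The claim's attention formulas were published as inline images and did not survive extraction, so the additive-attention sketch below is an assumption consistent with the quantities the text names (learnable parameters $W$, $V$, $b$, $w$; a relative score; a softmax attention weight; the hidden unit at the previous moment), not the patent's verbatim equations:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def additive_attention(h_enc, h_prev, W, V, b, w):
    """Score each encoded feature h_enc[t] against the previous
    hidden state h_prev (relative score e_t), normalize the scores
    with softmax into attention weights alpha_t, and return the
    weighted sum of the encoded features as the feature vector."""
    scores = np.array([w @ np.tanh(W @ h_prev + V @ h_t + b)
                       for h_t in h_enc])
    alpha = softmax(scores)                        # attention weights
    context = (alpha[:, None] * h_enc).sum(axis=0) # feature vector
    return context, alpha

rng = np.random.default_rng(1)
T, d = 5, 4                                   # illustrative sizes
h_enc = rng.standard_normal((T, d))
ctx, alpha = additive_attention(h_enc, rng.standard_normal(d),
                                rng.standard_normal((d, d)),
                                rng.standard_normal((d, d)),
                                rng.standard_normal(d),
                                rng.standard_normal(d))
```

The same routine would be applied once to the forward coding features and once to the reverse coding features, with separate parameter sets.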
2. The method for recognizing a lip language according to claim 1, wherein the acquiring video data and processing the video data to obtain the lip image sequence specifically comprises:
splitting the video data into frames to generate an image sequence;
invoking a face detection model to perform face detection on the image sequence and obtain target point information, wherein the target point information comprises face calibration information and key point information;
and cropping the image sequence according to the target point information to obtain the lip image sequence.
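The cropping step of claim 2 can be sketched as slicing a bounding box around the mouth keypoints returned by a face-detection model; the coordinate layout, margin size, and function name are illustrative assumptions (a real system would use the keypoint format of its detector, e.g. a 68-point landmark layout):

```python
import numpy as np

def crop_lip_region(frame, mouth_points, margin=8):
    """Crop a lip patch from one frame (H x W array) using mouth
    keypoint (x, y) coordinates from a face-detection model.
    The margin padding is an illustrative choice, clipped to the
    frame borders."""
    xs = [p[0] for p in mouth_points]
    ys = [p[1] for p in mouth_points]
    h, w = frame.shape[:2]
    x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, w)
    y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, h)
    return frame[y0:y1, x0:x1]

frame = np.zeros((120, 160), dtype=np.uint8)   # dummy grayscale frame
lip = crop_lip_region(frame, [(70, 80), (90, 80), (80, 95)])
print(lip.shape)  # (31, 36)
```

Applying this per frame yields the lip image sequence that feeds the three-dimensional convolutional network.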
3. A lip language recognition apparatus, comprising:
the lip image sequence acquisition unit is used for acquiring video data and processing the video data to acquire a lip image sequence;
the apparent feature generating unit is used for performing bidirectional temporal feature extraction on the image sequence to generate apparent features of the lip image sequence, and is specifically used for: invoking a three-dimensional convolutional network and taking the lip image sequence in a preset format as the input of the three-dimensional convolutional network; acquiring the tensor output by the three-dimensional convolutional network and performing adaptive average pooling on the tensor over the spatial dimensions to generate the apparent features;
the initialization unit is used for invoking the LSTM model, performing boundary detection of shot transitions on the change of the apparent features to generate a detection result, and initializing the hidden layer and memory layer of the LSTM model according to the detection result, and is specifically used for: linearly combining the apparent features with the hidden layer at the previous moment and inputting the result into a boundary detection function; invoking a sigmoid function to activate the boundary detection function so as to generate the detection result; judging, according to the detection result, whether the current frame and its preceding and following frames belong to the same video segment; and if so, initializing the hidden layer and memory layer of the LSTM model;
the word prediction sequence obtaining unit is used for extracting the coding features of the hidden layer and obtaining a word prediction sequence according to the coding features, specifically:
extracting the forward coding features and the reverse coding features of the hidden layer;
applying an attention mechanism to the forward coding features and the reverse coding features to obtain a feature vector of the apparent features;
invoking a loss function to perform sequence decoding on the feature vector so as to obtain a predicted word sequence;
it should be noted that, after the features of the forward lip sequence and the reverse lip sequence are subjected to boundary detection, each LSTM unit keeps the current coding information, and the hidden layer units are extracted as the forward coding feature $h^{f}_{t}$ and the reverse coding feature $h^{b}_{t}$; an attention mechanism is then employed to pick out the features carrying key information;
the forward-encoded features $h^{f}_{t}$ are passed through the attention mechanism as follows:
$e_{t}=w^{\top}\tanh\bigl(W h_{t-1}+V h^{f}_{t}+b\bigr),\qquad \alpha_{t}=\operatorname{softmax}(e_{t})$
wherein $W$, $V$, $b$ and $w$ are all learnable parameters, $e_{t}$ is the relative score, $\alpha_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment;
likewise, the reverse-encoded features $h^{b}_{t}$ are passed through the attention mechanism as follows:
$e'_{t}=w'^{\top}\tanh\bigl(W' h_{t-1}+V' h^{b}_{t}+b'\bigr),\qquad \alpha'_{t}=\operatorname{softmax}(e'_{t})$
wherein $W'$, $V'$, $b'$ and $w'$ are all learnable parameters, $e'_{t}$ is the relative score, $\alpha'_{t}$ is the attention weight, and $h_{t-1}$ is the hidden layer unit at the previous moment.
4. The lip language recognition apparatus according to claim 3, wherein the lip image sequence acquisition unit is specifically configured to:
split the video data into frames to generate an image sequence;
invoke a face detection model to perform face detection on the image sequence and obtain target point information, wherein the target point information comprises face calibration information and key point information;
and crop the image sequence according to the target point information to obtain the lip image sequence.
5. A device for recognizing a lip language, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for recognizing a lip language according to claim 1 or 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110843573.0A CN113642420B (en) | 2021-07-26 | 2021-07-26 | Method, device and equipment for recognizing lip language |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113642420A CN113642420A (en) | 2021-11-12 |
CN113642420B true CN113642420B (en) | 2024-04-16 |
Family
ID=78418314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110843573.0A Active CN113642420B (en) | 2021-07-26 | 2021-07-26 | Method, device and equipment for recognizing lip language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642420B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409195A (en) * | 2018-08-30 | 2019-03-01 | 华侨大学 | A neural network-based lip reading recognition method and system
CN110443129A (en) * | 2019-06-30 | 2019-11-12 | 厦门知晓物联技术服务有限公司 | Chinese lip reading recognition method based on deep learning
CN110633683A (en) * | 2019-09-19 | 2019-12-31 | 华侨大学 | Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN111178157A (en) * | 2019-12-10 | 2020-05-19 | 浙江大学 | Tone-based cascaded sequence-to-sequence model for Chinese lip language recognition
Non-Patent Citations (1)
Title |
---|
Research on Key Technologies for Social Relationship Extraction of Persons in Videos; Lü Jinna; China Doctoral Dissertations Full-text Database, Information Science and Technology (No. 8); 61-74 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111164601B (en) | Emotion recognition method, intelligent device and computer readable storage medium | |
CN110188829B (en) | Neural network training method, target recognition method and related products | |
CN109117777B (en) | Method and device for generating information | |
CN110119757B (en) | Model training method, video category detection method, device, electronic equipment and computer readable medium | |
CN110602527A (en) | Video processing method, device and storage medium | |
CN109413510B (en) | Video abstract generation method and device, electronic equipment and computer storage medium | |
CN111914076B (en) | User image construction method, system, terminal and storage medium based on man-machine conversation | |
KR20170026222A (en) | Method and device for classifying an object of an image and corresponding computer program product and computer-readable medium | |
Qiao et al. | Hidden markov model based dynamic texture classification | |
CN109389076B (en) | Image segmentation method and device | |
CN111108508B (en) | Face emotion recognition method, intelligent device and computer readable storage medium | |
CN112804558B (en) | Video splitting method, device and equipment | |
CN114239717A (en) | Model training method, image processing method and device, electronic device and medium | |
CN114723646A (en) | Image data generation method with label, device, storage medium and electronic equipment | |
CN111986204A (en) | Polyp segmentation method and device and storage medium | |
CN111652878B (en) | Image detection method, image detection device, computer equipment and storage medium | |
CN113689527B (en) | Training method of face conversion model and face image conversion method | |
CN113642420B (en) | Method, device and equipment for recognizing lip language | |
CN117218346A (en) | Image generation method, device, computer readable storage medium and computer equipment | |
CN116844006A (en) | Target identification method and device, electronic equipment and readable storage medium | |
CN112487903B (en) | Gait data generation method and device based on countermeasure network | |
CN114495230A (en) | Expression recognition method and device, electronic equipment and storage medium | |
CN112989869A (en) | Optimization method, device and equipment of face quality detection model and storage medium | |
CN112115740A (en) | Method and apparatus for processing image | |
CN116049446B (en) | Event extraction method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||