CN113658582B - Lip language identification method and system for audio-visual collaboration - Google Patents

Lip language identification method and system for audio-visual collaboration

Info

Publication number
CN113658582B
Authority
CN
China
Prior art keywords
audio
visual
data
category
lip
Prior art date
Legal status
Active
Application number
CN202110800963.XA
Other languages
Chinese (zh)
Other versions
CN113658582A (en)
Inventor
杨双 (Yang Shuang)
罗明双 (Luo Mingshuang)
山世光 (Shan Shiguang)
陈熙霖 (Chen Xilin)
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110800963.XA
Publication of CN113658582A
Application granted
Publication of CN113658582B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an audio-visual collaborative lip language identification method and system based on metric learning at three levels: visual-visual, audio-audio, and visual-audio. The three metric-learning mechanisms are carried out simultaneously, which shortens the training time, reduces the number of training stages, and enables better collaborative learning between the visual and audio modalities. With the help of the audio information, the visual model can extract more discriminative features, thereby improving the performance of the lip language identification model.

Description

Lip language identification method and system for audio-visual collaboration
Technical Field
The present invention relates to the field of speech recognition and computer vision, and in particular to visual speech recognition and lip language recognition.
Background
Lip recognition, also known as visual speech recognition, refers to a technique for interpreting the content of an utterance by observing the movements of the speaker's face and lips while speaking. The technology can serve as a supplement to audio-based speech recognition, compensating for the shortcomings of audio-based speech recognition models in high-noise and similar environments, and it can also be used on its own in silent environments to efficiently convey spoken content, so it has great application value in human-computer interaction systems. Meanwhile, in recent years, with the advent of large-scale lip language recognition data sets and the wide application of deep learning techniques in computer vision, natural language processing, and related fields, lip language recognition has gradually gained widespread attention and has started to play a role in real-world scenarios.
Lip language identification in real-world scenarios generally faces the problem of large-dictionary, low-resource recognition: when the spoken content is judged from visual information alone, the target language content covers a large dictionary and a wide range of vocabulary. Therefore, lip language recognition in real-world scenarios faces, on the one hand, the challenges of conventional lip language recognition, such as the difficulty of extracting fine-grained visual features, the difficulty of modeling the temporal dimension, large variations in speaker pose and expression, and complex and changeable external illumination; on the other hand, it faces the conditions brought by the large dictionary unique to this task, such as easy confusion among different words, a large number of short words, and difficulty of recognition.
At present, most lip language identification methods are visual single-modality, strongly supervised methods based on large-scale labeled data and deep neural network frameworks. They perform well on some public data sets, but still have many shortcomings and limitations. On the one hand, acquiring labeled data at scale consumes a large amount of manpower and material resources: accurate lip language identification data usually requires strict start and end time stamps, and a large amount of data must be annotated, which makes this step very tedious and difficult. On the other hand, when a person speaks, sound and lip movement are produced synchronously; therefore, using speech data to improve the performance of the lip recognition model is a valuable direction, yet existing methods rarely use speech data to aid the vision-based learning of the lip recognition model.
In research on lip language identification, the task, like other video classification tasks, hinges on extracting the most discriminative features from each lip reading video. However, lip recognition faces challenges that differ from other video classification tasks.
First, lip recognition is a very challenging task: the lip region is very limited, the corresponding speaking content is highly diverse, and the objective environment in which speaking takes place is complex and variable, including external lighting, the speaker's pose, and so on; these factors together make lip recognition very difficult.
Secondly, labeling lip recognition data is also a very cumbersome task, requiring not only strict start and stop time stamps but also the corresponding text labels. Most importantly, in a large-dictionary setting, visual samples of some different word classes (such as 'WHICH' and 'WHILE') are noticeably similar, which places higher demands on the discriminative capability of the model.
In addition, samples of the same word class can vary greatly due to factors such as speaking speed and speaker appearance (including pose, age, make-up, personal habits, etc.). These problems further increase the difficulty of lip recognition, because limited training data can hardly cover samples of all the different situations. Existing lip language identification methods are basically built on large-scale lip language identification data sets, and little consideration has been given to how to obtain a better-performing model for the lip language identification task under low-resource conditions.
At the same time, we note that speaking produces not only lip motion but also audio generated simultaneously with it; the audio features (e.g., MFCC, Fbank, etc.) are often more discriminative, and the visual and audio data require no additional alignment.
Disclosure of Invention
The invention aims to enable a lip language identification model to learn discriminative representation features under the condition of a limited amount of data, and therefore provides a lip language identification method based on an audio-visual collaborative training mechanism.
Aiming at the defects of the prior art, the invention provides a lip language identification method with audio-visual cooperation, which comprises the following steps:
Step 1, acquiring a lip language identification data set containing speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
Step 2, randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
Step 3, for each training iteration, inputting the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
Step 4, calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
Step 5, calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; constructing a final loss function from the cross-modal loss functions and the single-modal loss functions;
Step 6, cyclically executing steps 2 to 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result, as sketched in the overview code below.
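For illustration only, the following is a minimal PyTorch-style sketch of the episodic training loop described in steps 2 to 6. The names sample_episode, f_v2a, g_a2v, and avs_loss are hypothetical placeholders for the episode sampler, the two cross-modal mapping functions, and the combined loss; possible realizations of these pieces are sketched in the detailed description below, and the optimizer and hyperparameter values are assumptions of this example.

# Minimal PyTorch-style sketch of the episodic pre-training loop in steps 2-6.
# sample_episode, f_v2a, g_a2v and avs_loss are hypothetical placeholders.
import torch

def pretrain(dataset, visual_encoder, audio_encoder, f_v2a, g_a2v, n_way=5, n_iters=10000):
    modules = [visual_encoder, audio_encoder, f_v2a, g_a2v]
    params = [p for m in modules for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=1e-4)            # optimizer and lr are assumptions
    for _ in range(n_iters):                                 # step 6: loop until convergence
        # Step 2: draw N label categories, split each into support and query sets.
        sup_v, sup_a, s_lab, qry_v, qry_a, q_lab = sample_episode(dataset, n_way)
        # Step 3: prototype representations from the visual and audio encoders.
        sv, qv = visual_encoder(sup_v), visual_encoder(qry_v)
        sa, qa = audio_encoder(sup_a), audio_encoder(qry_a)
        # Steps 4-5: cross-modal and single-modal metric losses combined by weighted sum.
        loss = avs_loss(sv, sa, qv, qa, s_lab, q_lab, f_v2a, g_a2v, n_way)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return visual_encoder                                    # parameters saved for fine-tuning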
The audio-visual collaborative lip language identification method, wherein the step 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
The audio-visual collaborative lip language identification method, wherein the step 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance;
calculating the optimization loss function of the audio-visual cross-modal collaborative training; according to the above process, the cross-modal loss functions of the encoder models of the two audio-visual modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The audio-visual collaborative lip language identification method, wherein the step 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
Any of the audio-visual collaborative lip language identification methods, wherein the step 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
The invention also provides a lip language identification system with audio-visual cooperation, which comprises:
The module 1 is used for acquiring a lip language identification data set containing the speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
The module 2 is used for randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
A module 3, configured to, for each training iteration, input the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
The module 4 is used for calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
A module 5 for calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; constructing a final loss function from the cross-modal loss functions and the single-modal loss functions;
A module 6 for cyclically executing the module 2 to the module 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
The audio-visual collaborative lip language recognition system, wherein the module 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
The audio-visual collaborative lip language recognition system, wherein the module 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance;
calculating the optimization loss function of the audio-visual cross-modal collaborative training; according to the above process, the cross-modal loss functions of the encoder models of the two audio-visual modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The audio-visual collaborative lip language recognition system, wherein the module 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
Any of the audio-visual collaborative lip language recognition systems, wherein the module 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
The advantages of the invention are as follows:
Compared with the prior art, the invention not only exploits the natural temporal synchrony and content consistency between audio and vision to realize audio-assisted visual learning through metric learning at different levels, but also introduces a pre-training-based learning mechanism, thereby maximizing the feature extraction capability of the visual model, further improving performance on the lip recognition task, and achieving the best classification performance on two typical large-dictionary lip recognition data sets (LRW and LRW-1000).
Drawings
FIG. 1 is a flow chart of a lip language identification method based on audio-visual collaborative pre-training according to the invention;
fig. 2 is a schematic view of a lip region cut.
Detailed Description
The method exploits the natural synchronization between audio and video and provides an audio-visual collaborative learning scheme for lip language recognition. Three levels of metric learning are designed (visual-visual, audio-audio, and visual-audio), and the three metric-learning mechanisms are carried out simultaneously, which shortens the training time, reduces the number of training stages, and enables better collaborative learning between the visual and audio modalities. With the help of the audio information, the visual model can extract more discriminative features, thereby improving the performance of the lip language identification model. The invention comprises the following key technical points:
Key point 1: the invention provides an audio-visual collaborative learning mechanism that uses audio to assist the learning of the visual model, and designs metric learning at three levels (audio-audio, visual-visual, and audio-visual) so that the model can perform intra-modal and inter-modal feature representation learning simultaneously; this scheme also simplifies the overall audio-visual collaborative learning procedure;
Key point 2: a pre-training-then-fine-tuning learning mechanism is introduced to handle the lip language recognition task: first, audio-visual collaborative pre-training is performed so that the model can fully learn to distinguish and compare samples of different classes, and then fine-tuning is performed on the final recognition task on this basis;
Key point 3: the audio-visual collaborative learning mechanism and the pre-training-based learning mechanism are organically combined, which not only realizes the assistance of audio to the visual model, but also enables the model to better distinguish the differences among samples of different classes, and thus to accomplish the lip language identification task.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a lip language identification method based on audio-visual collaborative pre-training, and the following details of each step involved in the technical scheme of the present invention are given:
Step 1: visual modality data processing:
(1) Detecting the face position of a speaker in each frame by using a face detector;
(2) Detecting key points of the detected face;
(3) Cutting out the lip region according to the detected facial key points such as the nose tip, the two mouth corners, and the chin, so that the cropped region is complete and effective, as shown in the dashed box in Fig. 2;
(4) Converting each cropped lip-region picture into a grayscale picture; a cropping and grayscale-conversion sketch is given below.
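For illustration, the following is a minimal Python sketch of steps (3) and (4), assuming the facial key points have already been obtained from the face detector and landmark model in steps (1) and (2). The function name, the margin value, and the use of OpenCV are choices made here for the example and are not prescribed by the patent.

import cv2
import numpy as np

def crop_lip_region(frame_bgr, nose_tip, mouth_left, mouth_right, chin, margin=0.15):
    """Crop a lip-centered box from facial key points given as (x, y) pixel coordinates."""
    pts = np.array([nose_tip, mouth_left, mouth_right, chin], dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    # Expand the box slightly so the cropped region stays complete and effective.
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    x0, y0 = int(max(x_min - dx, 0)), int(max(y_min - dy, 0))
    x1 = int(min(x_max + dx, frame_bgr.shape[1]))
    y1 = int(min(y_max + dy, frame_bgr.shape[0]))
    crop = frame_bgr[y0:y1, x0:x1]
    # Step (4): convert the cropped lip-region picture to grayscale.
    return cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)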
step 2: audio modality data processing
(1) Sampling the audio waveform at a certain sampling rate; for example, a sampling rate of 16 kHz can be adopted;
(2) Converting the sampled audio data into a d-dimensional MFCC feature sequence using the Kaldi toolkit or a similar library; see the sketch after this step.
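As an illustrative sketch of step 2, the Kaldi-compatible MFCC frontend of torchaudio can be used in place of the Kaldi library itself; the 16 kHz rate follows the example above, while the 13-dimensional default for num_ceps is an assumption standing in for the d-dimensional feature.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_mfcc(wav_path, num_ceps=13):
    """Load audio, resample to 16 kHz, and compute a Kaldi-compatible MFCC sequence."""
    waveform, sample_rate = torchaudio.load(wav_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    # One MFCC vector per 10 ms frame; num_ceps plays the role of the d-dimensional feature.
    return kaldi.mfcc(waveform, sample_frequency=16000.0, num_ceps=num_ceps)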
Step 3: Visual encoder design
(1) Designing an ordinary 3D convolution layer (i.e., a layer consisting of a single 3D convolution operation) and a ResNet module, and cascading them as a frame-level feature extractor serving as the encoder front-end;
(2) Designing a 3-layer Transformer structure as a sequence-level feature extractor for temporal modeling of the sequence features obtained by concatenating the frame-level features from the encoder front-end;
(3) Cascading the above frame-level structure and sequence-level structure as the visual encoder E_{θ^{(v)}};
Step 4: Audio encoder design
Similar to step 3 (2), a 3-layer Transformer structure is designed as the audio encoder E_{θ^{(a)}}; a sketch of both encoders is given below.
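The following PyTorch sketch illustrates one possible realization of the two encoders described in steps 3 and 4. The choice of ResNet-18, the feature dimension of 512, the number of attention heads, the 3D-convolution kernel sizes, and the omission of positional encoding are assumptions of this example, not specifications of the patent.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualEncoder(nn.Module):
    """Front-end: one 3D conv layer + ResNet (ResNet-18 assumed); back-end: 3-layer Transformer."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(), nn.MaxPool3d((1, 3, 3), (1, 2, 2), (0, 1, 1)))
        trunk = resnet18(num_classes=d_model)
        # The 3D front-end already outputs 64 channels, so the ResNet stem is adapted.
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.frame_net = trunk
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.seq_net = nn.TransformerEncoder(layer, num_layers=3)  # positional encoding omitted

    def forward(self, x):                        # x: (B, 1, T, H, W) grayscale lip clips
        f = self.conv3d(x)                       # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.frame_net(f).reshape(b, t, -1)  # frame-level features, (B, T, d_model)
        return self.seq_net(f)[:, -1]            # last time step as the sequence representation

class AudioEncoder(nn.Module):
    """3-layer Transformer over the MFCC sequence; the input is projected to d_model."""
    def __init__(self, n_mfcc=13, d_model=512, nhead=8):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.seq_net = nn.TransformerEncoder(layer, num_layers=3)

    def forward(self, a):                        # a: (B, T, n_mfcc)
        return self.seq_net(self.proj(a))[:, -1]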
Step 5: Training data sampling:
(1) The audio-visual modality pair of a video sample is taken as one sample;
(2) For each model iteration, several different label categories (Chinese phrases or English words) are randomly drawn from the large-dictionary data set; assuming the total number of drawn categories is N, the drawn samples of each category are randomly split to construct a support set (S_t) and a query set (Q_t) for each category. The support set consists of samples of the same label category and is used to compute the center vector of that category; the query set is used to optimize the model in a metric-based manner: each query sample is measured against every center vector so that its distance to the center vector of its own category becomes smaller and its distance to the center vectors of other categories becomes larger;
Step 6: Repeat step 5 to generate multiple batches of training data; a sampling sketch is given below.
Step 7: Audio-visual cross-modal collaborative pre-training;
(1) For each training iteration, the randomly sampled visual support set S_t^{(v)}, audio support set S_t^{(a)}, visual query set Q_t^{(v)}, and audio query set Q_t^{(a)} are fed into the visual encoder and the audio encoder respectively, to obtain the visual and audio prototype representations of each sample (i.e., the point of each sample in the embedding space); the output vector of the last time step after each sequence passes through the visual or audio encoder is taken as the representation of the whole input sequence, namely:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the video or audio sequence data used as input, θ^{(v)} represents the visual encoder parameters, θ^{(a)} represents the audio encoder parameters, and E denotes an encoder model;
(2) The centers of the prototype representations of each category in the support sets of the two modalities are computed respectively, giving the center representation of each category:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
In the above, c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X.
(3) For each query set sample x_q, the distances between the query data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category are computed, and the probability that the query data belongs to category n is calculated from these distances, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
In the above, f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance.
(4) The optimization loss function of the audio-visual cross-modal collaborative training is computed; according to the above process, the cross-modal optimization loss functions of the encoder models of the two modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
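The following sketch illustrates the computations in (2)-(4), assuming the softmax-over-negative-Euclidean-distance form given above. The helper names and the arguments f_v2a and g_a2v (standing for the mapping functions f_{v→a} and g_{a→v}, whose architecture the patent does not specify, e.g. small linear layers) are introduced only for this example.

import torch
import torch.nn.functional as F

def prototypes(features, labels, n_way):
    """Class centers: mean of the support-set prototype representations per class."""
    return torch.stack([features[labels == n].mean(dim=0) for n in range(n_way)])

def proto_loss(query_feats, query_labels, centers):
    """Softmax over negative Euclidean distances to the class centers, optimized as the
    negative log-likelihood of the correct class."""
    dists = torch.cdist(query_feats, centers)          # (num_query, n_way)
    return F.cross_entropy(-dists, query_labels)       # logits = negative distances

def cross_modal_losses(sv, sa, qv, qa, s_labels, q_labels, f_v2a, g_a2v, n_way):
    """L_{v->a}: visual queries mapped into audio space vs. audio centers;
    L_{a->v}: audio queries mapped into visual space vs. visual centers."""
    c_v = prototypes(sv, s_labels, n_way)
    c_a = prototypes(sa, s_labels, n_way)
    loss_v2a = proto_loss(f_v2a(qv), q_labels, c_a)
    loss_a2v = proto_loss(g_a2v(qa), q_labels, c_v)
    return loss_v2a, loss_a2v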
Step 8: audio-visual single-mode metric learning
(1) In parallel with step 7, metric learning is also performed on the prototype representations of the different classes within each single modality (audio and visual). On the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, the distance of each query set sample x_q to the support set center representation of each category is computed, and the probability of belonging to category n is calculated as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
In the above formula, the meanings of the symbols are consistent with the above.
(2) The metric-learning-based optimization loss function within each of the two modalities is computed, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q, in the following manner:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
In the above formula, the meanings of the symbols are consistent with the above.
(3) The loss function of the final audio-visual collaborative training stage is obtained by combining the audio-visual cross-modal collaborative training loss functions and the single-modal metric learning loss functions, in the following manner:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
In the above formula, λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
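Reusing the helper functions from the previous sketch, the total pre-training loss L_AVS can be assembled as follows; the default weight values are placeholders, since the patent leaves λ_v, λ_a, λ_{a→v}, and λ_{v→a} as hyperparameters.

def avs_loss(sv, sa, qv, qa, s_labels, q_labels, f_v2a, g_a2v, n_way,
             lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the two single-modal prototypical losses and the two cross-modal losses."""
    lam_v, lam_a, lam_a2v, lam_v2a = lambdas
    c_v = prototypes(sv, s_labels, n_way)               # visual class centers
    c_a = prototypes(sa, s_labels, n_way)               # audio class centers
    loss_v = proto_loss(qv, q_labels, c_v)              # L_v
    loss_a = proto_loss(qa, q_labels, c_a)              # L_a
    loss_v2a, loss_a2v = cross_modal_losses(sv, sa, qv, qa, s_labels, q_labels,
                                            f_v2a, g_a2v, n_way)
    return lam_v * loss_v + lam_a * loss_a + lam_a2v * loss_a2v + lam_v2a * loss_v2a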
Step 9: original large dictionary data fine tuning
(1) Connecting a classifier constructed by a full-connection layer at the output end of the visual encoder so as to form a final lip language identification model structure;
(2) Loading the visual encoder model parameters obtained through the pre-training in the steps 7 and 8;
(3) Classification fine-tuning is performed on the newly obtained classification model using the full-category visual data of the large dictionary, and the model is optimized with the cross entropy loss; a fine-tuning sketch is given below;
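A minimal fine-tuning sketch for step 9 follows; the classifier dimension, optimizer, learning rate, and epoch count are assumptions of this example.

import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Final recognition model: pre-trained visual encoder + fully connected classifier."""
    def __init__(self, visual_encoder, num_classes, d_model=512):
        super().__init__()
        self.encoder = visual_encoder
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, video):
        return self.classifier(self.encoder(video))

def finetune(model, loader, epochs=10, lr=3e-4):
    """Step 9: classification fine-tuning on the full large-dictionary visual data
    with cross-entropy loss (optimizer choice and schedule are assumptions)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for video, label in loader:
            loss = ce(model(video), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Step 10 (testing): prediction = model(video_clip).argmax(dim=-1); only visual data is needed.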
Step 10: Lip language identification model test:
After the above steps, a trained lip language recognition model is obtained; during testing, only the sequence data of the visual modality needs to be input to obtain the final recognition result.
The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a lip language identification system with audio-visual cooperation, which comprises:
The module 1 is used for acquiring a lip language identification data set containing the speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
The module 2 is used for randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
A module 3, configured to, for each training iteration, input the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
The module 4 is used for calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
A module 5 for calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; constructing a final loss function from the cross-modal loss functions and the single-modal loss functions;
A module 6 for cyclically executing the module 2 to the module 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
The audio-visual collaborative lip language recognition system, wherein the module 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
The audio-visual collaborative lip language recognition system, wherein the module 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance;
calculating the optimization loss function of the audio-visual cross-modal collaborative training; according to the above process, the cross-modal loss functions of the encoder models of the two audio-visual modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The audio-visual collaborative lip language recognition system, wherein the module 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
Any of the audio-visual collaborative lip language recognition systems, wherein the module 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.

Claims (10)

1. A lip language identification method of audio-visual cooperation is characterized by comprising the following steps:
Step 1, acquiring a lip language identification data set containing speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
Step 2, randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
Step 3, for each training iteration, inputting the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
Step 4, calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
Step 5, calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; weighting and summing the cross-modal loss functions and the single-modal loss functions to obtain a final loss function;
Step 6, cyclically executing steps 2 to 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
2. The audio-visual collaborative lip recognition method according to claim 1, wherein the step 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
3. The audio-visual collaborative lip recognition method according to claim 2, wherein the step 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between the two vectors;
calculating the optimization loss function of the audio-visual cross-modal collaborative training, wherein the cross-modal loss functions of the encoder models of the two audio-visual modalities are expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
4. The audio-visual collaborative lip recognition method according to claim 3, wherein the step 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
5. The audio-visual collaborative lip recognition method according to any one of claims 1 to 4, wherein the step 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
6. An audio-visual collaborative lip language identification system, comprising:
The module 1 is used for acquiring a lip language identification data set containing the speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
The module 2 is used for randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
A module 3, configured to, for each training iteration, input the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
The module 4 is used for calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
A module 5 for calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; weighting and summing the cross-modal loss functions and the single-modal loss functions to obtain a final loss function;
A module 6 for cyclically executing the module 2 to the module 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
7. The audio-visual collaborative lip recognition system of claim 6, wherein the module 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
8. The audio-visual collaborative lip recognition system of claim 7, wherein the module 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between the two vectors;
calculating the optimization loss function of the audio-visual cross-modal collaborative training, wherein the cross-modal loss functions of the encoder models of the two audio-visual modalities are expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
9. The audio-visual collaborative lip recognition system of claim 8, wherein the module 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
10. The audio-visual collaborative lip recognition system of any one of claims 6 to 9, wherein the module 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
CN202110800963.XA 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration Active CN113658582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110800963.XA CN113658582B (en) 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110800963.XA CN113658582B (en) 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration

Publications (2)

Publication Number Publication Date
CN113658582A CN113658582A (en) 2021-11-16
CN113658582B true CN113658582B (en) 2024-05-07

Family

ID=78489461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110800963.XA Active CN113658582B (en) 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration

Country Status (1)

Country Link
CN (1) CN113658582B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116828129B (en) * 2023-08-25 2023-11-03 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN111223483A (en) * 2019-12-10 2020-06-02 浙江大学 Lip language identification method based on multi-granularity knowledge distillation
CN111753704A (en) * 2020-06-19 2020-10-09 南京邮电大学 Time sequence centralized prediction method based on video character lip reading recognition
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition
CN112397074A (en) * 2020-11-05 2021-02-23 桂林电子科技大学 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
WO2021135509A1 (en) * 2019-12-30 2021-07-08 腾讯科技(深圳)有限公司 Image processing method and apparatus, electronic device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060009978A1 (en) * 2004-07-02 2006-01-12 The Regents Of The University Of Colorado Methods and systems for synthesis of accurate visible speech via transformation of motion capture data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN111223483A (en) * 2019-12-10 2020-06-02 浙江大学 Lip language identification method based on multi-granularity knowledge distillation
WO2021135509A1 (en) * 2019-12-30 2021-07-08 腾讯科技(深圳)有限公司 Image processing method and apparatus, electronic device, and storage medium
CN111753704A (en) * 2020-06-19 2020-10-09 南京邮电大学 Time sequence centralized prediction method based on video character lip reading recognition
CN112397074A (en) * 2020-11-05 2021-02-23 桂林电子科技大学 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals; Shah Nawaz; 2019 Digital Image Computing: Techniques and Applications (DICTA); pp. 1-7 *

Also Published As

Publication number Publication date
CN113658582A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN101101752B (en) Monosyllabic language lip-reading recognition system based on vision character
Gao et al. Transition movement models for large vocabulary continuous sign language recognition
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
Sainath et al. Exemplar-based processing for speech recognition: An overview
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
Peng et al. Recognition of handwritten Chinese text by segmentation: a segment-annotation-free approach
CN105575388A (en) Emotional speech processing
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN109377981B (en) Phoneme alignment method and device
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN112749646A (en) Interactive point-reading system based on gesture recognition
CN111554279A (en) Multi-mode man-machine interaction system based on Kinect
CN113658582B (en) Lip language identification method and system for audio-visual collaboration
Wang et al. WaveNet with cross-attention for audiovisual speech recognition
Ballard et al. A multimodal learning interface for word acquisition
Medjkoune et al. Combining speech and handwriting modalities for mathematical expression recognition
You et al. Manifolds based emotion recognition in speech
Yang et al. Sign language recognition system based on weighted hidden Markov model
CN115145402A (en) Intelligent toy system with network interaction function and control method
CN114171007A (en) System and method for aligning virtual human mouth shapes
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
CN112395414B (en) Text classification method, training method of classification model, training device of classification model, medium and training equipment
Melnyk et al. Towards computer assisted international sign language recognition system: a systematic survey
Huang et al. Phone classification via manifold learning based dimensionality reduction algorithms
Choudhury et al. Review of Various Machine Learning and Deep Learning Techniques for Audio Visual Automatic Speech Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant