CN113658582B - Lip language identification method and system for audio-visual collaboration - Google Patents

Lip language identification method and system for audio-visual collaboration

Info

Publication number
CN113658582B
Authority
CN
China
Prior art keywords
audio
visual
data
category
lip
Prior art date
Legal status
Active
Application number
CN202110800963.XA
Other languages
Chinese (zh)
Other versions
CN113658582A (en)
Inventor
杨双 (Yang Shuang)
罗明双 (Luo Mingshuang)
山世光 (Shan Shiguang)
陈熙霖 (Chen Xilin)
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110800963.XA
Publication of CN113658582A
Application granted
Publication of CN113658582B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an audio-visual collaborative lip language identification method and system based on metric learning at three levels: visual-visual, audio-audio, and visual-audio. The three metric-learning mechanisms are carried out simultaneously, which shortens the training time, reduces the number of training stages, and enables better collaborative learning between the visual and audio modalities. With the help of the audio information, the visual model can extract more discriminative features, thereby improving the performance of the lip language identification model.

Description

Lip language identification method and system for audio-visual collaboration
Technical Field
The present invention relates to the field of speech recognition and computer vision, and in particular to visual speech recognition and lip language recognition.
Background
Lip recognition, also known as visual speech recognition, refers to a technique for interpreting the content of an utterance by observing the movements of the speaker's face and lips while speaking. The technology can serve as a supplement to audio-based speech recognition, compensating for the shortcomings of audio-based speech recognition models in high-noise and similar environments, and it can also be used on its own in silent environments to efficiently convey spoken content, so it has great application value in human-computer interaction systems. Meanwhile, in recent years, with the advent of large-scale lip language recognition data sets and the wide application of deep learning techniques in computer vision, natural language processing, and related fields, lip language recognition has gradually gained widespread attention and has started to play a role in real-world scenarios.
Lip language identification in real-world scenarios generally faces the problem of large-dictionary, low-resource recognition: when the spoken content is judged from visual information alone, the target language content covers a large dictionary and a wide range of vocabulary. Therefore, lip language recognition in real-world scenarios faces, on the one hand, the challenges of conventional lip language recognition, such as the difficulty of extracting fine-grained visual features, the difficulty of modeling the temporal dimension, large variations in speaker pose and expression, and complex and changeable external illumination; on the other hand, it faces the conditions brought by the large dictionary unique to this task, such as easy confusion among different words, a large number of short words, and difficulty of recognition.
At present, most lip language identification methods are visual single-modality, strongly supervised methods based on large-scale labeled data and deep neural network frameworks. They perform well on some public data sets, but still have many shortcomings and limitations. On the one hand, acquiring labeled data at scale consumes a large amount of manpower and material resources: accurate lip language identification data usually requires strict start and end time stamps, and a large amount of data must be annotated, which makes this step very tedious and difficult. On the other hand, when a person speaks, sound and lip movement are produced synchronously; therefore, using speech data to improve the performance of the lip recognition model is a valuable direction, yet existing methods rarely use speech data to aid the vision-based learning of the lip recognition model.
In research on lip language identification, the task, like other video classification tasks, hinges on extracting the most discriminative features from each lip reading video. However, lip recognition faces challenges that differ from other video classification tasks.
First, lip recognition is a very challenging task: the lip region is very limited, the corresponding speaking content is highly diverse, and the objective environment in which speaking takes place is complex and variable, including external lighting, the speaker's pose, and so on; these factors together make lip recognition very difficult.
Secondly, labeling lip recognition data is also a very cumbersome task, requiring not only strict start and stop time stamps but also the corresponding text labels. Most importantly, in a large-dictionary setting, visual samples of some different word classes (such as 'WHICH' and 'WHILE') are noticeably similar, which places higher demands on the discriminative capability of the model.
In addition, samples of the same word class can vary greatly due to factors such as speaking speed and speaker appearance (including pose, age, make-up, personal habits, etc.). These problems further increase the difficulty of lip recognition, because limited training data can hardly cover samples of all the different situations. Existing lip language identification methods are basically built on large-scale lip language identification data sets, and little consideration has been given to how to obtain a better-performing model for the lip language identification task under low-resource conditions.
At the same time, we note that speaking produces not only lip motion but also audio generated simultaneously with it; the audio features (e.g., MFCC, Fbank, etc.) are often more discriminative, and the visual and audio data require no additional alignment.
Disclosure of Invention
The invention aims to enable a lip language identification model to learn discriminative representation features under the condition of a limited amount of data, and therefore provides a lip language identification method based on an audio-visual collaborative training mechanism.
Aiming at the defects of the prior art, the invention provides a lip language identification method with audio-visual cooperation, which comprises the following steps:
Step 1, acquiring a lip language identification data set containing speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
Step 2, randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
Step 3, for each training iteration, inputting the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
Step 4, calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
Step 5, calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; constructing a final loss function from the cross-modal loss functions and the single-modal loss functions;
Step 6, cyclically executing steps 2 to 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result, as sketched in the overview code below.
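For illustration only, the following is a minimal PyTorch-style sketch of the episodic training loop described in steps 2 to 6. The names sample_episode, f_v2a, g_a2v, and avs_loss are hypothetical placeholders for the episode sampler, the two cross-modal mapping functions, and the combined loss; possible realizations of these pieces are sketched in the detailed description below, and the optimizer and hyperparameter values are assumptions of this example.

# Minimal PyTorch-style sketch of the episodic pre-training loop in steps 2-6.
# sample_episode, f_v2a, g_a2v and avs_loss are hypothetical placeholders.
import torch

def pretrain(dataset, visual_encoder, audio_encoder, f_v2a, g_a2v, n_way=5, n_iters=10000):
    modules = [visual_encoder, audio_encoder, f_v2a, g_a2v]
    params = [p for m in modules for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=1e-4)            # optimizer and lr are assumptions
    for _ in range(n_iters):                                 # step 6: loop until convergence
        # Step 2: draw N label categories, split each into support and query sets.
        sup_v, sup_a, s_lab, qry_v, qry_a, q_lab = sample_episode(dataset, n_way)
        # Step 3: prototype representations from the visual and audio encoders.
        sv, qv = visual_encoder(sup_v), visual_encoder(qry_v)
        sa, qa = audio_encoder(sup_a), audio_encoder(qry_a)
        # Steps 4-5: cross-modal and single-modal metric losses combined by weighted sum.
        loss = avs_loss(sv, sa, qv, qa, s_lab, q_lab, f_v2a, g_a2v, n_way)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return visual_encoder                                    # parameters saved for fine-tuning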
The audio-visual collaborative lip language identification method, wherein the step 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
The audio-visual collaborative lip language identification method, wherein the step 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance;
calculating the optimization loss function of the audio-visual cross-modal collaborative training; according to the above process, the cross-modal loss functions of the encoder models of the two audio-visual modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The audio-visual collaborative lip language identification method, wherein the step 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
Any of the audio-visual collaborative lip language identification methods, wherein the step 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
The invention also provides a lip language identification system with audio-visual cooperation, which comprises:
The module 1 is used for acquiring a lip language identification data set containing the speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
The module 2 is used for randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
A module 3, configured to, for each training iteration, input the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
The module 4 is used for calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
A module 5 for calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; constructing a final loss function from the cross-modal loss functions and the single-modal loss functions;
A module 6 for cyclically executing the module 2 to the module 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
The audio-visual collaborative lip language recognition system, wherein the module 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
The audio-visual collaborative lip language recognition system, wherein the module 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance;
calculating the optimization loss function of the audio-visual cross-modal collaborative training; according to the above process, the cross-modal loss functions of the encoder models of the two audio-visual modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The audio-visual collaborative lip language recognition system, wherein the module 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
Any of the audio-visual collaborative lip language recognition systems, wherein the module 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
The advantages of the invention are as follows:
Compared with the prior art, the invention not only exploits the natural temporal synchrony and content consistency between audio and vision to realize audio-assisted visual learning through metric learning at different levels, but also introduces a pre-training-based learning mechanism, thereby maximizing the feature extraction capability of the visual model, further improving performance on the lip recognition task, and achieving the best classification performance on two typical large-dictionary lip recognition data sets (LRW and LRW-1000).
Drawings
FIG. 1 is a flow chart of a lip language identification method based on audio-visual collaborative pre-training according to the invention;
fig. 2 is a schematic view of a lip region cut.
Detailed Description
The method exploits the natural synchronization between audio and video and provides an audio-visual collaborative learning scheme for lip language recognition. Three levels of metric learning are designed (visual-visual, audio-audio, and visual-audio), and the three metric-learning mechanisms are carried out simultaneously, which shortens the training time, reduces the number of training stages, and enables better collaborative learning between the visual and audio modalities. With the help of the audio information, the visual model can extract more discriminative features, thereby improving the performance of the lip language identification model. The invention comprises the following key technical points:
Key point 1: the invention provides an audio-visual collaborative learning mechanism that uses audio to assist the learning of the visual model, and designs metric learning at three levels (audio-audio, visual-visual, and audio-visual) so that the model can perform intra-modal and inter-modal feature representation learning simultaneously; this scheme also simplifies the overall audio-visual collaborative learning procedure;
Key point 2: a pre-training-then-fine-tuning learning mechanism is introduced to handle the lip language recognition task: first, audio-visual collaborative pre-training is performed so that the model can fully learn to distinguish and compare samples of different classes, and then fine-tuning is performed on the final recognition task on this basis;
Key point 3: the audio-visual collaborative learning mechanism and the pre-training-based learning mechanism are organically combined, which not only realizes the assistance of audio to the visual model, but also enables the model to better distinguish the differences among samples of different classes, and thus to accomplish the lip language identification task.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a lip language identification method based on audio-visual collaborative pre-training, and the following details of each step involved in the technical scheme of the present invention are given:
Step 1: visual modality data processing:
(1) Detecting the face position of a speaker in each frame by using a face detector;
(2) Detecting key points of the detected face;
(3) Cutting out the lip region according to the detected facial key points such as the nose tip, the two mouth corners, and the chin, so that the cropped region is complete and effective, as shown in the dashed box in Fig. 2;
(4) Converting each cropped lip-region picture into a grayscale picture; a cropping and grayscale-conversion sketch is given below.
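For illustration, the following is a minimal Python sketch of steps (3) and (4), assuming the facial key points have already been obtained from the face detector and landmark model in steps (1) and (2). The function name, the margin value, and the use of OpenCV are choices made here for the example and are not prescribed by the patent.

import cv2
import numpy as np

def crop_lip_region(frame_bgr, nose_tip, mouth_left, mouth_right, chin, margin=0.15):
    """Crop a lip-centered box from facial key points given as (x, y) pixel coordinates."""
    pts = np.array([nose_tip, mouth_left, mouth_right, chin], dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    # Expand the box slightly so the cropped region stays complete and effective.
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    x0, y0 = int(max(x_min - dx, 0)), int(max(y_min - dy, 0))
    x1 = int(min(x_max + dx, frame_bgr.shape[1]))
    y1 = int(min(y_max + dy, frame_bgr.shape[0]))
    crop = frame_bgr[y0:y1, x0:x1]
    # Step (4): convert the cropped lip-region picture to grayscale.
    return cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)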
step 2: audio modality data processing
(1) Sampling the audio waveform at a certain sampling rate; for example, a sampling rate of 16 kHz can be adopted;
(2) Converting the sampled audio data into a d-dimensional MFCC feature sequence using the Kaldi toolkit or a similar library; see the sketch after this step.
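As an illustrative sketch of step 2, the Kaldi-compatible MFCC frontend of torchaudio can be used in place of the Kaldi library itself; the 16 kHz rate follows the example above, while the 13-dimensional default for num_ceps is an assumption standing in for the d-dimensional feature.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_mfcc(wav_path, num_ceps=13):
    """Load audio, resample to 16 kHz, and compute a Kaldi-compatible MFCC sequence."""
    waveform, sample_rate = torchaudio.load(wav_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    # One MFCC vector per 10 ms frame; num_ceps plays the role of the d-dimensional feature.
    return kaldi.mfcc(waveform, sample_frequency=16000.0, num_ceps=num_ceps)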
Step 3: Visual encoder design
(1) Designing an ordinary 3D convolution layer (i.e., a layer consisting of a single 3D convolution operation) and a ResNet module, and cascading them as a frame-level feature extractor serving as the encoder front-end;
(2) Designing a 3-layer Transformer structure as a sequence-level feature extractor for temporal modeling of the sequence features obtained by concatenating the frame-level features from the encoder front-end;
(3) Cascading the above frame-level structure and sequence-level structure as the visual encoder E_{θ^{(v)}};
Step 4: Audio encoder design
Similar to step 3 (2), a 3-layer Transformer structure is designed as the audio encoder E_{θ^{(a)}}; a sketch of both encoders is given below.
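The following PyTorch sketch illustrates one possible realization of the two encoders described in steps 3 and 4. The choice of ResNet-18, the feature dimension of 512, the number of attention heads, the 3D-convolution kernel sizes, and the omission of positional encoding are assumptions of this example, not specifications of the patent.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualEncoder(nn.Module):
    """Front-end: one 3D conv layer + ResNet (ResNet-18 assumed); back-end: 3-layer Transformer."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(), nn.MaxPool3d((1, 3, 3), (1, 2, 2), (0, 1, 1)))
        trunk = resnet18(num_classes=d_model)
        # The 3D front-end already outputs 64 channels, so the ResNet stem is adapted.
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.frame_net = trunk
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.seq_net = nn.TransformerEncoder(layer, num_layers=3)  # positional encoding omitted

    def forward(self, x):                        # x: (B, 1, T, H, W) grayscale lip clips
        f = self.conv3d(x)                       # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.frame_net(f).reshape(b, t, -1)  # frame-level features, (B, T, d_model)
        return self.seq_net(f)[:, -1]            # last time step as the sequence representation

class AudioEncoder(nn.Module):
    """3-layer Transformer over the MFCC sequence; the input is projected to d_model."""
    def __init__(self, n_mfcc=13, d_model=512, nhead=8):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.seq_net = nn.TransformerEncoder(layer, num_layers=3)

    def forward(self, a):                        # a: (B, T, n_mfcc)
        return self.seq_net(self.proj(a))[:, -1]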
Step 5: Training data sampling:
(1) The audio-visual modality pair of a video sample is taken as one sample;
(2) For each model iteration, several different label categories (Chinese phrases or English words) are randomly drawn from the large-dictionary data set; assuming the total number of drawn categories is N, the drawn samples of each category are randomly split to construct a support set (S_t) and a query set (Q_t) for each category. The support set consists of samples of the same label category and is used to compute the center vector of that category; the query set is used to optimize the model in a metric-based manner: each query sample is measured against every center vector so that its distance to the center vector of its own category becomes smaller and its distance to the center vectors of other categories becomes larger;
Step 6: Repeat step 5 to generate multiple batches of training data; a sampling sketch is given below.
Step 7: Audio-visual cross-modal collaborative pre-training;
(1) For each training iteration, the randomly sampled visual support set S_t^{(v)}, audio support set S_t^{(a)}, visual query set Q_t^{(v)}, and audio query set Q_t^{(a)} are fed into the visual encoder and the audio encoder respectively, to obtain the visual and audio prototype representations of each sample (i.e., the point of each sample in the embedding space); the output vector of the last time step after each sequence passes through the visual or audio encoder is taken as the representation of the whole input sequence, namely:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the video or audio sequence data used as input, θ^{(v)} represents the visual encoder parameters, θ^{(a)} represents the audio encoder parameters, and E denotes an encoder model;
(2) The centers of the prototype representations of each category in the support sets of the two modalities are computed respectively, giving the center representation of each category:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
In the above, c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X.
(3) For each query set sample x_q, the distances between the query data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category are computed, and the probability that the query data belongs to category n is calculated from these distances, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
In the above, f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance.
(4) The optimization loss function of the audio-visual cross-modal collaborative training is computed; according to the above process, the cross-modal optimization loss functions of the encoder models of the two modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
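The following sketch illustrates the computations in (2)-(4), assuming the softmax-over-negative-Euclidean-distance form given above. The helper names and the arguments f_v2a and g_a2v (standing for the mapping functions f_{v→a} and g_{a→v}, whose architecture the patent does not specify, e.g. small linear layers) are introduced only for this example.

import torch
import torch.nn.functional as F

def prototypes(features, labels, n_way):
    """Class centers: mean of the support-set prototype representations per class."""
    return torch.stack([features[labels == n].mean(dim=0) for n in range(n_way)])

def proto_loss(query_feats, query_labels, centers):
    """Softmax over negative Euclidean distances to the class centers, optimized as the
    negative log-likelihood of the correct class."""
    dists = torch.cdist(query_feats, centers)          # (num_query, n_way)
    return F.cross_entropy(-dists, query_labels)       # logits = negative distances

def cross_modal_losses(sv, sa, qv, qa, s_labels, q_labels, f_v2a, g_a2v, n_way):
    """L_{v->a}: visual queries mapped into audio space vs. audio centers;
    L_{a->v}: audio queries mapped into visual space vs. visual centers."""
    c_v = prototypes(sv, s_labels, n_way)
    c_a = prototypes(sa, s_labels, n_way)
    loss_v2a = proto_loss(f_v2a(qv), q_labels, c_a)
    loss_a2v = proto_loss(g_a2v(qa), q_labels, c_v)
    return loss_v2a, loss_a2v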
Step 8: audio-visual single-mode metric learning
(1) In parallel with step 7, metric learning is also performed on the prototype representations of the different classes within each single modality (audio and visual). On the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, the distance of each query set sample x_q to the support set center representation of each category is computed, and the probability of belonging to category n is calculated as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
In the above formula, the meanings of the symbols are consistent with the above.
(2) The metric-learning-based optimization loss function within each of the two modalities is computed, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q, in the following manner:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
In the above formula, the meanings of the symbols are consistent with the above.
(3) The loss function of the final audio-visual collaborative training stage is obtained by combining the audio-visual cross-modal collaborative training loss functions and the single-modal metric learning loss functions, in the following manner:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
In the above formula, λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
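Reusing the helper functions from the previous sketch, the total pre-training loss L_AVS can be assembled as follows; the default weight values are placeholders, since the patent leaves λ_v, λ_a, λ_{a→v}, and λ_{v→a} as hyperparameters.

def avs_loss(sv, sa, qv, qa, s_labels, q_labels, f_v2a, g_a2v, n_way,
             lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the two single-modal prototypical losses and the two cross-modal losses."""
    lam_v, lam_a, lam_a2v, lam_v2a = lambdas
    c_v = prototypes(sv, s_labels, n_way)               # visual class centers
    c_a = prototypes(sa, s_labels, n_way)               # audio class centers
    loss_v = proto_loss(qv, q_labels, c_v)              # L_v
    loss_a = proto_loss(qa, q_labels, c_a)              # L_a
    loss_v2a, loss_a2v = cross_modal_losses(sv, sa, qv, qa, s_labels, q_labels,
                                            f_v2a, g_a2v, n_way)
    return lam_v * loss_v + lam_a * loss_a + lam_a2v * loss_a2v + lam_v2a * loss_v2a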
Step 9: original large dictionary data fine tuning
(1) Connecting a classifier constructed by a full-connection layer at the output end of the visual encoder so as to form a final lip language identification model structure;
(2) Loading the visual encoder model parameters obtained through the pre-training in the steps 7 and 8;
(3) Classification fine-tuning is performed on the newly obtained classification model using the full-category visual data of the large dictionary, and the model is optimized with the cross entropy loss; a fine-tuning sketch is given below;
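A minimal fine-tuning sketch for step 9 follows; the classifier dimension, optimizer, learning rate, and epoch count are assumptions of this example.

import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Final recognition model: pre-trained visual encoder + fully connected classifier."""
    def __init__(self, visual_encoder, num_classes, d_model=512):
        super().__init__()
        self.encoder = visual_encoder
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, video):
        return self.classifier(self.encoder(video))

def finetune(model, loader, epochs=10, lr=3e-4):
    """Step 9: classification fine-tuning on the full large-dictionary visual data
    with cross-entropy loss (optimizer choice and schedule are assumptions)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for video, label in loader:
            loss = ce(model(video), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Step 10 (testing): prediction = model(video_clip).argmax(dim=-1); only visual data is needed.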
Step 10: Lip language identification model test:
After the above steps, a trained lip language recognition model is obtained; during testing, only the sequence data of the visual modality needs to be input to obtain the final recognition result.
The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a lip language identification system with audio-visual cooperation, which comprises:
The module 1 is used for acquiring a lip language identification data set containing the speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
The module 2 is used for randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
A module 3, configured to, for each training iteration, input the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
The module 4 is used for calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
A module 5 for calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; constructing a final loss function from the cross-modal loss functions and the single-modal loss functions;
A module 6 for cyclically executing the module 2 to the module 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
The audio-visual collaborative lip language recognition system, wherein the module 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
The audio-visual collaborative lip language recognition system, wherein the module 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between two vectors, here the Euclidean distance;
calculating the optimization loss function of the audio-visual cross-modal collaborative training; according to the above process, the cross-modal loss functions of the encoder models of the two audio-visual modalities can be expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The audio-visual collaborative lip language recognition system, wherein the module 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
Any of the audio-visual collaborative lip language recognition systems, wherein the module 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.

Claims (10)

1. A lip language identification method of audio-visual cooperation is characterized by comprising the following steps:
Step 1, acquiring a lip language identification data set containing speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
Step 2, randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
Step 3, for each training iteration, inputting the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
Step 4, calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
Step 5, calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; weighting and summing the cross-modal loss functions and the single-modal loss functions to obtain a final loss function;
Step 6, cyclically executing steps 2 to 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
2. The audio-visual collaborative lip recognition method according to claim 1, wherein the step 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
3. The audio-visual collaborative lip recognition method according to claim 2, wherein the step 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between the two vectors;
calculating the optimization loss function of the audio-visual cross-modal collaborative training, wherein the cross-modal loss functions of the encoder models of the two audio-visual modalities are expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
4. The audio-visual collaborative lip recognition method according to claim 3, wherein the step 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
5. The audio-visual collaborative lip recognition method according to any one of claims 1 to 4, wherein the step 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
6. An audio-visual collaborative lip language identification system, comprising:
The module 1 is used for acquiring a lip language identification data set containing the speaker face videos, wherein each speaker face video has a label type; lip region extraction is carried out on the face video of the speaker to obtain visual mode data, and audio waveform sampling and feature extraction are carried out on the face video of the speaker to obtain audio mode data; binding the visual mode data and the audio mode data of each speaker face video as a sample;
The module 2 is used for randomly extracting N different label categories in the lip language identification data set, randomly dividing the extracted sample number of each category, and respectively obtaining a visual support set, a visual query set, an audio support set and an audio query set of each category as a batch of training data;
A module 3, configured to, for each training iteration, input the visual support set and the visual query set into a visual encoder, and the audio support set and the audio query set into an audio encoder, to obtain the video sequence features X^{(v)}, the audio sequence features X^{(a)}, and the prototype representation of each sample;
The module 4 is used for calculating the centers of the prototype representations of each category in the support sets of the audio and visual modalities respectively, to obtain all category centers of the support sets; computing, for each query-set sample, the distances between the query-set data (after passing through the respective mapping functions) and each support-set category center, and calculating, according to these distances, the probability that the query-set data belongs to each category, so as to obtain the cross-modal loss functions of the audio encoder and the video encoder;
A module 5 for calculating, for each query-set sample, the distances to the center representation of each support-set class and the probability of belonging to class n, so as to calculate single-modal loss functions for the audio encoder and the video encoder respectively; weighting and summing the cross-modal loss functions and the single-modal loss functions to obtain a final loss function;
A module 6 for cyclically executing the module 2 to the module 5 until the final loss function converges, and saving the current visual encoder model parameters; connecting a classifier constructed from a fully connected layer to the output of the visual encoder to form a lip language identification model, and inputting a video to be recognized into the lip language identification model to obtain the lip language identification result.
7. The audio-visual collaborative lip recognition system of claim 6, wherein the module 3 comprises:
x^{(v)} = E_{\theta^{(v)}}(X^{(v)}), \quad x^{(a)} = E_{\theta^{(a)}}(X^{(a)})
where X represents the input video or audio sequence data, θ^{(v)} represents the visual encoder parameters, and θ^{(a)} represents the audio encoder parameters.
8. The audio-visual collaborative lip recognition system of claim 7, wherein the module 4 comprises:
obtaining the representation of each support set category center by:
c_n^{(v)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(v)}, \quad c_n^{(a)} = \frac{1}{|S_t^n|} \sum_{(X, y_X) \in S_t} \pi(y_X = n)\, x^{(a)}
where c_n^{(v)} and c_n^{(a)} denote the center representation of class n in the support set of the visual and audio modalities respectively, |S_t^n| refers to the number of samples of class n in S_t, π(·) denotes the indicator function, and y_X denotes the redefined class label of sample X;
computing, for each query set sample x_q, the distances between the query set data, after passing through the mapping functions f_{v→a} and g_{a→v} respectively, and the center of each support set category, and calculating from these distances the probability that the query set data belongs to category n, in the following manner:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(f_{v \to a}(x_q^{(v)}), c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(f_{v \to a}(x_q^{(v)}), c_{n'}^{(a)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(g_{a \to v}(x_q^{(a)}), c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(g_{a \to v}(x_q^{(a)}), c_{n'}^{(v)}))}
where f_{v→a} and g_{a→v} refer to the mapping functions from the visual space to the audio space and from the audio space to the visual space respectively, and d(·,·) refers to the distance between the two vectors;
calculating the optimization loss function of the audio-visual cross-modal collaborative training, wherein the cross-modal loss functions of the encoder models of the two audio-visual modalities are expressed as:
L_{v \to a}(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_{a \to v}(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
9. The audio-visual collaborative lip recognition system of claim 8, wherein the module 5 comprises:
performing metric learning on the prototype representations of the different classes within each single modality (audio and visual); on the basis of the obtained class center representations c_n^{(v)} and c_n^{(a)}, calculating for each query set sample x_q the distance to the support set center representation of each category, and calculating the probability of belonging to category n as follows:
p(y = n \mid x_q^{(v)}) = \frac{\exp(-d(x_q^{(v)}, c_n^{(v)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(v)}, c_{n'}^{(v)}))}, \quad p(y = n \mid x_q^{(a)}) = \frac{\exp(-d(x_q^{(a)}, c_n^{(a)}))}{\sum_{n'=1}^{N} \exp(-d(x_q^{(a)}, c_{n'}^{(a)}))}
calculating the metric-learning-based optimization loss function within each of the two audio-visual modalities, with the optimization target defined as maximizing the likelihood that the model classifies a query set sample into its correct class y_q:
L_v(\theta^{(v)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(v)}), \quad L_a(\theta^{(a)}) = -\frac{1}{|Q_t|} \sum_{(X, y_q) \in Q_t} \log p(y = y_q \mid x_q^{(a)})
The final loss function is calculated as follows:
L_{AVS}(\theta^{(a)}, \theta^{(v)}) = \lambda_v L_v(\theta^{(v)}) + \lambda_a L_a(\theta^{(a)}) + \lambda_{a \to v} L_{a \to v}(\theta^{(v)}) + \lambda_{v \to a} L_{v \to a}(\theta^{(a)})
where λ_v, λ_a, λ_{a→v}, and λ_{v→a} are weight coefficients.
10. The audio-visual collaborative lip recognition system of any one of claims 6 to 9, wherein the module 6 further comprises:
and performing fine tuning training based on cross entropy loss on the lip recognition model by using the lip recognition data set so as to optimize the lip recognition model.
CN202110800963.XA 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration Active CN113658582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110800963.XA CN113658582B (en) 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110800963.XA CN113658582B (en) 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration

Publications (2)

Publication Number Publication Date
CN113658582A CN113658582A (en) 2021-11-16
CN113658582B true CN113658582B (en) 2024-05-07

Family

ID=78489461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110800963.XA Active CN113658582B (en) 2021-07-15 2021-07-15 Lip language identification method and system for audio-visual collaboration

Country Status (1)

Country Link
CN (1) CN113658582B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116828129B (en) * 2023-08-25 2023-11-03 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN111223483A (en) * 2019-12-10 2020-06-02 浙江大学 Lip language identification method based on multi-granularity knowledge distillation
CN111753704A (en) * 2020-06-19 2020-10-09 南京邮电大学 Time sequence centralized prediction method based on video character lip reading recognition
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition
CN112397074A (en) * 2020-11-05 2021-02-23 桂林电子科技大学 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
WO2021135509A1 (en) * 2019-12-30 2021-07-08 腾讯科技(深圳)有限公司 Image processing method and apparatus, electronic device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060009978A1 (en) * 2004-07-02 2006-01-12 The Regents Of The University Of Colorado Methods and systems for synthesis of accurate visible speech via transformation of motion capture data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN111223483A (en) * 2019-12-10 2020-06-02 浙江大学 Lip language identification method based on multi-granularity knowledge distillation
WO2021135509A1 (en) * 2019-12-30 2021-07-08 腾讯科技(深圳)有限公司 Image processing method and apparatus, electronic device, and storage medium
CN111753704A (en) * 2020-06-19 2020-10-09 南京邮电大学 Time sequence centralized prediction method based on video character lip reading recognition
CN112397074A (en) * 2020-11-05 2021-02-23 桂林电子科技大学 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
CN112330713A (en) * 2020-11-26 2021-02-05 南京工程学院 Method for improving speech comprehension degree of severe hearing impaired patient based on lip language recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals; Shah Nawaz; 2019 Digital Image Computing: Techniques and Applications (DICTA); pp. 1-7 *

Also Published As

Publication number Publication date
CN113658582A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN101101752B (en) Monosyllabic language lip-reading recognition system based on vision character
Gao et al. Transition movement models for large vocabulary continuous sign language recognition
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
Sainath et al. Exemplar-based processing for speech recognition: An overview
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
Peng et al. Recognition of handwritten Chinese text by segmentation: a segment-annotation-free approach
CN105575388A (en) Emotional speech processing
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN109377981B (en) Phoneme alignment method and device
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN112749646A (en) Interactive point-reading system based on gesture recognition
CN111554279A (en) Multi-mode man-machine interaction system based on Kinect
CN113658582B (en) Lip language identification method and system for audio-visual collaboration
Wang et al. WaveNet with cross-attention for audiovisual speech recognition
Ballard et al. A multimodal learning interface for word acquisition
Medjkoune et al. Combining speech and handwriting modalities for mathematical expression recognition
You et al. Manifolds based emotion recognition in speech
Yang et al. Sign language recognition system based on weighted hidden Markov model
CN115145402A (en) Intelligent toy system with network interaction function and control method
CN114171007A (en) System and method for aligning virtual human mouth shapes
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
CN112395414B (en) Text classification method, training method of classification model, training device of classification model, medium and training equipment
Melnyk et al. Towards computer assisted international sign language recognition system: a systematic survey
Huang et al. Phone classification via manifold learning based dimensionality reduction algorithms
Choudhury et al. Review of Various Machine Learning and Deep Learning Techniques for Audio Visual Automatic Speech Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant