CN111191490A - Lip reading research method based on Kinect vision - Google Patents

Lip reading research method based on Kinect vision

Info

Publication number
CN111191490A
Authority
CN
China
Prior art keywords
lip
data
training
kinect
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811357055.2A
Other languages
Chinese (zh)
Inventor
喻梅 (Yu Mei)
马权智 (Ma Quanzhi)
于健 (Yu Jian)
于瑞国 (Yu Ruiguo)
王建荣 (Wang Jianrong)
徐天一 (Xu Tianyi)
赵满坤 (Zhao Mankun)
高洁 (Gao Jie)
岳帅 (Yue Shuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University Marine Technology Research Institute
Original Assignee
Tianjin University Marine Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University Marine Technology Research Institute
Priority to CN201811357055.2A
Publication of CN111191490A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G06V 40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A lip reading research method based on Kinect vision obtains 3D coordinate information of the lip region (the region of interest, ROI) from captured images and depth data, trains and recognizes with coordinates and spatial angles as features, and explores lip reading based on three-dimensional information. Compared with model-based methods it preserves more information; compared with pixel-based methods it avoids background interference in the extracted data and reduces feature dimensionality and redundancy.

Description

Lip reading research method based on Kinect vision
Technical Field
The invention belongs to the field of speech recognition and specifically relates to a lip reading research method based on Kinect vision.
Background
In recent years, with the rapid development of computer technology, especially the spread of portable computing, human-computer interaction (HCI) has gradually become an important part of modern life. In human communication, voice is undoubtedly a vital information medium: a person's emotions, such as joy, anger, and sorrow, can all be conveyed through speech. Speech has therefore become the principal mode of human-machine interaction, and speech recognition technology has developed rapidly; speech recognition systems such as voice search and voice input methods have become a major trend in today's society.
However, even the most sophisticated speech recognition system struggles to adapt to the complex, variable environments of real life, especially high-noise environments, in which recognition performance drops sharply. Moreover, such systems offer little benefit to hearing-impaired or speech-impaired people. Psychological research shows that in noisy environments people unconsciously use visual information such as lip movements, expressions, and gestures to improve language comprehension. In other words, human perception of language is multimodal: it relies not only on the audio channel but also on visual information to aid understanding during communication. Lip reading research is therefore both a great aid to existing speech recognition systems and a boon for hearing-impaired or speech-impaired people, and it has attracted industry attention and developed vigorously. Lip reading research mainly involves lip region detection and positioning, feature extraction, and training and recognition, with feature extraction at its core. Current feature extraction methods fall into three main categories:
1) Model-based methods abstract the lip contour into a mathematical model to obtain geometric features of the lips. Their disadvantage is that a fixed model may lose important information.
2) Pixel-based methods use the pixel information of a region of interest (ROI), either directly or after some transformation, as the feature vector. Their disadvantage is that the feature vector is high-dimensional and highly redundant.
3) Hybrid methods combine the first two to extract features, for example the active appearance model (AAM) algorithm. After the features are extracted, they are trained and recognized with an HMM model.
Disclosure of Invention
To address the problems in the prior art, the invention provides a lip reading research method based on Kinect vision, which obtains 3D coordinate information of the lip region (the region of interest, ROI) from the captured images and depth data, trains and recognizes with coordinates and spatial angles as features, and explores lip reading based on three-dimensional information. Compared with model-based methods it preserves more information; compared with pixel-based methods it avoids background interference in the extracted data and reduces feature dimensionality and redundancy.
A lip reading research method based on Kinect vision specifically comprises the following steps:
Step 1: acquire the required three-dimensional face data with a Kinect and preprocess the data;
Step 2: locate the lip region, extract the 18 lip feature points, and renumber and model them;
Step 3: extract the features, namely the angle features between feature points and the coordinate features of the feature points, and normalize them;
Step 4: train and recognize the features with a hidden Markov model (HMM) and the K-nearest neighbor (KNN) algorithm.
In Step 1, the acquired data are preprocessed as follows:
A corpus is collected, and the acquired data are ordered by label and timestamp and stored in a binary file. The raw data are then preprocessed: first, the whole audio recording is cut into one clip per word, and the corresponding color images and depth data are stored synchronously in the same location; second, unqualified data are removed and re-recorded.
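As an illustration of this per-word segmentation, the minimal sketch below slices a session by word boundary timestamps; the function name, the sample-rate parameter, and the data layout are assumptions for illustration, not the patent's actual implementation.

```python
import numpy as np

def segment_session(audio, sample_rate, frame_timestamps, word_spans):
    """Cut one recording session into per-word clips.

    audio: 1-D waveform array; frame_timestamps: capture time (seconds)
    of each Kinect color/depth frame; word_spans: (label, t_start, t_end)
    tuples from the corpus annotation.
    """
    segments = []
    for label, t0, t1 in word_spans:
        clip = audio[int(t0 * sample_rate):int(t1 * sample_rate)]
        # color/depth frames recorded while this word was being spoken
        frame_ids = np.where((frame_timestamps >= t0) & (frame_timestamps <= t1))[0]
        segments.append({"label": label, "audio": clip, "frame_ids": frame_ids})
    return segments
```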
Step 2 locates the lip region on the basis of Step 1, as follows:
In the data acquisition stage, a Kinect sensor provides the three-dimensional coordinates of 121 feature points predefined on the speaker's face. Extensive experiments on these data identified which of the 121 face point indices correspond to the 18 lip feature points. From the positional relationship of these 18 points, the three-dimensional coordinates of all lip feature points are extracted for every image frame of every word.
Step 3 extracts features on the basis of Steps 1 and 2, as follows:
Two kinds of features can be selected, each with its own selection and normalization method. The first, by analogy with traditional feature extraction, uses the angles between feature points as features, screened with KNN (K-nearest neighbor); the second directly uses the coordinates of the 18 lip feature points as features, which are normalized before training and recognition.
Step 4 trains on and recognizes the normalized data on the basis of Steps 1 to 3, as follows:
Training and recognition use a 3:2 split between training and test sets in a full-training, full-recognition mode, with KNN classification and HMM model training and recognition. Comparing the recognition results with the test-set labels yields the recognition rate.
This lip reading research method based on Kinect vision trains and recognizes with three-dimensional coordinates and spatial angles as features, improving recognition accuracy over existing methods.
First, lip reading research greatly helps to refine speech recognition systems. Compared with an audio-only speech recognition system, a recognition system that also uses video channel information performs better in complex environments and is more robust. For example, the unvoiced consonants /p/ and /k/ and the voiced consonants /b/ and /d/ are difficult to distinguish from the audio channel alone, but recognition accuracy improves greatly once lip features are incorporated.
Second, in practical applications, lip reading research can assist personal identification, sign language recognition, and similar tasks. Lip reading also plays an important role in restoring the speech of people who lost their hearing later in life, and mouth-shape analysis now provides technical support for criminal investigation and anti-terrorism. Through the study of lip movement patterns, lip reading technology further contributes to speaker recognition, lip movement synthesis, speech-driven face image coding, and speech-driven head movement synthesis, with applications in coding, animated lip movement synthesis, virtual agents on the web, and mouth-shape matching in dubbing.
Finally, in terms of theory, lip reading research spans pattern recognition, computer vision, natural language processing, image processing, and other fields, whose research contents promote and test one another as they develop.
Drawings
FIG. 1 is a schematic flow chart of the lip reading study method of the present invention;
FIG. 2 is a schematic diagram of a Kinect three-dimensional coordinate system;
FIG. 3 is a schematic diagram of 121 feature points of a human face;
FIG. 4 is a schematic diagram of the lip region feature point renumbering.
Detailed Description
The invention is further described with reference to the following figures and examples, but the scope of the invention is not limited thereto.
A lip reading research method based on Kinect vision, the flow of a specific embodiment of which is shown in FIG. 1, includes:
Step S0101: the data to be preprocessed are three-dimensional data acquired by Kinect tracking. The three-dimensional tracking result in the Kinect coordinate system is (x, y, z), where the Z-axis measures the distance from the sensor to the user, the Y-axis points up and down, and the X-axis points left and right; all values are in meters. The coordinate system is shown in FIG. 2.
The three-dimensional face coordinates at a given moment are obtained by processing the color image and the depth image captured at that moment. The Face Tracking SDK predefines 121 feature points on the face, covering all facial contours; 18 of them lie in the lip region, as shown in FIG. 3.
Step S0201: after the 3D data of the lip region are obtained, the lip region is modeled in 3D. The 18 feature points are first renumbered, the inner lip clockwise as 1 to 8 and the outer lip clockwise as 9 to 18, as shown in FIG. 4.
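A minimal sketch of this extraction and renumbering follows. The concrete SDK indices of the lip points among the 121 Face Tracking points were determined by the experiments mentioned above and are not listed in this text, so the index values below are placeholders; only the ordering convention (inner lip 1-8 clockwise, then outer lip 9-18 clockwise) comes from the patent.

```python
import numpy as np

# Placeholder SDK indices -- assumptions for illustration only.
INNER_LIP_SDK = [87, 88, 89, 90, 91, 92, 93, 94]            # renumbered 1..8
OUTER_LIP_SDK = [79, 80, 81, 82, 83, 84, 85, 86, 95, 96]    # renumbered 9..18

def extract_lip_points(face_121):
    """face_121: (121, 3) array of Kinect 3D points in meters.
    Returns an (18, 3) array ordered by the patent's numbering 1..18."""
    return np.asarray(face_121)[INNER_LIP_SDK + OUTER_LIP_SDK]
```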
Step S0202: because the lip information obtained consists of three-dimensional coordinates, whose advantages can hardly be shown in a two-dimensional image, MATLAB is used to model the lip region in three dimensions from these coordinates, yielding a three-dimensional lip contour model.
Step S0301: for angle feature extraction, the lip contour provides 18 coordinate points; if angles are chosen as features, the credibility of each angle, i.e. its individual recognition rate, must be analyzed. Combining the 18 feature points yields 2448 angles (each of the C(18,3) = 816 point triples contributes its three interior angles). Classifying with the KNN method gives the classification accuracy of each word, from which the angles are ranked by credibility in descending order, and the high-credibility angles are selected and extracted as the angle features.
Step S0302: for angle feature normalization, the feature matrix of each word is N × L, where N is the number of image frames of that word and L is the feature dimension per frame, i.e. the number of selected angles. The feature matrix is normalized so that every dimension lies in the interval [-1, 1]; here this is achieved simply by taking the cosine of each angle, which falls in [-1, 1] by construction.
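The sketch below enumerates these angle features for one frame, computing the cosine directly so that extraction and normalization are illustrated together; it is a minimal version under the assumption that `pts` is the (18, 3) array of lip points from the previous step.

```python
from itertools import combinations
import numpy as np

def angle_cosines(pts):
    """Cosines of all 2448 triangle angles among 18 lip points (one frame)."""
    feats = []
    for i, j, k in combinations(range(18), 3):      # 816 point triples
        for apex, a, b in ((i, j, k), (j, i, k), (k, i, j)):
            u = pts[a] - pts[apex]
            v = pts[b] - pts[apex]
            # cosine of the interior angle at `apex`; already in [-1, 1]
            feats.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return np.asarray(feats)                        # length 816 * 3 = 2448
```

In practice only the high-credibility subset of these 2448 values, as ranked by the KNN screening above, would be kept as the per-frame feature vector.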
Step S0303: for three-dimensional coordinate feature extraction, each image frame provides the three-dimensional information of the 18 lip feature points, and these 18 points are analyzed and extracted as the features of the three-dimensional data. Since the information consists exactly of the three-dimensional coordinates of the 18 points, the feature vector of a frame is obtained by directly concatenating them.
Let the coordinates of the 18 points be [xi, yi, zi], i = 1, 2, …, 18. The stitched feature vector is then [x1, x2, …, x18, y1, y2, …, y18, z1, z2, …, z18]. Concatenating the vectors of all image frames of a word in this way yields the word's three-dimensional feature matrix of size N × 54, where N is the number of image frames of the word and 54 = 18 points × 3 coordinates. For example, for the word "operation", whose audio spans 51 image frames, the 51 frames of three-dimensional data yield a 51 × 54 feature matrix.
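A minimal sketch of this stitching step, assuming each frame is the (18, 3) array produced earlier:

```python
import numpy as np

def stitch_word(frames):
    """frames: list of (18, 3) lip-point arrays, one per image frame.
    Returns the word's (N, 54) feature matrix [x1..x18, y1..y18, z1..z18]."""
    rows = [np.concatenate([f[:, 0], f[:, 1], f[:, 2]]) for f in frames]
    return np.vstack(rows)
```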
Step S0304: for three-dimensional coordinate feature normalization, all three-dimensional data are first translated so that they occupy the same position on the coordinate axes; specifically, the 9th feature point of every frame, the left outer lip end point (the speaker's right mouth corner), is moved to a common origin. The data are then rotated so that the line joining the two mouth corners, i.e. the line through feature points 9 and 15, becomes the x-axis, with the other points rotated accordingly. The rotated coordinates serve as the final coordinate features.
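One way to realize this translation and rotation, sketched under the assumption of the 1-based numbering above (hence the -1 offsets), uses Rodrigues' rotation formula to align the mouth-corner direction with the x-axis:

```python
import numpy as np

def normalize_frame(pts):
    """pts: (18, 3) lip points; returns the translated and rotated copy."""
    p = pts - pts[9 - 1]                        # move point 9 to the origin
    d = p[15 - 1] / np.linalg.norm(p[15 - 1])   # unit vector toward point 15
    x = np.array([1.0, 0.0, 0.0])
    v = np.cross(d, x)                          # rotation axis (unnormalized)
    c, s = float(np.dot(d, x)), float(np.linalg.norm(v))
    if s < 1e-12:                               # already parallel to the x-axis
        return p if c > 0 else p * np.array([-1.0, -1.0, 1.0])
    k = v / s
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    R = np.eye(3) + s * K + (1 - c) * (K @ K)   # Rodrigues' rotation formula
    return p @ R.T
```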
Step S0401: after the KNN classifier is trained, the test set is classified. The test set is subject to the same requirement as the training set: it must be stitched into a feature matrix with the same number of columns, so that every sample has the same dimensionality.
Classifying the test set requires the following:
1) a test set in the same format as the training set, with labels attached, so that classification accuracy can be computed afterwards;
2) a value of k: during classification, the class attributes of the k nearest neighbors of the point to be classified are considered, and the class held by the most neighbors is assigned to the point;
3) a classification rule, i.e. the distance metric of KNN; common choices are Euclidean distance, Manhattan distance, and Hamming distance.
With these in place, the classification result of each test sample is obtained, and comparing the results with the test-set labels gives the classification accuracy.
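A minimal sketch covering these three requirements, using scikit-learn as an assumed library choice (the patent does not name one); k = 5 and the Euclidean metric are illustrative defaults:

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_X, train_y, test_X, test_y, k=5):
    """Train a KNN classifier and return its accuracy on the labelled test set.
    train_X / test_X: feature matrices with identical column counts."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    clf.fit(train_X, train_y)
    return clf.score(test_X, test_y)   # fraction of correctly classified samples
```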
Step S0402: the HMM model is trained as follows:
First, the data set is prepared. Three fifths of the samples are drawn at random from the feature file (the file of selected angles produced by the KNN screening during feature extraction) as training samples, the remaining two fifths forming the test set, and each training sample is tagged with its class name as a label.
Second, the HMM model is trained. Once the data set is obtained, one HMM is trained per category on that category's training samples. Several HMM parameters must be defined in this process, including the number of states and the number of Gaussian mixture components. Once these are given, the HMM can be initialized and then iteratively updated to obtain the model of the word.
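A minimal per-word training sketch, assuming the hmmlearn library (a library choice not specified by the patent) and illustrative state and mixture counts:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_models(features_by_word, n_states=5, n_mix=3):
    """features_by_word: dict mapping word -> list of (N_i, D) feature matrices.
    Returns one Gaussian-mixture HMM per word."""
    models = {}
    for word, seqs in features_by_word.items():
        X = np.vstack(seqs)                  # all observations, stacked
        lengths = [len(s) for s in seqs]     # frame count of each utterance
        m = GMMHMM(n_components=n_states, n_mix=n_mix, n_iter=20)
        m.fit(X, lengths)                    # iterative (Baum-Welch) update
        models[word] = m
    return models
```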
Step S0403: recognition corresponds to the decoding problem among the three basic problems of HMMs. Given a test sample, i.e. a known observation sequence, the most probable hidden state sequence is searched for under each HMM, yielding a probability; comparing the probabilities produced by all the HMM models, the model with the highest probability is selected as the recognition result. Comparing the recognition results with the test-set labels gives the recognition rate.
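Continuing the sketch above, recognition scores a test sequence under every word's model via Viterbi decoding and picks the most likely word:

```python
def recognize(models, seq):
    """models: dict word -> trained HMM; seq: (N, D) observation sequence."""
    # decode() returns (Viterbi log-probability, state sequence); keep the former
    scores = {word: m.decode(seq)[0] for word, m in models.items()}
    return max(scores, key=scores.get)       # word whose model scores highest
```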

Claims (5)

1. A lip reading research method based on Kinect vision, characterized by comprising the following steps:
Step 1: acquiring the required three-dimensional face data through a Kinect and preprocessing the data;
Step 2: locating the lip region, extracting the 18 lip feature points, and renumbering and modeling them;
Step 3: extracting features, namely the angle features between feature points and the coordinate features of the feature points, and normalizing them;
Step 4: training and recognizing the features with a hidden Markov model and the K-nearest neighbor algorithm.
2. The Kinect vision-based lip reading research method of claim 1, wherein the acquired data are preprocessed as follows:
a corpus is collected, and the acquired data are ordered by label and timestamp and stored in a binary file; the raw data are preprocessed by first cutting the whole audio recording into one clip per word and storing the corresponding color images and depth data synchronously in the same location, and second removing and re-recording unqualified data.
3. The Kinect vision-based lip reading research method of claim 1, wherein Step 2 locates the lip region on the basis of Step 1 as follows:
in the data acquisition stage, a Kinect sensor provides the three-dimensional coordinates of 121 feature points predefined on the speaker's face; extensive experiments on these data identify which of the 121 face point indices correspond to the 18 lip feature points; and the three-dimensional coordinates of all lip feature points are extracted for every image frame of every word from the positional relationship of these 18 points.
4. The Kinect vision-based lip reading research method of claim 1, wherein Step 3 extracts features on the basis of Steps 1 and 2 as follows:
two kinds of features can be selected, each with its own selection and normalization method; the first, by analogy with traditional feature extraction, uses the angles between feature points as features, selected by the K-nearest neighbor method; the second directly uses the coordinates of the 18 lip feature points as features, which are normalized before training and recognition.
5. The Kinect vision-based lip reading research method of claim 1, wherein Step 4 trains on and recognizes the normalized data on the basis of Steps 1 to 3 as follows:
training and recognition use a 3:2 split between training and test sets in a full-training, full-recognition mode, with K-nearest neighbor classification and HMM model training and recognition; comparing the recognition results with the test-set labels yields the recognition rate.
CN201811357055.2A 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision Pending CN111191490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811357055.2A CN111191490A (en) 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811357055.2A CN111191490A (en) 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision

Publications (1)

Publication Number Publication Date
CN111191490A 2020-05-22

Family

ID=70710612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811357055.2A Pending CN111191490A (en) 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision

Country Status (1)

Country Link
CN (1) CN111191490A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239902A (en) * 2021-07-08 2021-08-10 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data
CN108109614A (en) * 2016-11-24 2018-06-01 广州映博智能科技有限公司 A kind of new robot band noisy speech identification device and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109614A (en) * 2016-11-24 2018-06-01 广州映博智能科技有限公司 A kind of new robot band noisy speech identification device and method
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
岳帅 (Yue Shuai): "Research on real-time lip reading technology based on Kinect 3D vision" (基于Kinect三维视觉的实时唇读技术研究) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239902A (en) * 2021-07-08 2021-08-10 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators
CN113239902B (en) * 2021-07-08 2021-09-28 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination