CN111191490A - Lip reading research method based on Kinect vision - Google Patents

Lip reading research method based on Kinect vision

Info

Publication number
CN111191490A
Authority
CN
China
Prior art keywords
lip
data
training
kinect
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811357055.2A
Other languages
Chinese (zh)
Inventor
喻梅 (Yu Mei)
马权智 (Ma Quanzhi)
于健 (Yu Jian)
于瑞国 (Yu Ruiguo)
王建荣 (Wang Jianrong)
徐天一 (Xu Tianyi)
赵满坤 (Zhao Mankun)
高洁 (Gao Jie)
岳帅 (Yue Shuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University Marine Technology Research Institute
Original Assignee
Tianjin University Marine Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University Marine Technology Research Institute
Priority to CN201811357055.2A
Publication of CN111191490A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G06V 40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A lip reading research method based on Kinect vision obtains 3D coordinate information of the lip region (the region of interest, ROI) from captured images and depth data, trains and recognizes with coordinates and spatial angles as features, and explores lip reading based on three-dimensional information. Compared with model-based methods it preserves more information; compared with pixel-based methods it avoids background interference in the extracted data and reduces feature dimensionality and redundancy.

Description

Lip reading research method based on Kinect vision
Technical Field
The invention belongs to the field of speech recognition and specifically relates to a lip reading research method based on Kinect vision.
Background
In recent years, with the rapid development of computer technology, especially the spread of portable computing, human-computer interaction (HCI) has gradually become an important part of modern life. In human communication, voice is undoubtedly a vital information medium: a person's emotions, such as joy, anger, and sorrow, can all be conveyed through speech. Speech has therefore become the principal mode of human-machine interaction, and speech recognition technology has developed rapidly; speech recognition systems such as voice search and voice input methods have become a major trend in today's society.
However, even the most sophisticated speech recognition system struggles to adapt to the complex, variable environments of real life, especially high-noise environments, in which recognition performance drops sharply. Moreover, such systems offer little benefit to hearing-impaired or speech-impaired people. Psychological research shows that in noisy environments people unconsciously use visual information such as lip movements, expressions, and gestures to improve language comprehension. In other words, human perception of language is multimodal: it relies not only on the audio channel but also on visual information to aid understanding during communication. Lip reading research is therefore both a great aid to existing speech recognition systems and a boon for hearing-impaired or speech-impaired people, and it has attracted industry attention and developed vigorously. Lip reading research mainly involves lip region detection and positioning, feature extraction, and training and recognition, with feature extraction at its core. Current feature extraction methods fall into three main categories:
1) Model-based methods abstract the lip contour into a mathematical model to obtain geometric features of the lips. Their disadvantage is that a fixed model may lose important information.
2) Pixel-based methods use the pixel information of a region of interest (ROI), either directly or after some transformation, as the feature vector. Their disadvantage is that the feature vector is high-dimensional and highly redundant.
3) Hybrid methods combine the first two to extract features, for example the active appearance model (AAM) algorithm. After the features are extracted, they are trained and recognized with an HMM model.
Disclosure of Invention
To address the problems in the prior art, the invention provides a lip reading research method based on Kinect vision, which obtains 3D coordinate information of the lip region (the region of interest, ROI) from the captured images and depth data, trains and recognizes with coordinates and spatial angles as features, and explores lip reading based on three-dimensional information. Compared with model-based methods it preserves more information; compared with pixel-based methods it avoids background interference in the extracted data and reduces feature dimensionality and redundancy.
A lip reading research method based on Kinect vision specifically comprises the following steps:
Step 1: acquire the required three-dimensional face data with a Kinect and preprocess the data;
Step 2: locate the lip region, extract the 18 lip feature points, and renumber and model them;
Step 3: extract the features, namely the angle features between feature points and the coordinate features of the feature points, and normalize them;
Step 4: train and recognize the features with a hidden Markov model (HMM) and the K-nearest neighbor (KNN) algorithm.
In Step 1, the acquired data are preprocessed as follows:
A corpus is collected, and the acquired data are ordered by label and timestamp and stored in a binary file. The raw data are then preprocessed: first, the whole audio recording is cut into one clip per word, and the corresponding color images and depth data are stored synchronously in the same location; second, unqualified data are removed and re-recorded.
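As an illustration of this per-word segmentation, the minimal sketch below slices a session by word boundary timestamps; the function name, the sample-rate parameter, and the data layout are assumptions for illustration, not the patent's actual implementation.

```python
import numpy as np

def segment_session(audio, sample_rate, frame_timestamps, word_spans):
    """Cut one recording session into per-word clips.

    audio: 1-D waveform array; frame_timestamps: capture time (seconds)
    of each Kinect color/depth frame; word_spans: (label, t_start, t_end)
    tuples from the corpus annotation.
    """
    segments = []
    for label, t0, t1 in word_spans:
        clip = audio[int(t0 * sample_rate):int(t1 * sample_rate)]
        # color/depth frames recorded while this word was being spoken
        frame_ids = np.where((frame_timestamps >= t0) & (frame_timestamps <= t1))[0]
        segments.append({"label": label, "audio": clip, "frame_ids": frame_ids})
    return segments
```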
Step 2 locates the lip region on the basis of Step 1, as follows:
In the data acquisition stage, a Kinect sensor provides the three-dimensional coordinates of 121 feature points predefined on the speaker's face. Extensive experiments on these data identified which of the 121 face point indices correspond to the 18 lip feature points. From the positional relationship of these 18 points, the three-dimensional coordinates of all lip feature points are extracted for every image frame of every word.
Step 3 extracts features on the basis of Steps 1 and 2, as follows:
Two kinds of features can be selected, each with its own selection and normalization method. The first, by analogy with traditional feature extraction, uses the angles between feature points as features, screened with KNN (K-nearest neighbor); the second directly uses the coordinates of the 18 lip feature points as features, which are normalized before training and recognition.
Step 4 trains on and recognizes the normalized data on the basis of Steps 1 to 3, as follows:
Training and recognition use a 3:2 split between training and test sets in a full-training, full-recognition mode, with KNN classification and HMM model training and recognition. Comparing the recognition results with the test-set labels yields the recognition rate.
This lip reading research method based on Kinect vision trains and recognizes with three-dimensional coordinates and spatial angles as features, improving recognition accuracy over existing methods.
First, lip reading research greatly helps to refine speech recognition systems. Compared with an audio-only speech recognition system, a recognition system that also uses video channel information performs better in complex environments and is more robust. For example, the unvoiced consonants /p/ and /k/ and the voiced consonants /b/ and /d/ are difficult to distinguish from the audio channel alone, but recognition accuracy improves greatly once lip features are incorporated.
Second, in practical applications, lip reading research can assist personal identification, sign language recognition, and similar tasks. Lip reading also plays an important role in restoring the speech of people who lost their hearing later in life, and mouth-shape analysis now provides technical support for criminal investigation and anti-terrorism. Through the study of lip movement patterns, lip reading technology further contributes to speaker recognition, lip movement synthesis, speech-driven face image coding, and speech-driven head movement synthesis, with applications in coding, animated lip movement synthesis, virtual agents on the web, and mouth-shape matching in dubbing.
Finally, in terms of theory, lip reading research spans pattern recognition, computer vision, natural language processing, image processing, and other fields, whose research contents promote and test one another as they develop.
Drawings
FIG. 1 is a schematic flow chart of the lip reading study method of the present invention;
FIG. 2 is a schematic diagram of a Kinect three-dimensional coordinate system;
FIG. 3 is a schematic diagram of 121 feature points of a human face;
FIG. 4 is a schematic diagram of the lip region feature point renumbering.
Detailed Description
The invention is further described with reference to the following figures and examples, but the scope of the invention is not limited thereto.
A lip reading research method based on Kinect vision, the flow of a specific embodiment of which is shown in FIG. 1, includes:
Step S0101: the data to be preprocessed are three-dimensional data acquired by Kinect tracking. The three-dimensional tracking result in the Kinect coordinate system is (x, y, z), where the Z-axis measures the distance from the sensor to the user, the Y-axis points up and down, and the X-axis points left and right; all values are in meters. The coordinate system is shown in FIG. 2.
The three-dimensional face coordinates at a given moment are obtained by processing the color image and the depth image captured at that moment. The Face Tracking SDK predefines 121 feature points on the face, covering all facial contours; 18 of them lie in the lip region, as shown in FIG. 3.
Step S0201: after the 3D data of the lip region are obtained, the lip region is modeled in 3D. The 18 feature points are first renumbered, the inner lip clockwise as 1 to 8 and the outer lip clockwise as 9 to 18, as shown in FIG. 4.
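A minimal sketch of this extraction and renumbering follows. The concrete SDK indices of the lip points among the 121 Face Tracking points were determined by the experiments mentioned above and are not listed in this text, so the index values below are placeholders; only the ordering convention (inner lip 1-8 clockwise, then outer lip 9-18 clockwise) comes from the patent.

```python
import numpy as np

# Placeholder SDK indices -- assumptions for illustration only.
INNER_LIP_SDK = [87, 88, 89, 90, 91, 92, 93, 94]            # renumbered 1..8
OUTER_LIP_SDK = [79, 80, 81, 82, 83, 84, 85, 86, 95, 96]    # renumbered 9..18

def extract_lip_points(face_121):
    """face_121: (121, 3) array of Kinect 3D points in meters.
    Returns an (18, 3) array ordered by the patent's numbering 1..18."""
    return np.asarray(face_121)[INNER_LIP_SDK + OUTER_LIP_SDK]
```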
Step S0202: because the lip information obtained consists of three-dimensional coordinates, whose advantages can hardly be shown in a two-dimensional image, MATLAB is used to model the lip region in three dimensions from these coordinates, yielding a three-dimensional lip contour model.
Step S0301: for angle feature extraction, the lip contour provides 18 coordinate points; if angles are chosen as features, the credibility of each angle, i.e. its individual recognition rate, must be analyzed. Combining the 18 feature points yields 2448 angles (each of the C(18,3) = 816 point triples contributes its three interior angles). Classifying with the KNN method gives the classification accuracy of each word, from which the angles are ranked by credibility in descending order, and the high-credibility angles are selected and extracted as the angle features.
Step S0302: for angle feature normalization, the feature matrix of each word is N × L, where N is the number of image frames of that word and L is the feature dimension per frame, i.e. the number of selected angles. The feature matrix is normalized so that every dimension lies in the interval [-1, 1]; here this is achieved simply by taking the cosine of each angle, which falls in [-1, 1] by construction.
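The sketch below enumerates these angle features for one frame, computing the cosine directly so that extraction and normalization are illustrated together; it is a minimal version under the assumption that `pts` is the (18, 3) array of lip points from the previous step.

```python
from itertools import combinations
import numpy as np

def angle_cosines(pts):
    """Cosines of all 2448 triangle angles among 18 lip points (one frame)."""
    feats = []
    for i, j, k in combinations(range(18), 3):      # 816 point triples
        for apex, a, b in ((i, j, k), (j, i, k), (k, i, j)):
            u = pts[a] - pts[apex]
            v = pts[b] - pts[apex]
            # cosine of the interior angle at `apex`; already in [-1, 1]
            feats.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return np.asarray(feats)                        # length 816 * 3 = 2448
```

In practice only the high-credibility subset of these 2448 values, as ranked by the KNN screening above, would be kept as the per-frame feature vector.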
Step S0303: for three-dimensional coordinate feature extraction, each image frame provides the three-dimensional information of the 18 lip feature points, and these 18 points are analyzed and extracted as the features of the three-dimensional data. Since the information consists exactly of the three-dimensional coordinates of the 18 points, the feature vector of a frame is obtained by directly concatenating them.
Let the coordinates of the 18 points be [xi, yi, zi], i = 1, 2, …, 18. The stitched feature vector is then [x1, x2, …, x18, y1, y2, …, y18, z1, z2, …, z18]. Concatenating the vectors of all image frames of a word in this way yields the word's three-dimensional feature matrix of size N × 54, where N is the number of image frames of the word and 54 = 18 points × 3 coordinates. For example, for the word "operation", whose audio spans 51 image frames, the 51 frames of three-dimensional data yield a 51 × 54 feature matrix.
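A minimal sketch of this stitching step, assuming each frame is the (18, 3) array produced earlier:

```python
import numpy as np

def stitch_word(frames):
    """frames: list of (18, 3) lip-point arrays, one per image frame.
    Returns the word's (N, 54) feature matrix [x1..x18, y1..y18, z1..z18]."""
    rows = [np.concatenate([f[:, 0], f[:, 1], f[:, 2]]) for f in frames]
    return np.vstack(rows)
```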
Step S0304: for three-dimensional coordinate feature normalization, all three-dimensional data are first translated so that they occupy the same position on the coordinate axes; specifically, the 9th feature point of every frame, the left outer lip end point (the speaker's right mouth corner), is moved to a common origin. The data are then rotated so that the line joining the two mouth corners, i.e. the line through feature points 9 and 15, becomes the x-axis, with the other points rotated accordingly. The rotated coordinates serve as the final coordinate features.
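One way to realize this translation and rotation, sketched under the assumption of the 1-based numbering above (hence the -1 offsets), uses Rodrigues' rotation formula to align the mouth-corner direction with the x-axis:

```python
import numpy as np

def normalize_frame(pts):
    """pts: (18, 3) lip points; returns the translated and rotated copy."""
    p = pts - pts[9 - 1]                        # move point 9 to the origin
    d = p[15 - 1] / np.linalg.norm(p[15 - 1])   # unit vector toward point 15
    x = np.array([1.0, 0.0, 0.0])
    v = np.cross(d, x)                          # rotation axis (unnormalized)
    c, s = float(np.dot(d, x)), float(np.linalg.norm(v))
    if s < 1e-12:                               # already parallel to the x-axis
        return p if c > 0 else p * np.array([-1.0, -1.0, 1.0])
    k = v / s
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    R = np.eye(3) + s * K + (1 - c) * (K @ K)   # Rodrigues' rotation formula
    return p @ R.T
```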
Step S0401: after the KNN classifier is trained, the test set is classified. The test set is subject to the same requirement as the training set: it must be stitched into a feature matrix with the same number of columns, so that every sample has the same dimensionality.
Classifying the test set requires the following:
1) a test set in the same format as the training set, with labels attached, so that classification accuracy can be computed afterwards;
2) a value of k: during classification, the class attributes of the k nearest neighbors of the point to be classified are considered, and the class held by the most neighbors is assigned to the point;
3) a classification rule, i.e. the distance metric of KNN; common choices are Euclidean distance, Manhattan distance, and Hamming distance.
With these in place, the classification result of each test sample is obtained, and comparing the results with the test-set labels gives the classification accuracy.
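A minimal sketch covering these three requirements, using scikit-learn as an assumed library choice (the patent does not name one); k = 5 and the Euclidean metric are illustrative defaults:

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_X, train_y, test_X, test_y, k=5):
    """Train a KNN classifier and return its accuracy on the labelled test set.
    train_X / test_X: feature matrices with identical column counts."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    clf.fit(train_X, train_y)
    return clf.score(test_X, test_y)   # fraction of correctly classified samples
```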
Step S0402: the HMM model is trained as follows:
First, the data set is prepared. Three fifths of the samples are drawn at random from the feature file (the file of selected angles produced by the KNN screening during feature extraction) as training samples, the remaining two fifths forming the test set, and each training sample is tagged with its class name as a label.
Second, the HMM model is trained. Once the data set is obtained, one HMM is trained per category on that category's training samples. Several HMM parameters must be defined in this process, including the number of states and the number of Gaussian mixture components. Once these are given, the HMM can be initialized and then iteratively updated to obtain the model of the word.
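A minimal per-word training sketch, assuming the hmmlearn library (a library choice not specified by the patent) and illustrative state and mixture counts:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_models(features_by_word, n_states=5, n_mix=3):
    """features_by_word: dict mapping word -> list of (N_i, D) feature matrices.
    Returns one Gaussian-mixture HMM per word."""
    models = {}
    for word, seqs in features_by_word.items():
        X = np.vstack(seqs)                  # all observations, stacked
        lengths = [len(s) for s in seqs]     # frame count of each utterance
        m = GMMHMM(n_components=n_states, n_mix=n_mix, n_iter=20)
        m.fit(X, lengths)                    # iterative (Baum-Welch) update
        models[word] = m
    return models
```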
Step S0403: recognition corresponds to the decoding problem among the three basic problems of HMMs. Given a test sample, i.e. a known observation sequence, the most probable hidden state sequence is searched for under each HMM, yielding a probability; comparing the probabilities produced by all the HMM models, the model with the highest probability is selected as the recognition result. Comparing the recognition results with the test-set labels gives the recognition rate.
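Continuing the sketch above, recognition scores a test sequence under every word's model via Viterbi decoding and picks the most likely word:

```python
def recognize(models, seq):
    """models: dict word -> trained HMM; seq: (N, D) observation sequence."""
    # decode() returns (Viterbi log-probability, state sequence); keep the former
    scores = {word: m.decode(seq)[0] for word, m in models.items()}
    return max(scores, key=scores.get)       # word whose model scores highest
```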

Claims (5)

1. A lip reading research method based on Kinect vision, characterized by comprising the following steps:
Step 1: acquiring the required three-dimensional face data through a Kinect and preprocessing the data;
Step 2: locating the lip region, extracting the 18 lip feature points, and renumbering and modeling them;
Step 3: extracting features, namely the angle features between feature points and the coordinate features of the feature points, and normalizing them;
Step 4: training and recognizing the features with a hidden Markov model and the K-nearest neighbor algorithm.
2. The Kinect vision-based lip reading research method of claim 1, wherein the acquired data are preprocessed as follows:
a corpus is collected, and the acquired data are ordered by label and timestamp and stored in a binary file; the raw data are preprocessed by first cutting the whole audio recording into one clip per word and storing the corresponding color images and depth data synchronously in the same location, and second removing and re-recording unqualified data.
3. The Kinect vision-based lip reading research method of claim 1, wherein Step 2 locates the lip region on the basis of Step 1 as follows:
in the data acquisition stage, a Kinect sensor provides the three-dimensional coordinates of 121 feature points predefined on the speaker's face; extensive experiments on these data identify which of the 121 face point indices correspond to the 18 lip feature points; and the three-dimensional coordinates of all lip feature points are extracted for every image frame of every word from the positional relationship of these 18 points.
4. The Kinect vision-based lip reading research method of claim 1, wherein Step 3 extracts features on the basis of Steps 1 and 2 as follows:
two kinds of features can be selected, each with its own selection and normalization method; the first, by analogy with traditional feature extraction, uses the angles between feature points as features, selected by the K-nearest neighbor method; the second directly uses the coordinates of the 18 lip feature points as features, which are normalized before training and recognition.
5. The Kinect vision-based lip reading research method of claim 1, wherein Step 4 trains on and recognizes the normalized data on the basis of Steps 1 to 3 as follows:
training and recognition use a 3:2 split between training and test sets in a full-training, full-recognition mode, with K-nearest neighbor classification and HMM model training and recognition; comparing the recognition results with the test-set labels yields the recognition rate.
CN201811357055.2A 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision Pending CN111191490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811357055.2A CN111191490A (en) 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811357055.2A CN111191490A (en) 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision

Publications (1)

Publication Number Publication Date
CN111191490A 2020-05-22

Family

ID=70710612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811357055.2A Pending CN111191490A (en) 2018-11-15 2018-11-15 Lip reading research method based on Kinect vision

Country Status (1)

Country Link
CN (1) CN111191490A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239902A (en) * 2021-07-08 2021-08-10 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data
CN108109614A (en) * 2016-11-24 2018-06-01 广州映博智能科技有限公司 A kind of new robot band noisy speech identification device and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109614A (en) * 2016-11-24 2018-06-01 广州映博智能科技有限公司 A kind of new robot band noisy speech identification device and method
CN107025439A (en) * 2017-03-22 2017-08-08 天津大学 Lip-region feature extraction and normalization method based on depth data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
岳帅 (Yue Shuai): "Research on real-time lip reading technology based on Kinect 3D vision" (基于Kinect三维视觉的实时唇读技术研究) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239902A (en) * 2021-07-08 2021-08-10 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators
CN113239902B (en) * 2021-07-08 2021-09-28 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination