CN112699236A - Deepfake detection method based on emotion recognition and pupil size calculation - Google Patents
- Publication number
- CN112699236A (application number CN202011532434.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- emotion
- voice
- pupil
- deepfake
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/353—Information retrieval of unstructured textual data; clustering; classification into predefined classes
- G06F18/22—Pattern recognition; matching criteria, e.g. proximity measures
- G06T7/0002—Image analysis; inspection of images, e.g. flaw detection
- G06V20/40—Scenes; scene-specific elements in video content
- G06V40/161—Human faces; detection; localisation; normalisation
- G06V40/18—Eye characteristics, e.g. of the iris
- G10L15/063—Speech recognition; training of speech recognition systems
- G10L15/16—Speech classification or search using artificial neural networks
- G10L25/24—Speech or voice analysis; extracted parameters being the cepstrum
- G10L25/63—Speech or voice analysis for estimating an emotional state
- G06T2207/10016—Image acquisition modality: video; image sequence
- G06T2207/30196—Subject of image: human being; person
- G06T2207/30201—Subject of image: face
- G10L2015/0631—Creating reference templates; clustering
Abstract
The invention discloses a Deepfake detection method based on emotion recognition and pupil size calculation, comprising the following steps: (1) dividing the voice data into a training set X and a test set Q, performing data processing, and training and testing a speech recognition model Y; (2) dividing the text data into a training set N and a test set P, performing data processing, and training and testing a text emotion classification model M; (3) extracting the audio from the Deepfake video to be detected, inputting it into the speech recognition model Y, and inputting the output text into the text emotion classification model M to obtain the emotion corresponding to the text; (4) converting the Deepfake video to be detected into picture frames and detecting the pupil size of the human eye; (5) matching the detected pupil size against the emotion obtained by the text emotion classification model M; if they do not match, the video is judged to be a fake video. The method can effectively detect fake videos generated by different methods and has strong generalization capability.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a Deepfake detection method based on emotion recognition and pupil size calculation.
Background
Speech recognition technology enables a computer to understand what a person says, realizing spoken communication between human and machine and outputting the spoken words in text form. In recent years, speech recognition has advanced significantly and has begun to move from the laboratory into daily life, such as the voice assistants and speech translation in smartphones. Common approaches in speech recognition include stochastic modeling, probabilistic parsing, linguistics- and acoustics-based methods, and artificial neural networks, among which stochastic modeling is the most commonly used.
For example, Chinese patent publication No. CN106792140A discloses a broadcast television advertisement monitoring system based on voice recognition. It builds models of the sample voice and of the characteristic values of the voice to be recognized through a voice recognition modeling module, ensuring accurate matching between the features to be recognized and the template features; the matched sound is quantized through a sound matching module to improve matching accuracy. The recognition methods adopted by the voice recognition modeling module include the template matching method and the stochastic model method.
Text sentiment analysis processes text carrying subjective emotional color to obtain its sentiment attribution. The Internet contains a large number of user comments on events, people, products, and so on; these comments carry the users' sentiment tendencies, and text sentiment analysis can reveal public opinion on such subjects. By the granularity of the processed text, sentiment analysis can be divided into three research levels: word level, sentence level, and document level. The invention uses sentence-level text sentiment analysis.
The pupil size of a normal person is related to the emotional state. Pupil dilation and contraction are controlled by smooth muscle, which is in turn controlled by the autonomic nervous system and cannot be changed by conscious will: a person can control his own behavior, language, and actions, but has no way to control his own pupils, especially their slight changes. Psychological studies have shown that a person's pupil size reflects the current emotional state: the pupil can dilate to 4 to 5 times its size when one feels happy or excited, and contracts involuntarily when one feels angry or disgusted.
At present, with the advent of Deepfake technology, people have difficulty distinguishing some fake videos or pictures with the naked eye, and fake pictures or videos with great social impact circulate on the network, for example, swapping the faces of public figures so that they appear to spread false statements, or maliciously defaming others. Detecting these fake pictures or videos is therefore very important. Current Deepfake technology still has defects: some facial details are not forged well enough, such as changes in pupil size and the scale of skin pores.
Disclosure of Invention
The invention provides a Deepfake detection method based on emotion recognition and pupil size calculation, which addresses the problems that existing Deepfake detection techniques cover application scenarios incompletely, often overfit to a particular Deepfake method, and lack generalization capability.
A Deepfake detection method based on emotion recognition and pupil size calculation comprises the following steps:
(1) selecting a corpus of voice data, dividing the voice data into a voice training set X and a voice test set Q, then performing voice data processing, and training and testing a speech recognition model Y;
(2) selecting a corpus of text data, dividing the text data into a text training set N and a text test set P, then performing text data processing, and training and testing a text emotion classification model M;
(3) extracting the audio from the Deepfake video to be detected, processing the audio data, and inputting the processed audio into the speech recognition model Y, which outputs the corresponding text; processing the output text and inputting it into the text emotion classification model M to obtain the emotion corresponding to the text;
(4) converting a Deepfake video to be detected into a picture frame, extracting a face part in the picture frame, and detecting the size of a pupil of a human eye;
(5) matching the detected pupil size of the human eye against the emotion obtained by the text emotion classification model M; if they do not match, the video is judged to be a fake video; if they match, the video is judged to be real.
In step (1), the corpus of speech data adopts the CASIA Chinese emotion corpus, and the speech data processing comprises:
filtering the voice training set X and the voice test set Q to remove noise, then extracting the speech feature parameters MFCC from the training set X.
The speech recognition model Y adopts Baidu's open-source DeepSpeech2 model, the training loss function adopts the Connectionist Temporal Classification (CTC) algorithm, and CTCLoss is defined as follows:
CTCLoss(f(x),T)=-logP(T|f(x))
where y = f(x) is the probability distribution over output characters, and T is the corresponding text.
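For small alphabets and short clips, the CTC loss defined above can be checked by brute force: enumerate every frame-level path, collapse repeated symbols and blanks, and sum the probabilities of the paths that yield the target text. The sketch below is purely illustrative (it is not the DeepSpeech2 implementation, and treating symbol 0 as the blank is an assumption):

```python
import itertools
import math

def collapse(path, blank=0):
    # Standard CTC collapse: merge repeated symbols, then drop blanks.
    out, prev = [], None
    for s in path:
        if s != prev:
            out.append(s)
        prev = s
    return tuple(s for s in out if s != blank)

def ctc_loss_bruteforce(probs, target, blank=0):
    """probs: per-timestep distributions over symbols (index 0 = blank).
    Returns CTCLoss(f(x), T) = -log P(T | f(x)) by summing all alignments."""
    n_steps, n_sym = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_sym), repeat=n_steps):
        if collapse(path, blank) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return -math.log(total)
```

With two timesteps and uniform probabilities over {blank, 'a'}, three of the four paths collapse to "a", so the loss is -log(0.75).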
In step (2), the corpus of text data adopts the NLPCC2013 Chinese microblog data set, and the text data processing comprises the following steps:
for the text data in the text training set N and the text test set P, the corpus is converted into word vectors: word vectors are trained with Google's word2vec, a mapping from words to word vectors is established after training, and the text is word-vector encoded through Keras' Embedding function.
The text emotion classification model M adopts a convolutional network with 3×3 convolution kernels and stride 1. Batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is fed into a ReLU activation function. After features are extracted by the two-dimensional separable convolutions, they are input into a GRU layer, then into a fully connected layer, and classified with a softmax classifier;
the loss function loss of model training adopts the cross-entropy form:

loss = -∑_{c=1}^{M} y_c·log(p_c)

where M is the number of classes, y_c is an indicator variable that equals 1 if class c is the true class of the sample and 0 otherwise, and p_c is the predicted probability that the observed sample belongs to class c.
The specific process of the step (4) is as follows:
converting the Deepfake video to be detected into frame-by-frame pictures using OpenCV;
extracting the human face in the picture by using a dlib tool, detecting key points of human eyes, and segmenting the human eyes;
median filtering is carried out on the human eye picture with a 7×7 filter template to remove normally distributed noise; threshold processing is applied to the image to obtain a binarized black-and-white picture; then edge detection is performed on the picture;
performing Freeman chain code coding on the boundary information obtained by edge detection to extract the edges in the image, and identifying the pupil boundary according to the edge characteristics;
after the pupil boundary is identified, the size of the pupil is calculated.
Furthermore, the Freeman chain code adopts an 8-connected chain code.
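An 8-connected Freeman chain code can be sketched in a few lines: each step between consecutive boundary pixels is encoded as one of eight direction codes. The particular numbering below (0 = right, then counter-clockwise in image coordinates where y grows downward) is an assumed convention, since chain-code numbering varies between texts:

```python
# 8-connected Freeman directions: (dx, dy) -> code, 0 = right (east),
# proceeding counter-clockwise; y increases downward as in image coordinates.
DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
        (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def freeman_chain(points):
    """Encode an ordered list of 8-connected boundary pixels as direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes
```

For a closed boundary, the last point repeats the first, and the resulting code sequence describes the full contour from the starting coordinate, as the description says.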
Further, pupil fitting and size calculation are performed with the Hough circle fitting method: the image space is converted into a parameter space, the circle center is detected, and the circle's radius is deduced from the center, completing the detection of the pupil size.
The invention converts the speech of the speaker into text, classifies the emotion of the text to obtain the speaker's emotional state, and compares it with the pupil size to judge whether the video is real or fake.
The invention has the following beneficial effects:
the method is used for detecting the physiological characteristics of human eyes, can better detect the false videos generated by different Deepfake methods, and has strong generalization capability and wide application range.
Drawings
FIG. 1 is an overall method flow diagram in an embodiment of the invention;
FIG. 2 is a flow diagram of speech recognition in an embodiment of the present invention;
FIG. 3 is a flow diagram of textual emotion analysis in an embodiment of the present invention;
FIG. 4 is a flow chart of pupil size calculation according to an embodiment of the present invention;
FIG. 5 is a diagram of a speech recognition model Y in accordance with an embodiment of the present invention;
FIG. 6 is a diagram of a text emotion classification model M according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a Deepfake detection method based on emotion recognition and pupil size calculation includes:
step 1, data processing
(1-1) data set
The CASIA Chinese emotion corpus is used as the training data set of the speech recognition model Y. The corpus was recorded by the Institute of Automation, Chinese Academy of Sciences, and contains recordings by four professional speakers in six emotions (angry, happy, fear, sad, surprise, and neutral), 9600 utterances in total. Among them, 300 sentences share the same text and 100 sentences have different texts. "Same text" means that the four professional speakers read the same sentences with the 6 different emotions, for a total of 300 × 4 × 6 = 7200 utterances. "Different texts" means that the emotion of each sentence can be seen from its literal meaning, for a total of 100 × 4 × 6 = 2400 utterances. The speech recognition process is shown in fig. 2.
The NLPCC2013 Chinese microblog data set is used as the training data set of the text emotion classification model M. This corpus targets recognition of the emotion expressed by an entire microblog post; it is not a simple positive/negative classification but involves multiple fine-grained emotion classes (such as sadness, worry, happiness, excitement, and the like), making it a fine-grained emotion classification problem. The text emotion classification process is shown in fig. 3.
(1-2) dividing training set and test set
Training the speech recognition model Y only requires the different-text utterances in the CASIA Chinese emotion corpus, 2400 sentences in total. The 100 different texts are split at a ratio of 4:1 into a training set X and a test set Q: the training set X contains 1920 utterances of 80 different texts, and the test set Q contains 480 utterances of 20 different texts. The training set X is used to train the speech recognition model Y, and the test set Q is used to test its recognition accuracy.
The NLPCC2013 Chinese microblog data set is likewise divided at a 4:1 ratio into a training set N and a test set P.
(1-3) processing the data set
The training set X and the test set Q are voice data; noise is removed by filtering. The speech feature parameters MFCC (Mel-Frequency Cepstral Coefficients) are then extracted from the training set X. MFCC is a short-term power-spectrum representation of sound that transforms the audio signal through a series of steps simulating human auditory perception. The process is: pre-emphasize, frame, and window the speech; obtain the spectrum of each short-time analysis window by the Fast Fourier Transform (FFT); convert the linear spectrum into a Mel spectrum reflecting human auditory characteristics via the Mel filterbank; finally, perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the features of that speech frame.
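The MFCC pipeline just described (pre-emphasis, framing and windowing, FFT power spectrum, Mel filterbank, cepstral analysis) can be sketched in pure NumPy. All parameter values here (16 kHz sample rate, 25 ms frames with 10 ms hop, 26 filters, 13 coefficients, 0.97 pre-emphasis) are common defaults assumed for illustration, not values specified by the patent:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # 1) pre-emphasis
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing + Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 3) power spectrum via FFT
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 4) triangular Mel filterbank, linearly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c): fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r): fbank[i - 1, k] = (r - k) / max(r - c, 1)
    feat = np.log(power @ fbank.T + 1e-10)
    # 5) cepstral analysis: DCT-II of the log Mel energies, keep n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return feat @ dct.T          # (n_frames, n_ceps)
```

One second of 16 kHz audio yields 98 frames of 13 coefficients each; libraries such as librosa or python_speech_features would normally be used in practice.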
The NLPCC2013 Chinese microblog data set is a text corpus, while the input of the text emotion classification model M is word vectors, so the corpus must be converted into word vectors. The corpus is segmented with the jieba word-segmentation tool and stop words are removed; word vectors are then trained with Google's word2vec, a toolkit released by Google in 2013 for obtaining word vectors. After training, a mapping from words to word vectors is established, and the text is word-vector encoded through Keras' Embedding function, yielding the training set N and the test set P.
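The word-to-vector mapping and Embedding-style lookup can be illustrated with a toy example. The vocabulary, embedding dimension, and random embedding matrix below are illustrative stand-ins (word2vec would learn the actual vectors, and Keras' Embedding layer performs the same index-to-row lookup):

```python
import numpy as np

# Toy vocabulary -> index mapping; index 0 doubles as padding/unknown.
vocab = {"<pad>": 0, "今天": 1, "很": 2, "开心": 3}
dim = 8
rng = np.random.default_rng(0)
# Embedding matrix: row i is the word vector for vocabulary index i.
E = rng.normal(size=(len(vocab), dim))

def encode(tokens, maxlen=6):
    """Map segmented tokens to indices (unknown -> 0), pad to maxlen,
    and look up their word vectors, like an Embedding layer would."""
    idx = [vocab.get(t, 0) for t in tokens][:maxlen]
    idx += [0] * (maxlen - len(idx))
    return E[np.array(idx)]          # shape (maxlen, dim)
```

In the real pipeline the tokens come from jieba segmentation and E from word2vec training on the microblog corpus.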
Step 2, training the model
And (2-1) training the speech recognition model Y by taking the speech training set X divided in the step 1 as input.
The speech recognition model Y adopts Baidu's open-source DeepSpeech2 model, which is based on Baidu's PaddlePaddle framework and is powerful yet simple to use. The structure of the model is shown in fig. 5 and consists of three parts. The first part is 2D invariant convolution. The second part is the Gated Recurrent Unit (GRU), a variant of the conventional RNN that, like the LSTM, can effectively capture semantic associations across long sequences and alleviate gradient vanishing or explosion, while being simpler in structure and computation than the LSTM. The last part is a fully connected layer used to shape the output; the resulting logits are used to compute the CTC loss function and to decode. Batch normalization is applied to each layer's input in the model to reduce the distribution gap between input and output, increase the model's generalization capability, and accelerate training. The input of the model is the spectrogram of a power-normalized audio clip, and the output is simplified Chinese text.
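The GRU part of the model can be made concrete with a single pure-NumPy GRU step. The gate equations are the standard GRU formulation; the weight shapes and initialization here are illustrative assumptions, not DeepSpeech2's actual parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """One timestep of a GRU: update gate z, reset gate r, candidate state."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        w = lambda *shape: rng.normal(scale=0.1, size=shape)
        self.Wz, self.Uz, self.bz = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)
        self.Wr, self.Ur, self.br = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)
        self.Wh, self.Uh, self.bh = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)          # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
        return (1 - z) * h + z * h_tilde                          # new hidden state
```

The two gates are what let the GRU carry information across long sequences with fewer parameters than an LSTM, as noted in the text.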
The loss function employs the Connectionist Temporal Classification (CTC) algorithm. Its main advantage is that unaligned data can be aligned automatically, so it is mainly used for training on serialized data without prior alignment, such as speech. The CTC loss CTCLoss can be interpreted as the negative log of the sum of the probabilities of outputting the correct label given the sample. CTCLoss is defined as follows:
CTCLoss(f(x),T)=-logP(T|f(x))
where y = f(x) is the probability distribution over output characters, and T is the corresponding text.
And (2-2) training the text emotion classification model M by taking the training set N divided in the step 1 as input.
The structure of the text emotion classification model M is shown in fig. 6. The two-dimensional convolution uses 3×3 kernels with stride 1. To prevent overfitting and improve convergence speed, batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is fed into a ReLU activation function. After features are extracted by the two-dimensional separable convolutions, they are input into a GRU layer, then into a fully connected layer, and classified with a softmax classifier. The classification results are set to four categories, namely the four emotions pleasure, calm, anxiety, and impatience. The loss function of the model takes the cross-entropy form:

loss = -∑_{c=1}^{M} y_c·log(p_c)

where M is the number of classes, y_c is an indicator variable that equals 1 if class c is the true class of the sample and 0 otherwise, and p_c is the predicted probability that the observed sample belongs to class c.
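The cross-entropy formula reduces to a one-liner for a one-hot label; this minimal sketch (with a made-up four-class probability vector standing in for the softmax output) shows it directly:

```python
import math

def cross_entropy(y_onehot, p):
    """loss = -sum_c y_c * log(p_c); with a one-hot y this is just
    -log of the probability assigned to the true class."""
    return -sum(y * math.log(pc) for y, pc in zip(y_onehot, p) if y)
```

If the true class is the second of the four emotion categories and the model assigns it probability 0.7, the loss is -log(0.7) ≈ 0.357; a perfect prediction gives a loss of 0.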
Step 3, testing the model
The test set Q and the test set P divided in step 1 are input into the trained speech recognition model Y and the trained text emotion classification model M respectively, and the classification accuracy of each model is obtained from its output on the test set.
Step 4, pupil size calculation
The pupil size calculation requires some digital image processing techniques; OpenCV is adopted to process the eye images. OpenCV is a cross-platform computer vision and machine learning software library released under the BSD (open source) license, and the vision processing algorithms it provides are very rich.
As shown in fig. 4, the pupil size detection process first converts the video into frame-by-frame pictures with OpenCV, then extracts the face in each picture with the dlib tool, detects the key points of the human eyes, and segments the eyes. Median filtering is then applied to the eye picture with a 7×7 filter template to remove normally distributed noise. Next, the picture is thresholded: a threshold is chosen to binarize the picture into black and white, using the one-dimensional maximum-entropy threshold segmentation method. Edge detection is then performed; its purpose is to identify points of obvious brightness change in the digital picture. The Prewitt operator is adopted, which uses the gray-level differences between a pixel and its upper, lower, left, and right neighbors to reach an extremum at edges. Freeman chain code coding is a method that describes a curve or boundary by the coordinates of the curve's starting point and the direction codes of its boundary points; it is often used to represent curves and region boundaries in image processing, computer graphics, and pattern recognition. The adopted 8-connected chain code considers the eight neighbors of the center point: up, upper right, right, lower right, down, lower left, left, and upper left. The 8-connected chain code is consistent with the actual pixel layout and can accurately describe the information of the center pixel and its neighbors.
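The Prewitt step can be sketched directly: two 3×3 kernels estimate the horizontal and vertical gray-level gradients, and the gradient magnitude peaks at edges. This is a plain-NumPy illustration (in practice OpenCV's filtering functions would be used on the real eye image):

```python
import numpy as np

# Prewitt kernels: horizontal (x) and vertical (y) gradient estimates.
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]])
PREWITT_Y = PREWITT_X.T

def prewitt_edges(img):
    """Gradient-magnitude map; border pixels are left at zero."""
    h, w = img.shape
    mag = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx = np.sum(patch * PREWITT_X)
            gy = np.sum(patch * PREWITT_Y)
            mag[i, j] = np.hypot(gx, gy)
    return mag
```

On a binarized eye image, as produced by the thresholding step, the magnitude is large exactly along the pupil boundary and zero in flat regions.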
After the edges in the image are extracted by Freeman chain code coding, the pupil boundary can be identified according to the edge characteristics. Once the pupil boundary is identified, pupil fitting and size calculation are performed with the Hough circle fitting method. The principle of the standard Hough transform is to convert the image space into a parameter space, then detect the circle center, and deduce the circle's radius from the center. This completes the detection of the pupil size.
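The image-space-to-parameter-space idea can be shown with a minimal Hough accumulator: each edge point votes for every center that would place it on a circle of the given radius, and the true center collects the most votes. This is a simplified, fixed-radius sketch (a real implementation, e.g. OpenCV's Hough circle detection, also searches over radii):

```python
import numpy as np

def hough_circle_center(edge_points, shape, radius):
    """Vote in (a, b) parameter space: each edge point (x, y) votes for all
    candidate centers at distance `radius` from it; return the argmax cell."""
    acc = np.zeros(shape, dtype=int)
    thetas = np.linspace(0, 2 * np.pi, 180, endpoint=False)
    for (x, y) in edge_points:
        a = np.round(x - radius * np.cos(thetas)).astype(int)
        b = np.round(y - radius * np.sin(thetas)).astype(int)
        ok = (a >= 0) & (a < shape[0]) & (b >= 0) & (b < shape[1])
        np.add.at(acc, (a[ok], b[ok]), 1)
    return np.unravel_index(np.argmax(acc), acc.shape)
```

Running the search over a small range of radii and taking the radius whose peak vote is highest recovers the pupil radius as well, which is the size measurement the method needs.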
Step 5, judging whether the video is real or fake
The method is aimed at detecting Deepfake videos, so the sample under test is a video. moviepy is used to extract the audio from the video; after processing, the audio is input into the voice recognition model Y, which outputs the corresponding text. The output text is processed and input into the text emotion classification model M to obtain the emotion corresponding to the text. The size of the pupils of the human eyes in the video is then detected and, with reference to Table 1, checked against the pupil size expected for that emotion; if the pupil size does not match the emotion, the video is judged to be a fake video.
TABLE 1

Emotional state | Pupil size (unit: mm)
---|---
Pleasure | 5.34 ± 1.41
Calm | 3.50 ± 1.25
Anxiety | 3.17 ± 0.86
Impatience | 4.91 ± 1.81
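The matching rule of step 5 can be sketched as a simple range check against Table 1. The ± values are read here as tolerance intervals, the lowercase state keys are simplified names ("calm" for the quiet state, "anxiety" for the anxious state), and `judge_video` is a hypothetical helper, not from the patent:

```python
# Table 1: expected pupil size (mm) per emotional state, as (mean, tolerance)
PUPIL_RANGES = {
    "pleasure":   (5.34, 1.41),
    "calm":       (3.50, 1.25),
    "anxiety":    (3.17, 0.86),
    "impatience": (4.91, 1.81),
}

def is_consistent(emotion, pupil_mm):
    """True if the measured pupil size falls inside the emotion's range."""
    mean, tol = PUPIL_RANGES[emotion]
    return mean - tol <= pupil_mm <= mean + tol

def judge_video(emotion, pupil_mm):
    """Step 5: a mismatch between text emotion and pupil size => fake."""
    return "real" if is_consistent(emotion, pupil_mm) else "fake"

print(judge_video("calm", 3.4))      # real: 3.4 lies within 3.50 +/- 1.25
print(judge_video("pleasure", 2.0))  # fake: 2.0 lies outside 5.34 +/- 1.41
```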
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.
Claims (9)
1. A Deepfake detection method based on emotion recognition and pupil size calculation is characterized by comprising the following steps:
(1) selecting a corpus of voice data, dividing the voice data into a voice training set X and a voice test set Q, then performing voice data processing, and training and testing a voice recognition model Y;
(2) selecting a corpus of text data, dividing the text data into a text training set N and a text test set P, then performing text data processing, and training and testing a text emotion classification model M;
(3) extracting audio from the Deepfake video to be detected, performing data processing on the audio and inputting it into the voice recognition model Y, which outputs the corresponding text; performing data processing on the output text and inputting it into the text emotion classification model M to obtain the emotion corresponding to the text;
(4) converting the Deepfake video to be detected into picture frames, extracting the face part in the picture frames, and detecting the size of the pupil of the human eye;
(5) matching the detected pupil size of the human eye against the emotion obtained by the text emotion classification model M; if they do not match, the video is judged to be a fake video; if they match, the video is judged to be real.
2. The method for Deepfake detection based on emotion recognition and pupil size calculation according to claim 1, wherein in step (1), the corpus of voice data adopts the CASIA Chinese emotion corpus, and the voice data processing comprises:
filtering the voice training set X and the voice test set Q to remove noise, and then extracting the MFCC speech feature parameters.
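The MFCC extraction of claim 2 would normally use a library such as librosa or python_speech_features; the numpy sketch below walks through the standard steps (framing, Hamming window, power spectrum, mel filterbank, log, DCT-II) under simplified, assumed parameters.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Simplified MFCC: frame -> Hamming window -> power spectrum
    -> mel filterbank -> log -> DCT-II. Returns (n_frames, n_mfcc)."""
    # frame the signal into overlapping windows
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop : i*hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the filterbank energies; keep the first n_mfcc
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = mfcc(sig)
print(feats.shape)  # (97, 13)
```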
3. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 2, wherein in step (1), the voice recognition model Y is the DeepSpeech2 model open-sourced by Baidu, and the loss function used for training is the connectionist temporal classification (CTC) loss, defined as follows:
CTCLoss(f(x), T) = -log P(T | f(x))
where y = f(x) is the probability distribution over the output characters, and T is the corresponding text.
4. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 1, wherein in step (2), the corpus of text data adopts the NLPCC2013 Chinese microblog data set, and the text data processing comprises:
converting the corpus of the text training set N and the text test set P into word vectors: the word vectors are trained with Google's word2vec, which after training establishes a mapping from words to word vectors, and the text is word-vector encoded through the Embedding function of keras.
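The word-to-vector mapping of this claim (word2vec training followed by a keras Embedding lookup) boils down to an index lookup into an embedding matrix. A minimal numpy stand-in, with an assumed toy vocabulary and random vectors in place of trained word2vec weights:

```python
import numpy as np

# toy vocabulary; in the patent the vectors come from word2vec training
vocab = {"<pad>": 0, "today": 1, "very": 2, "happy": 3}
dim = 8
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), dim))  # stand-in weights

def encode(tokens, max_len=6):
    """Map tokens to indices (0 for padding / unknown) and look up their
    vectors, mimicking keras' Embedding layer on a padded sequence."""
    ids = [vocab.get(t, 0) for t in tokens][:max_len]
    ids += [0] * (max_len - len(ids))
    return embedding_matrix[ids]  # shape (max_len, dim)

vecs = encode(["today", "very", "happy"])
print(vecs.shape)  # (6, 8)
```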
5. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 4, wherein in step (2), the text emotion classification model M adopts a convolutional network with a 3×3 convolution kernel and a stride of 1; batch normalization is added after the convolution layer and the max pooling layer, and the normalized results are fed into a ReLU activation function; after features are extracted by the two-dimensional separable convolutions, they are input into a GRU layer, then into a fully connected layer, and classification is finally performed by a softmax classifier.
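The GRU layer named in this claim updates a hidden state through update and reset gates. A single GRU cell step in numpy, with randomly initialised weights and one of the two standard gating conventions, purely to illustrate the shapes and the recurrence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step. W: (3, dim_h, dim_x), U: (3, dim_h, dim_h),
    b: (3, dim_h); rows ordered as update gate, reset gate, candidate."""
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1 - z) * h + z * h_tilde                     # new hidden state

dim_x, dim_h = 4, 5
rng = np.random.default_rng(1)
W = rng.normal(size=(3, dim_h, dim_x))
U = rng.normal(size=(3, dim_h, dim_h))
b = np.zeros((3, dim_h))
h = np.zeros(dim_h)
for x in rng.normal(size=(7, dim_x)):  # a length-7 feature sequence
    h = gru_step(x, h, W, U, b)
print(h.shape)  # (5,)
```

The final hidden state `h` is what would be passed on to the fully connected layer and softmax classifier described in the claim.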
6. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 5, wherein the loss function loss used to train the text emotion classification model M is the cross entropy, with the formula:
loss = -Σ_{c=1}^{M} y_c log(p_c)
where M is the number of classes, y_c is a binary indicator that equals 1 if class c is the same as the class of the sample and 0 otherwise, and p_c is the predicted probability that the observed sample belongs to class c.
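The cross-entropy formula of this claim, evaluated directly in numpy on assumed toy values:

```python
import numpy as np

def cross_entropy(y, p):
    """loss = -sum_c y_c * log(p_c), with y a one-hot class indicator
    and p the predicted class probabilities."""
    return -np.sum(y * np.log(p))

y = np.array([0.0, 1.0, 0.0])   # sample belongs to class 1
p = np.array([0.2, 0.7, 0.1])   # softmax output of the classifier
print(round(cross_entropy(y, p), 4))  # -log(0.7) ~ 0.3567
```

With a one-hot `y` the sum reduces to the negative log-probability assigned to the true class, which is exactly what the softmax classifier of claim 5 is trained to maximise.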
7. The method for detecting Deepfake based on emotion recognition and pupil size calculation as claimed in claim 1, wherein the specific process of step (4) is:
converting the Deepfake video to be detected into a frame-by-frame picture by using OpenCV;
extracting the human face in the picture by using a dlib tool, detecting key points of human eyes, and segmenting the human eyes;
performing median filtering on the human eye picture, using a 7×7 filtering template to remove normally distributed noise; performing threshold processing on the picture to obtain a high-contrast black-and-white picture; and then performing edge detection on the picture;
performing Freeman chain code coding on the boundary information obtained by the edge detection to extract the edges in the image, and identifying the pupil boundary from the edge features;
after the pupil boundary is identified, the size of the pupil is calculated.
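The edge-detection step of claim 7 uses, per the description, the Prewitt operator, which takes horizontal and vertical gray-level differences around each pixel. A numpy sketch on a synthetic step edge (the naive convolution is for illustration only):

```python
import numpy as np

# Prewitt kernels: gray-level differences of the neighbours around a pixel
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
PREWITT_Y = PREWITT_X.T

def convolve2d(img, kernel):
    """Naive 'valid' 2-D correlation, enough for this illustration."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def prewitt_magnitude(img):
    gx = convolve2d(img, PREWITT_X)  # horizontal gradient
    gy = convolve2d(img, PREWITT_Y)  # vertical gradient
    return np.hypot(gx, gy)

# synthetic image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 255.0
mag = prewitt_magnitude(img)
print(np.unravel_index(mag.argmax(), mag.shape)[1])  # column of the edge peak
```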
8. The method of claim 7, wherein the Freeman chain code is an 8-connected chain code.
9. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 7, wherein the fitting and size calculation of the pupil are performed using a Hough circle fitting method, specifically: the image space is converted into a parameter space, the circle centre is detected, and the circle radius is deduced from the centre, thereby completing the detection of the pupil size.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011532434.8A CN112699236B (en) | 2020-12-22 | 2020-12-22 | Deepfake detection method based on emotion recognition and pupil size calculation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112699236A true CN112699236A (en) | 2021-04-23 |
CN112699236B CN112699236B (en) | 2022-07-01 |
Family
ID=75510687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011532434.8A Active CN112699236B (en) | 2020-12-22 | 2020-12-22 | Deepfake detection method based on emotion recognition and pupil size calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112699236B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190392700A1 (en) * | 2016-11-14 | 2019-12-26 | Instant Care, Inc. | Methods of and devices for filtering out false alarms to the call centers using a non-gui based user interface for a user to input a control command |
CN110969106A (en) * | 2019-11-25 | 2020-04-07 | 东南大学 | Multi-mode lie detection method based on expression, voice and eye movement characteristics |
CN111160286A (en) * | 2019-12-31 | 2020-05-15 | 中国电子科技集团公司信息科学研究院 | Video authenticity identification method |
CN111738199A (en) * | 2020-06-30 | 2020-10-02 | 中国工商银行股份有限公司 | Image information verification method, image information verification device, image information verification computing device and medium |
US20200349337A1 (en) * | 2019-05-01 | 2020-11-05 | Accenture Global Solutions Limited | Emotion sensing artificial intelligence |
Non-Patent Citations (1)

Title |
---|
XU, Tianyu et al., "Application and Research of Facial Emotion Analysis Based on Feature Point Detection" (in Chinese), Computer and Information Technology (《电脑与信息技术》), vol. 28, no. 3, 30 June 2020, pages 13-16 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117059131A (en) * | 2023-10-13 | 2023-11-14 | 南京龙垣信息科技有限公司 | False audio detection method based on emotion recognition |
CN117059131B (en) * | 2023-10-13 | 2024-03-29 | 南京龙垣信息科技有限公司 | False audio detection method based on emotion recognition |
Also Published As
Publication number | Publication date |
---|---|
CN112699236B (en) | 2022-07-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||