CN112699236A - Deepfake detection method based on emotion recognition and pupil size calculation - Google Patents

Deepfake detection method based on emotion recognition and pupil size calculation

Info

Publication number
CN112699236A
CN112699236A
Authority
CN
China
Prior art keywords
text
emotion
voice
pupil
deepfake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011532434.8A
Other languages
Chinese (zh)
Other versions
CN112699236B (en)
Inventor
刘毅
王鹏程
陈晋音
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202011532434.8A priority Critical patent/CN112699236B/en
Publication of CN112699236A publication Critical patent/CN112699236A/en
Application granted granted Critical
Publication of CN112699236B publication Critical patent/CN112699236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Ophthalmology & Optometry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Deepfake detection method based on emotion recognition and pupil size calculation, which comprises the following steps: (1) dividing voice data into a training set X and a test set Q, performing data processing, and training and testing a voice recognition model Y; (2) dividing text data into a training set N and a test set P, performing data processing, and training and testing a text emotion classification model M; (3) extracting the audio from the Deepfake video to be detected, inputting the audio into the voice recognition model Y, and inputting the output text into the text emotion classification model M to obtain the emotion corresponding to the text; (4) converting the Deepfake video to be detected into picture frames and detecting the pupil size of the human eye; (5) matching the detected pupil size against the emotion obtained from the text emotion classification model M, and judging the video to be false if they do not match. The method can effectively detect false videos generated by different methods and has strong generalization capability.

Description

Deepfake detection method based on emotion recognition and pupil size calculation
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a Deepfake detection method based on emotion recognition and pupil size calculation.
Background
Speech recognition technology enables a computer to understand what a person says, realizing spoken communication between human and machine, and can also output the words spoken by a person in text form. In recent years, speech recognition technology has advanced significantly and has begun to move from the laboratory into everyday life, for example as voice assistants and speech translation in smartphones. Common methods used in speech recognition include stochastic modeling, probabilistic parsing, methods based on linguistics and acoustics, and methods using artificial neural networks, among which stochastic modeling is the most commonly used.
For example, Chinese patent publication No. CN106792140A discloses a broadcast television advertisement monitoring system based on voice recognition, in which a voice recognition modeling module builds models from the characteristic values of sample voice and of the voice to be recognized, ensuring accurate recognition between the features to be recognized and the template features; the matched sound is quantized through a sound matching module to improve matching accuracy. The recognition methods adopted by the voice recognition modeling module include a template matching method and a stochastic model method.
Text sentiment analysis analyzes a text carrying subjective emotional color to obtain its corresponding sentiment category. The Internet contains a large number of user comments on events, people, products, and so on; these comments carry the users' emotional tendencies, and text sentiment analysis can reveal public opinion about those events, people, and products. According to the granularity of the processed text, sentiment analysis can be divided into three research levels: word level, sentence level, and document level. The invention uses sentence-level text sentiment analysis.
The pupil size of a normal person is related to emotional state. Pupil dilation and contraction are controlled by smooth muscle, which is governed by the autonomic nervous system and cannot be changed consciously: a person can control their behavior, speech, and actions, but cannot control their pupils, and in particular cannot control slight changes in pupil size. Psychological studies have shown that a person's pupil size reflects their current emotional state, dilating to 4 to 5 times its size when the person feels happy or excited, and contracting involuntarily when the person feels angry or bored.
At present, with the advent of Deepfake technology, it has become difficult to distinguish some false videos or pictures with the naked eye, and false pictures or videos with great social influence circulate on the network, for example swapping the faces of public figures so that they appear to spread false statements, or maliciously defaming other people. Detecting such false pictures or videos is therefore very important. However, current Deepfake technology still has defects: some facial details are not forged well enough, such as changes in pupil size and pore scaling.
Disclosure of Invention
The invention provides a Deepfake detection method based on emotion recognition and pupil size calculation, which addresses the problems that existing Deepfake detection techniques do not cover application scenarios comprehensively, often overfit to a particular Deepfake generation method, and lack generalization capability.
A Deepfake detection method based on emotion recognition and pupil size calculation comprises the following steps:
(1) selecting a corpus of voice data, dividing the voice data into a voice training set X and a voice test set Q, then performing voice data processing, and training and testing a voice recognition model Y;
(2) selecting a corpus of text data, dividing the text data into a text training set N and a text test set P, then performing text data processing, and training and testing a text emotion classification model M;
(3) extracting the audio from the Deepfake video to be detected, processing the audio data and inputting it into the voice recognition model Y, which outputs the corresponding text; processing the output text and inputting it into the text emotion classification model M to obtain the emotion corresponding to the text;
(4) converting a Deepfake video to be detected into a picture frame, extracting a face part in the picture frame, and detecting the size of a pupil of a human eye;
(5) matching the detected pupil size of the human eye against the emotion obtained from the text emotion classification model M; if they do not match, the video is judged to be a false video; if they match, the video is judged to be genuine.
In step (1), the corpus of speech data adopts the CASIA Chinese emotion corpus, and the speech data processing comprises:
filtering the voice training set X and the voice test set Q to remove noise, and then extracting the speech feature parameters MFCC from the voice training set X.
The speech recognition model Y adopts the Baidu open-source DeepSpeech 2 model. The training loss function adopts the Connectionist Temporal Classification (CTC) algorithm, and CTCLoss is defined as follows:
CTCLoss(f(x), T) = -log P(T | f(x))
where y = f(x) is the probability distribution over output characters and T is the corresponding text.
In step (2), the corpus of text data adopts the NLPCC2013 Chinese microblog data set, and the text data processing comprises the following steps:
for text data in a voice training set X and a voice testing set Q, a corpus is converted into word vectors, word vectors are trained by adopting word2vec of Google, mapping from the words to the word vectors is established after the word vectors are trained, and word vector coding is carried out on the text through an Embedding function of keras.
The text emotion classification model M adopts a convolutional network with a 3 × 3 convolution kernel and a stride of 1. Batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is fed into the activation function, which is ReLU. After features are extracted by the two-dimensional separable convolution, they are input into a GRU layer, then into a fully connected layer, and classification is performed with a softmax classifier;
the loss function loss of model training adopts a cross entropy form, and the formula is as follows:
loss = -∑_{c=1}^{M} y_c·log(p_c)
where M is the number of classes; y_c is an indicator variable equal to 1 if class c is the same as the class of the sample and 0 otherwise; and p_c is the predicted probability that the observed sample belongs to class c.
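As a brief numeric illustration of the cross-entropy formula above (not part of the original disclosure; the probabilities below are made up), a Python sketch:

```python
import numpy as np

# One sample whose true emotion is class 0 among M = 4 classes.
y = np.array([1, 0, 0, 0])            # indicator variable y_c
p = np.array([0.7, 0.1, 0.1, 0.1])    # hypothetical predicted probabilities p_c
loss = -np.sum(y * np.log(p))         # loss = -sum_c y_c * log(p_c)
print(round(float(loss), 4))          # 0.3567, i.e. -log(0.7)
```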
The specific process of the step (4) is as follows:
converting the Deepfake video to be detected into a frame-by-frame picture by using OpenCV;
extracting the human face in the picture by using a dlib tool, detecting key points of human eyes, and segmenting the human eyes;
median filtering is carried out on the human-eye picture, using a 7 × 7 filtering template to remove normally distributed noise; thresholding is applied to the image to obtain a high-contrast black-and-white picture; edge detection is then carried out on the picture;
performing Freeman chain-code coding on the boundary information obtained from edge detection to extract the edges in the image, and identifying the pupil boundary according to the edge characteristics;
after the pupil boundary is identified, the size of the pupil is calculated.
Furthermore, the Freeman chain code adopts an 8-connected chain code.
Further, fitting and size calculation of the pupil are performed by adopting a Hough circle fitting method, which specifically comprises the following steps: the image space is converted into a parameter space, then the center of a circle is detected, and the radius of the circle is deduced from the center of the circle, so that the detection of the size of the pupil is completed.
The invention converts the speaker's voice into text, performs emotion classification on the text to obtain the speaker's emotional state, and compares it against the pupil size to judge whether the video is genuine or false.
The invention has the following beneficial effects:
the method is used for detecting the physiological characteristics of human eyes, can better detect the false videos generated by different Deepfake methods, and has strong generalization capability and wide application range.
Drawings
FIG. 1 is an overall method flow diagram in an embodiment of the invention;
FIG. 2 is a flow diagram of speech recognition in an embodiment of the present invention;
FIG. 3 is a flow diagram of textual emotion analysis in an embodiment of the present invention;
FIG. 4 is a flow chart of pupil size calculation according to an embodiment of the present invention;
FIG. 5 is a diagram of a speech recognition model Y in accordance with an embodiment of the present invention;
FIG. 6 is a diagram of a text emotion classification model M according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a Deepfake detection method based on emotion recognition and pupil size calculation comprises the following steps:
step 1, data processing
(1-1) data set
The CASIA Chinese emotion corpus is used as the training data set of the voice recognition model Y. The corpus was recorded by the Institute of Automation, Chinese Academy of Sciences, and contains recordings of four professional speakers covering six emotions: angry, happy, fear, sad, surprise, and neutral, for a total of 9600 utterances. Of these, 300 sentences share the same texts and 100 sentences use different texts. The same-text utterances are the 300 sentences read by the four professional speakers with the 6 different emotions, for a total of 300 × 4 × 6 = 7200 sentences. The different-text utterances are sentences whose emotion can be seen from the literal meaning, for a total of 100 × 4 × 6 = 2400 sentences. The speech recognition flow is shown in fig. 2.
The NLPCC2013 Chinese microblog data set is used as the training data set of the text emotion classification model M. This corpus is mainly used to identify the emotion expressed by an entire microblog post; rather than a simple like/dislike binary classification, it involves multiple fine-grained emotion categories (such as sadness, worry, happiness, excitement, and so on) and thus belongs to the fine-grained emotion classification problem. The text emotion classification flow is shown in fig. 3.
(1-2) dividing training set and test set
Training the speech recognition model Y only requires the different-text utterances in the CASIA Chinese emotion corpus, 2400 sentences in total. The 100 different texts are divided in a 4:1 ratio into a training set X and a test set Q: the training set X contains the 1920 read utterances of 80 different texts, and the test set Q contains the 480 read utterances of the remaining 20 texts. The training set X is used for training the voice recognition model Y, and the test set Q is used for testing its recognition accuracy.
The NLPCC2013 Chinese microblog data set is likewise divided in a 4:1 ratio into a training set N and a test set P.
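For illustration only, the 4:1 splits can be sketched with scikit-learn; the file and sentence lists below are hypothetical stand-ins, and a simple random split is used here even though the patent groups the CASIA split by text:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the 2400 different-text CASIA utterances and the NLPCC2013 microblog texts.
casia_wavs = [f"casia_{i:04d}.wav" for i in range(2400)]
weibo_texts = [f"microblog sentence {i}" for i in range(10000)]

# A 4:1 ratio corresponds to test_size=0.2.
X_train, Q_test = train_test_split(casia_wavs, test_size=0.2, random_state=0)   # speech sets X / Q
N_train, P_test = train_test_split(weibo_texts, test_size=0.2, random_state=0)  # text sets N / P
print(len(X_train), len(Q_test))  # 1920 480
```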
(1-3) processing the data set
The training set X and the test set Q consist of voice data; noise is removed by filtering the voice data. The speech feature parameters MFCC (Mel-Frequency Cepstral Coefficients) are then extracted from the training set X. MFCC is a short-time power-spectrum representation of sound that transforms the audio signal through a series of steps simulating human auditory perception. The process is as follows: pre-emphasis, framing, and windowing are applied to the speech; a Fast Fourier Transform (FFT) yields the spectrum of each short-time analysis window; the linear spectrum is converted through Mel-scale filtering into a Mel spectrum reflecting human auditory characteristics; and finally cepstral analysis of the Mel spectrum yields the Mel-frequency cepstral coefficients (MFCC), which serve as the features of that frame of speech.
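The patent does not name a library for this step; as a minimal sketch under the assumption that librosa is used (it implements the same pre-emphasis/framing/FFT/Mel/cepstrum chain), with a hypothetical file name:

```python
import numpy as np
import librosa

# Minimal MFCC-extraction sketch; "utterance.wav" is a hypothetical CASIA recording.
y, sr = librosa.load("utterance.wav", sr=16000)
y = np.append(y[0], y[1:] - 0.97 * y[:-1])              # pre-emphasis
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,       # framing/windowing, FFT, Mel filtering,
                            n_fft=400, hop_length=160)   # and cepstral analysis happen inside
print(mfcc.shape)                                        # (13, number_of_frames)
```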
The NLPCC2013 Chinese microblog data set is a text corpus, while the input of the text emotion classification model M is word vectors, so the corpus is converted into word vectors. The jieba word-segmentation tool is used to segment the corpus and remove stop words; word vectors are then trained with Google's word2vec, a toolkit released by Google in 2013 for obtaining word vectors. After training, a mapping from words to word vectors is established, and the text is word-vector encoded through the Embedding function of keras, yielding the training set N and the test set P.
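A minimal sketch of this text pipeline, assuming jieba for segmentation, gensim (≥ 4.0) for the word2vec training, and keras utilities for the index/Embedding step; the two sentences and the stop-word set are hypothetical:

```python
import numpy as np
import jieba
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["今天心情非常好", "这件事让人很生气"]     # hypothetical microblog sentences
stopwords = {"的", "了"}                            # hypothetical stop-word list
tokens = [[w for w in jieba.cut(s) if w not in stopwords] for s in corpus]

w2v = Word2Vec(tokens, vector_size=100, min_count=1)          # train word vectors
tok = Tokenizer()
tok.fit_on_texts([" ".join(t) for t in tokens])               # word -> index mapping
seqs = pad_sequences(tok.texts_to_sequences([" ".join(t) for t in tokens]), maxlen=50)

# Embedding matrix that a keras Embedding layer can be initialised with.
emb_matrix = np.zeros((len(tok.word_index) + 1, 100))
for word, idx in tok.word_index.items():
    if word in w2v.wv:
        emb_matrix[idx] = w2v.wv[word]
print(seqs.shape, emb_matrix.shape)
```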
Step 2, training the model
(2-1) The speech recognition model Y is trained with the speech training set X divided in step 1 as input.
The speech recognition model Y adopts Baidu's open-source DeepSpeech 2 model, which is based on Baidu's PaddlePaddle framework and is powerful yet simple to use. The model structure is shown in fig. 5 and consists of three parts. The first part is 2D invariant convolution. The second part is the gated recurrent unit (GRU), a variant of the conventional RNN that, like the LSTM, can effectively capture semantic associations across long sequences and alleviate gradient vanishing or explosion, while being simpler in structure and computation than the LSTM. The last part is a fully connected layer, which shapes the output; the resulting logits are used to compute the CTC loss function and for decoding. Batch normalization is applied to the input of each layer in the model to reduce the distribution gap between input and output, increase the generalization capability of the model, and accelerate training. The input of the model is the power-normalized spectrogram of an audio clip, and the output is simplified Chinese text.
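The actual model is Baidu's PaddlePaddle DeepSpeech 2 implementation; purely as an illustrative sketch of the 2D-convolution → GRU → fully-connected structure described above (layer sizes, vocabulary size, and the fixed 200-frame input are assumptions), a Keras version might look like this:

```python
from tensorflow.keras import layers, models

FRAMES, FREQ_BINS = 200, 161   # assumed spectrogram size (time frames x frequency bins)
NUM_CHARS = 4001               # assumed simplified-Chinese character set + CTC blank

model = models.Sequential([
    layers.Input(shape=(FRAMES, FREQ_BINS, 1)),                  # power-normalized spectrogram
    layers.Conv2D(32, (11, 41), strides=(2, 2), padding="same"),
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2D(32, (11, 21), strides=(1, 2), padding="same"),
    layers.BatchNormalization(), layers.ReLU(),
    # After the strided convolutions: 200 -> 100 time frames, 161 -> 81 -> 41 frequency bins.
    layers.Reshape((100, 41 * 32)),
    layers.Bidirectional(layers.GRU(256, return_sequences=True)),
    layers.Bidirectional(layers.GRU(256, return_sequences=True)),
    layers.Dense(NUM_CHARS, activation="softmax"),                # per-frame character probabilities for CTC
])
model.summary()
```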
The loss function employs the Connectionist Temporal Classification (CTC) algorithm. Its main advantage is that unaligned data can be aligned automatically, so it is mainly used for training on serialized data that has not been aligned in advance, such as speech recognition. The CTC loss CTCLoss can be interpreted as the negative log of the total probability, summed over all valid alignments, of outputting the correct label sequence given a sample. CTCLoss is defined as follows:
CTCLoss(f(x), T) = -log P(T | f(x))
where y = f(x) is the probability distribution over output characters and T is the corresponding text.
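For illustration (the patent's training actually runs inside the DeepSpeech 2 / PaddlePaddle pipeline), CTCLoss can be evaluated with Keras' built-in ctc_batch_cost; the shapes and label values below are made up:

```python
import tensorflow as tf

batch, timesteps, num_chars = 2, 50, 30                    # illustrative sizes
y_pred = tf.random.uniform((batch, timesteps, num_chars))  # stands in for f(x)
y_pred = y_pred / tf.reduce_sum(y_pred, axis=-1, keepdims=True)     # per-frame character distribution

y_true = tf.constant([[3, 7, 7, 1], [2, 2, 5, 0]], dtype=tf.int32)  # target texts T (index-encoded)
input_len = tf.constant([[timesteps], [timesteps]])
label_len = tf.constant([[4], [3]])                        # second sequence uses only its first 3 labels

# CTCLoss(f(x), T) = -log P(T | f(x)), summed over all valid alignments.
loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)
print(loss.numpy())                                        # one loss value per utterance
```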
(2-2) The text emotion classification model M is trained with the training set N divided in step 1 as input.
The structure of the text emotion classification model M is shown in fig. 6. The two-dimensional convolution uses a 3 × 3 kernel with a stride of 1. To prevent overfitting and improve the convergence speed of training, batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is fed into the activation function, which is ReLU. After features are extracted by the two-dimensional separable convolution, they are input into a GRU layer, then into a fully connected layer, and classification is performed with a softmax classifier. The classification results are set to four categories, i.e., the four emotions of pleasure, calm, anxiety, and impatience. The loss function of the model takes the cross-entropy form, with the formula:
loss = -∑_{c=1}^{M} y_c·log(p_c)
where M is the number of classes; y_c is an indicator variable (0 or 1) equal to 1 if class c is the same as the class of the sample and 0 otherwise; and p_c is the predicted probability that the observed sample belongs to class c.
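A much-simplified Keras sketch of the structure just described (3 × 3 separable convolution with stride 1, batch normalization, ReLU, max pooling, a GRU layer, and a softmax over the four emotions); the sentence length, embedding size, and channel/unit counts are assumptions, and categorical cross-entropy implements the loss above:

```python
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM, NUM_CLASSES = 50, 100, 4   # assumed sentence length / word-vector size; 4 emotions

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, EMB_DIM, 1)),                      # word-vector "image" of a sentence
    layers.SeparableConv2D(64, (3, 3), strides=1, padding="same"),  # 3x3 separable convolution, stride 1
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),
    layers.Reshape((MAX_LEN // 2, (EMB_DIM // 2) * 64)),            # back to a sequence for the GRU
    layers.GRU(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),                # pleasure / calm / anxiety / impatience
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```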
Step 3, testing the model
The test set Q and the test set P divided in step 1 are input into the trained voice recognition model Y and the trained text emotion classification model M respectively, and the recognition and classification accuracies of the models are obtained from their outputs.
Step 4, pupil size calculation
The pupil size calculation requires some digital image processing techniques; OpenCV is adopted to process the eye images. OpenCV is a cross-platform computer vision and machine learning software library released under the BSD license (open source), and the vision processing algorithms it provides are very rich.
As shown in fig. 4, the pupil size detection process first converts the video into frame-by-frame pictures using OpenCV, then extracts the face in each picture with the dlib tool, detects the key points of the human eyes, and segments the eye regions. Median filtering is applied to the eye picture with a 7 × 7 filtering template to remove normally distributed noise, after which the picture is thresholded: thresholding binarizes the picture against a chosen threshold, finally producing a high-contrast black-and-white picture; a one-dimensional maximum-entropy threshold segmentation method is adopted for this step. Edge detection is then performed on the picture; its purpose is to identify points with obvious brightness changes in the digital image. The Prewitt operator is adopted for edge detection; it uses the gray-level differences between a pixel and its upper, lower, left, and right neighbors to reach an extremum at edges. Freeman chain-code coding describes a curve or boundary using the coordinates of the curve's starting point and direction codes of the boundary points, and is often used to represent curves and region boundaries in image processing, computer graphics, pattern recognition, and related fields. An 8-connected chain code is adopted, which adds four diagonal neighbors (upper-right, lower-right, lower-left, and upper-left of the center point) to the four axis-aligned ones; the 8-connected chain code is consistent with the actual pixel neighborhood and can accurately describe the information of the center pixel and its neighbors. After Freeman chain-code coding extracts the edges in the image, the pupil boundary can be identified from the edge characteristics. Once the pupil boundary is identified, the pupil is fitted and its size calculated using the Hough circle fitting method. The principle of the standard Hough transform is to convert the image space into a parameter space, detect the circle center, and derive the radius of the circle from the center, which completes the detection of the pupil size.
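A rough per-frame sketch of this pipeline, assuming OpenCV and dlib with the standard 68-point landmark model; the video path, the threshold choice (Otsu is used here in place of the one-dimensional maximum-entropy method), the Prewitt kernels applied via filter2D, and the Hough-circle parameters are all assumptions rather than the patent's exact values:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed landmark model file

def pupil_diameter_px(frame):
    """Rough sketch of the per-frame pupil measurement described above (diameter in pixels)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    pts = predictor(gray, faces[0])
    xs = [pts.part(i).x for i in range(36, 42)]           # landmarks 36-41 outline one eye
    ys = [pts.part(i).y for i in range(36, 42)]
    y0, y1 = max(min(ys) - 5, 0), max(ys) + 5              # small margin around the eye
    eye = gray[y0:y1, min(xs):max(xs)]

    eye = cv2.medianBlur(eye, 7)                           # 7x7 median filter
    _, bw = cv2.threshold(eye, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], np.float32)   # Prewitt kernels
    grad = cv2.filter2D(bw, cv2.CV_32F, kx) + cv2.filter2D(bw, cv2.CV_32F, kx.T)
    edges = cv2.convertScaleAbs(grad)

    circles = cv2.HoughCircles(edges, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                               param1=50, param2=10, minRadius=2, maxRadius=30)
    return None if circles is None else float(2 * circles[0, 0, 2])

cap = cv2.VideoCapture("suspect_video.mp4")                # hypothetical Deepfake video
ok, frame = cap.read()
print(pupil_diameter_px(frame) if ok else "no frame read")
```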
Step 5, judging whether the video is true or false
Since the method targets detection of Deepfake videos, the detection sample is a video. The audio is extracted from the video with moviepy, processed, and input into the voice recognition model Y, which outputs the corresponding text; the output text is processed and input into the text emotion classification model M to obtain the emotion corresponding to the text. The pupil size of the human eye in the video is then detected, and whether the pupil size matches the corresponding emotion is checked against Table 1; if they do not match, the video is judged to be a false video.
TABLE 1
Emotional state    Pupil size (unit: mm)
Pleasure           5.34 ± 1.41
Calm               3.50 ± 1.25
Anxiety            3.17 ± 0.86
Impatience         4.91 ± 1.81
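For illustration of the final matching step, Table 1 can be encoded as mean ± deviation ranges; here `predicted_emotion` and `pupil_mm` are assumed to come from model M and the pupil-measurement step respectively, and the pixel-to-millimetre conversion is outside this sketch:

```python
# Table 1 as (mean, deviation) in millimetres.
PUPIL_RANGES = {
    "pleasure":   (5.34, 1.41),
    "calm":       (3.50, 1.25),
    "anxiety":    (3.17, 0.86),
    "impatience": (4.91, 1.81),
}

def is_fake(predicted_emotion: str, pupil_mm: float) -> bool:
    """Judge the video false when the measured pupil size falls outside the
    mean +/- deviation range for the emotion predicted by model M."""
    mean, dev = PUPIL_RANGES[predicted_emotion]
    return not (mean - dev <= pupil_mm <= mean + dev)

# Example: model M predicts "pleasure" but the pupil measures only 3.1 mm -> judged a false video.
print(is_fake("pleasure", 3.1))   # True
```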
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A Deepfake detection method based on emotion recognition and pupil size calculation is characterized by comprising the following steps:
(1) selecting a corpus of voice data, dividing the voice data into a voice training set X and a voice test set Q, then performing voice data processing, and training and testing a voice recognition model Y;
(2) selecting a corpus of text data, dividing the text data into a text training set N and a text test set P, then performing text data processing, and training and testing a text emotion classification model M;
(3) extracting the audio from the Deepfake video to be detected, processing the audio data and inputting it into the voice recognition model Y, which outputs the corresponding text; processing the output text and inputting it into the text emotion classification model M to obtain the emotion corresponding to the text;
(4) converting a Deepfake video to be detected into a picture frame, extracting a face part in the picture frame, and detecting the size of a pupil of a human eye;
(5) matching the detected pupil size of the human eye against the emotion obtained from the text emotion classification model M; if they do not match, the video is judged to be a false video; if they match, the video is judged to be genuine.
2. The method for Deepfake detection based on emotion recognition and pupil size calculation according to claim 1, wherein in step (1), the corpus of speech data adopts the CASIA Chinese emotion corpus, and the speech data processing includes:
filtering the voice training set X and the voice test set Q to remove noise, and then extracting the speech feature parameters MFCC from the voice training set X.
3. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 2, wherein in step (1), said speech recognition model Y is the Baidu open-source DeepSpeech 2 model, and the training loss function is the Connectionist Temporal Classification (CTC) algorithm, whose loss CTCLoss is defined as follows:
CTCLoss(f(x), T) = -log P(T | f(x))
where y = f(x) is the probability distribution over output characters and T is the corresponding text.
4. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 1, wherein in step (2), the corpus of text data adopts the NLPCC2013 Chinese microblog data set, and the text data processing comprises:
for text data in a voice training set X and a voice testing set Q, a corpus is converted into word vectors, word vectors are trained by adopting word2vec of Google, mapping from the words to the word vectors is established after the word vectors are trained, and word vector coding is carried out on the text through an Embedding function of keras.
5. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 4, wherein in step (2), the text emotion classification model M adopts a convolutional network with a 3 × 3 convolution kernel and a stride of 1; batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is input into the activation function, which is ReLU; after the features are extracted by the two-dimensional separable convolution, they are input into a GRU layer, then into a fully connected layer, and finally classification is performed with a softmax classifier.
6. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 5, wherein the loss function used to train the text emotion classification model M takes the cross-entropy form, with the formula:
loss = -∑_{c=1}^{M} y_c·log(p_c)
where M is the number of classes; y_c is an indicator variable equal to 1 if class c is the same as the class of the sample and 0 otherwise; and p_c is the predicted probability that the observed sample belongs to class c.
7. The method for detecting Deepfake based on emotion recognition and pupil size calculation as claimed in claim 1, wherein the specific process of step (4) is:
converting the Deepfake video to be detected into a frame-by-frame picture by using OpenCV;
extracting the human face in the picture by using a dlib tool, detecting key points of human eyes, and segmenting the human eyes;
median filtering is carried out on the human-eye picture, using a 7 × 7 filtering template to remove normally distributed noise; thresholding is applied to the image to obtain a high-contrast black-and-white picture; edge detection is then carried out on the picture;
performing Freeman chain-code coding on the boundary information obtained from edge detection to extract the edges in the image, and identifying the pupil boundary according to the edge characteristics;
after the pupil boundary is identified, the size of the pupil is calculated.
8. The method of claim 7, wherein the Freeman chain code is an 8-connected chain code.
9. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 7, wherein fitting and size calculation of the pupil are performed by using a Hough circle fitting method, specifically: the image space is converted into a parameter space, then the center of a circle is detected, and the radius of the circle is deduced from the center of the circle, so that the detection of the size of the pupil is completed.
CN202011532434.8A 2020-12-22 2020-12-22 Deepfake detection method based on emotion recognition and pupil size calculation Active CN112699236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011532434.8A CN112699236B (en) 2020-12-22 2020-12-22 Deepfake detection method based on emotion recognition and pupil size calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011532434.8A CN112699236B (en) 2020-12-22 2020-12-22 Deepfake detection method based on emotion recognition and pupil size calculation

Publications (2)

Publication Number Publication Date
CN112699236A true CN112699236A (en) 2021-04-23
CN112699236B CN112699236B (en) 2022-07-01

Family

ID=75510687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011532434.8A Active CN112699236B (en) 2020-12-22 2020-12-22 Deepfake detection method based on emotion recognition and pupil size calculation

Country Status (1)

Country Link
CN (1) CN112699236B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117059131A (en) * 2023-10-13 2023-11-14 南京龙垣信息科技有限公司 False audio detection method based on emotion recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190392700A1 (en) * 2016-11-14 2019-12-26 Instant Care, Inc. Methods of and devices for filtering out false alarms to the call centers using a non-gui based user interface for a user to input a control command
CN110969106A (en) * 2019-11-25 2020-04-07 东南大学 Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN111160286A (en) * 2019-12-31 2020-05-15 中国电子科技集团公司信息科学研究院 Video authenticity identification method
CN111738199A (en) * 2020-06-30 2020-10-02 中国工商银行股份有限公司 Image information verification method, image information verification device, image information verification computing device and medium
US20200349337A1 (en) * 2019-05-01 2020-11-05 Accenture Global Solutions Limited Emotion sensing artificial intelligence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190392700A1 (en) * 2016-11-14 2019-12-26 Instant Care, Inc. Methods of and devices for filtering out false alarms to the call centers using a non-gui based user interface for a user to input a control command
US20200349337A1 (en) * 2019-05-01 2020-11-05 Accenture Global Solutions Limited Emotion sensing artificial intelligence
CN110969106A (en) * 2019-11-25 2020-04-07 东南大学 Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN111160286A (en) * 2019-12-31 2020-05-15 中国电子科技集团公司信息科学研究院 Video authenticity identification method
CN111738199A (en) * 2020-06-30 2020-10-02 中国工商银行股份有限公司 Image information verification method, image information verification device, image information verification computing device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐天宇等 [Xu Tianyu et al.]: "基于特征点检测的面部情感分析应用与研究" [Application and Research of Facial Emotion Analysis Based on Feature Point Detection], 《电脑与信息技术》 [Computer and Information Technology], vol. 28, no. 3, 30 June 2020 (2020-06-30), pages 13-16 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117059131A (en) * 2023-10-13 2023-11-14 南京龙垣信息科技有限公司 False audio detection method based on emotion recognition
CN117059131B (en) * 2023-10-13 2024-03-29 南京龙垣信息科技有限公司 False audio detection method based on emotion recognition

Also Published As

Publication number Publication date
CN112699236B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
JP6617053B2 (en) Utterance semantic analysis program, apparatus and method for improving understanding of context meaning by emotion classification
WO2015158017A1 (en) Intelligent interaction and psychological comfort robot service system
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN112686048B (en) Emotion recognition method and device based on fusion of voice, semantics and facial expressions
Dhuheir et al. Emotion recognition for healthcare surveillance systems using neural networks: A survey
CN113380271B (en) Emotion recognition method, system, device and medium
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
CN109872714A (en) A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
Alghifari et al. On the use of voice activity detection in speech emotion recognition
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN115455136A (en) Intelligent digital human marketing interaction method and device, computer equipment and storage medium
CN112699236B (en) Deepfake detection method based on emotion recognition and pupil size calculation
CN117198338B (en) Interphone voiceprint recognition method and system based on artificial intelligence
Dweik et al. Read my lips: Artificial intelligence word-level arabic lipreading system
Gadhe et al. Emotion recognition from speech: a survey
CN114881668A (en) Multi-mode-based deception detection method
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
Zhu et al. Emotion Recognition of College Students Based on Audio and Video Image.
CN114120425A (en) Emotion recognition method and device, electronic equipment and storage medium
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN114999633A (en) Depression identification method and system based on multi-mode shared vector space
Aripin et al. Indonesian lip-reading recognition using long-term recurrent convolutional network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant