CN112699236A - Deepfake detection method based on emotion recognition and pupil size calculation - Google Patents
- Publication number
- CN112699236A (application number CN202011532434.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- emotion
- voice
- pupil
- deepfake
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/353—Information retrieval of unstructured textual data; clustering; classification into predefined classes
- G06F18/22—Pattern recognition; matching criteria, e.g. proximity measures
- G06T7/0002—Image analysis; inspection of images, e.g. flaw detection
- G06V20/40—Scenes; scene-specific elements in video content
- G06V40/161—Human faces; detection; localisation; normalisation
- G06V40/18—Eye characteristics, e.g. of the iris
- G10L15/063—Speech recognition; training of speech recognition systems
- G10L15/16—Speech classification or search using artificial neural networks
- G10L25/24—Speech or voice analysis; extracted parameters being the cepstrum
- G10L25/63—Speech or voice analysis for estimating an emotional state
- G06T2207/10016—Image acquisition modality: video; image sequence
- G06T2207/30196—Subject of image: human being; person
- G06T2207/30201—Subject of image: face
- G10L2015/0631—Creating reference templates; clustering
Abstract
The invention discloses a Deepfake detection method based on emotion recognition and pupil size calculation, comprising the following steps: (1) dividing the voice data into a training set X and a test set Q, performing data processing, and training and testing a speech recognition model Y; (2) dividing the text data into a training set N and a test set P, performing data processing, and training and testing a text emotion classification model M; (3) extracting the audio from the Deepfake video to be detected, inputting it into the speech recognition model Y, and inputting the output text into the text emotion classification model M to obtain the emotion corresponding to the text; (4) converting the Deepfake video to be detected into picture frames and detecting the pupil size of the human eye; (5) matching the detected pupil size against the emotion obtained by the text emotion classification model M; if they do not match, the video is judged to be a fake video. The method can effectively detect fake videos generated by different methods and has strong generalization capability.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a Deepfake detection method based on emotion recognition and pupil size calculation.
Background
Speech recognition technology enables a computer to understand what a person says, realizing spoken communication between human and machine and outputting the spoken words in text form. In recent years, speech recognition has advanced significantly and has begun to move from the laboratory into daily life, such as the voice assistants and speech translation in smartphones. Common approaches in speech recognition include stochastic modeling, probabilistic parsing, linguistics- and acoustics-based methods, and artificial neural networks, among which stochastic modeling is the most commonly used.
For example, Chinese patent publication No. CN106792140A discloses a broadcast television advertisement monitoring system based on voice recognition. It builds models of the sample voice and of the characteristic values of the voice to be recognized through a voice recognition modeling module, ensuring accurate matching between the features to be recognized and the template features; the matched sound is quantized through a sound matching module to improve matching accuracy. The recognition methods adopted by the voice recognition modeling module include the template matching method and the stochastic model method.
Text sentiment analysis processes text carrying subjective emotional color to obtain its sentiment attribution. The Internet contains a large number of user comments on events, people, products, and so on; these comments carry the users' sentiment tendencies, and text sentiment analysis can reveal public opinion on such subjects. By the granularity of the processed text, sentiment analysis can be divided into three research levels: word level, sentence level, and document level. The invention uses sentence-level text sentiment analysis.
The pupil size of a normal person is related to the emotional state. Pupil dilation and contraction are controlled by smooth muscle, which is in turn controlled by the autonomic nervous system and cannot be changed by conscious will: a person can control his own behavior, language, and actions, but has no way to control his own pupils, especially their slight changes. Psychological studies have shown that a person's pupil size reflects the current emotional state: the pupil can dilate to 4 to 5 times its size when one feels happy or excited, and contracts involuntarily when one feels angry or disgusted.
At present, with the advent of Deepfake technology, people have difficulty distinguishing some fake videos or pictures with the naked eye, and fake pictures or videos with great social impact circulate on the network, for example, swapping the faces of public figures so that they appear to spread false statements, or maliciously defaming others. Detecting these fake pictures or videos is therefore very important. Current Deepfake technology still has defects: some facial details are not forged well enough, such as changes in pupil size and the scale of skin pores.
Disclosure of Invention
The invention provides a Deepfake detection method based on emotion recognition and pupil size calculation, which addresses the problems that existing Deepfake detection techniques cover application scenarios incompletely, often overfit to a particular Deepfake method, and lack generalization capability.
A Deepfake detection method based on emotion recognition and pupil size calculation comprises the following steps:
(1) selecting a corpus of voice data, dividing the voice data into a voice training set X and a voice test set Q, then performing voice data processing, and training and testing a speech recognition model Y;
(2) selecting a corpus of text data, dividing the text data into a text training set N and a text test set P, then performing text data processing, and training and testing a text emotion classification model M;
(3) extracting the audio from the Deepfake video to be detected, processing the audio data, and inputting the processed audio into the speech recognition model Y, which outputs the corresponding text; processing the output text and inputting it into the text emotion classification model M to obtain the emotion corresponding to the text;
(4) converting a Deepfake video to be detected into a picture frame, extracting a face part in the picture frame, and detecting the size of a pupil of a human eye;
(5) matching the detected pupil size of the human eye against the emotion obtained by the text emotion classification model M; if they do not match, the video is judged to be a fake video; if they match, the video is judged to be real.
In step (1), the corpus of speech data adopts the CASIA Chinese emotion corpus, and the speech data processing comprises:
filtering the voice training set X and the voice test set Q to remove noise, then extracting the speech feature parameters MFCC from the training set X.
The speech recognition model Y adopts Baidu's open-source DeepSpeech2 model, the training loss function adopts the Connectionist Temporal Classification (CTC) algorithm, and CTCLoss is defined as follows:
CTCLoss(f(x),T)=-logP(T|f(x))
where y = f(x) is the probability distribution over output characters, and T is the corresponding text.
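For small alphabets and short clips, the CTC loss defined above can be checked by brute force: enumerate every frame-level path, collapse repeated symbols and blanks, and sum the probabilities of the paths that yield the target text. The sketch below is purely illustrative (it is not the DeepSpeech2 implementation, and treating symbol 0 as the blank is an assumption):

```python
import itertools
import math

def collapse(path, blank=0):
    # Standard CTC collapse: merge repeated symbols, then drop blanks.
    out, prev = [], None
    for s in path:
        if s != prev:
            out.append(s)
        prev = s
    return tuple(s for s in out if s != blank)

def ctc_loss_bruteforce(probs, target, blank=0):
    """probs: per-timestep distributions over symbols (index 0 = blank).
    Returns CTCLoss(f(x), T) = -log P(T | f(x)) by summing all alignments."""
    n_steps, n_sym = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_sym), repeat=n_steps):
        if collapse(path, blank) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return -math.log(total)
```

With two timesteps and uniform probabilities over {blank, 'a'}, three of the four paths collapse to "a", so the loss is -log(0.75).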
In step (2), the corpus of text data adopts the NLPCC2013 Chinese microblog data set, and the text data processing comprises the following steps:
for the text data in the text training set N and the text test set P, the corpus is converted into word vectors: word vectors are trained with Google's word2vec, a mapping from words to word vectors is established after training, and the text is word-vector encoded through Keras' Embedding function.
The text emotion classification model M adopts a convolutional network with 3×3 convolution kernels and stride 1. Batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is fed into a ReLU activation function. After features are extracted by the two-dimensional separable convolutions, they are input into a GRU layer, then into a fully connected layer, and classified with a softmax classifier;
the loss function loss of model training adopts the cross-entropy form:

loss = -∑_{c=1}^{M} y_c·log(p_c)

where M is the number of classes, y_c is an indicator variable that equals 1 if class c is the true class of the sample and 0 otherwise, and p_c is the predicted probability that the observed sample belongs to class c.
The specific process of the step (4) is as follows:
converting the Deepfake video to be detected into frame-by-frame pictures using OpenCV;
extracting the human face in the picture by using a dlib tool, detecting key points of human eyes, and segmenting the human eyes;
median filtering is carried out on the human eye picture with a 7×7 filter template to remove normally distributed noise; threshold processing is applied to the image to obtain a binarized black-and-white picture; then edge detection is performed on the picture;
performing Freeman chain code coding on the boundary information obtained by edge detection to extract the edges in the image, and identifying the pupil boundary according to the edge characteristics;
after the pupil boundary is identified, the size of the pupil is calculated.
Furthermore, the Freeman chain code adopts an 8-connected chain code.
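An 8-connected Freeman chain code can be sketched in a few lines: each step between consecutive boundary pixels is encoded as one of eight direction codes. The particular numbering below (0 = right, then counter-clockwise in image coordinates where y grows downward) is an assumed convention, since chain-code numbering varies between texts:

```python
# 8-connected Freeman directions: (dx, dy) -> code, 0 = right (east),
# proceeding counter-clockwise; y increases downward as in image coordinates.
DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
        (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def freeman_chain(points):
    """Encode an ordered list of 8-connected boundary pixels as direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes
```

For a closed boundary, the last point repeats the first, and the resulting code sequence describes the full contour from the starting coordinate, as the description says.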
Further, pupil fitting and size calculation are performed with the Hough circle fitting method: the image space is converted into a parameter space, the circle center is detected, and the circle's radius is deduced from the center, completing the detection of the pupil size.
The invention converts the speech of the speaker into text, classifies the emotion of the text to obtain the speaker's emotional state, and compares it with the pupil size to judge whether the video is real or fake.
The invention has the following beneficial effects:
the method is used for detecting the physiological characteristics of human eyes, can better detect the false videos generated by different Deepfake methods, and has strong generalization capability and wide application range.
Drawings
FIG. 1 is an overall method flow diagram in an embodiment of the invention;
FIG. 2 is a flow diagram of speech recognition in an embodiment of the present invention;
FIG. 3 is a flow diagram of textual emotion analysis in an embodiment of the present invention;
FIG. 4 is a flow chart of pupil size calculation according to an embodiment of the present invention;
FIG. 5 is a diagram of a speech recognition model Y in accordance with an embodiment of the present invention;
FIG. 6 is a diagram of a text emotion classification model M according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a Deepfake detection method based on emotion recognition and pupil size calculation includes:
step 1, data processing
(1-1) data set
The CASIA Chinese emotion corpus is used as the training data set of the speech recognition model Y. The corpus was recorded by the Institute of Automation, Chinese Academy of Sciences, and contains recordings by four professional speakers in six emotions (angry, happy, fear, sad, surprise, and neutral), 9600 utterances in total. Among them, 300 sentences share the same text and 100 sentences have different texts. "Same text" means that the four professional speakers read the same sentences with the 6 different emotions, for a total of 300 × 4 × 6 = 7200 utterances. "Different texts" means that the emotion of each sentence can be seen from its literal meaning, for a total of 100 × 4 × 6 = 2400 utterances. The speech recognition process is shown in fig. 2.
The NLPCC2013 Chinese microblog data set is used as the training data set of the text emotion classification model M. This corpus targets recognition of the emotion expressed by an entire microblog post; it is not a simple positive/negative classification but involves multiple fine-grained emotion classes (such as sadness, worry, happiness, excitement, and the like), making it a fine-grained emotion classification problem. The text emotion classification process is shown in fig. 3.
(1-2) dividing training set and test set
Training the speech recognition model Y only requires the different-text utterances in the CASIA Chinese emotion corpus, 2400 sentences in total. The 100 different texts are split at a ratio of 4:1 into a training set X and a test set Q: the training set X contains 1920 utterances of 80 different texts, and the test set Q contains 480 utterances of 20 different texts. The training set X is used to train the speech recognition model Y, and the test set Q is used to test its recognition accuracy.
The NLPCC2013 Chinese microblog data set is likewise divided at a 4:1 ratio into a training set N and a test set P.
(1-3) processing the data set
The training set X and the test set Q are voice data; noise is removed by filtering. The speech feature parameters MFCC (Mel-Frequency Cepstral Coefficients) are then extracted from the training set X. MFCC is a short-term power-spectrum representation of sound that transforms the audio signal through a series of steps simulating human auditory perception. The process is: pre-emphasize, frame, and window the speech; obtain the spectrum of each short-time analysis window by the Fast Fourier Transform (FFT); convert the linear spectrum into a Mel spectrum reflecting human auditory characteristics via the Mel filterbank; finally, perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the features of that speech frame.
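The MFCC pipeline just described (pre-emphasis, framing and windowing, FFT power spectrum, Mel filterbank, cepstral analysis) can be sketched in pure NumPy. All parameter values here (16 kHz sample rate, 25 ms frames with 10 ms hop, 26 filters, 13 coefficients, 0.97 pre-emphasis) are common defaults assumed for illustration, not values specified by the patent:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # 1) pre-emphasis
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing + Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 3) power spectrum via FFT
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 4) triangular Mel filterbank, linearly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c): fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r): fbank[i - 1, k] = (r - k) / max(r - c, 1)
    feat = np.log(power @ fbank.T + 1e-10)
    # 5) cepstral analysis: DCT-II of the log Mel energies, keep n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return feat @ dct.T          # (n_frames, n_ceps)
```

One second of 16 kHz audio yields 98 frames of 13 coefficients each; libraries such as librosa or python_speech_features would normally be used in practice.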
The NLPCC2013 Chinese microblog data set is a text corpus, while the input of the text emotion classification model M is word vectors, so the corpus must be converted into word vectors. The corpus is segmented with the jieba word-segmentation tool and stop words are removed; word vectors are then trained with Google's word2vec, a toolkit released by Google in 2013 for obtaining word vectors. After training, a mapping from words to word vectors is established, and the text is word-vector encoded through Keras' Embedding function, yielding the training set N and the test set P.
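The word-to-vector mapping and Embedding-style lookup can be illustrated with a toy example. The vocabulary, embedding dimension, and random embedding matrix below are illustrative stand-ins (word2vec would learn the actual vectors, and Keras' Embedding layer performs the same index-to-row lookup):

```python
import numpy as np

# Toy vocabulary -> index mapping; index 0 doubles as padding/unknown.
vocab = {"<pad>": 0, "今天": 1, "很": 2, "开心": 3}
dim = 8
rng = np.random.default_rng(0)
# Embedding matrix: row i is the word vector for vocabulary index i.
E = rng.normal(size=(len(vocab), dim))

def encode(tokens, maxlen=6):
    """Map segmented tokens to indices (unknown -> 0), pad to maxlen,
    and look up their word vectors, like an Embedding layer would."""
    idx = [vocab.get(t, 0) for t in tokens][:maxlen]
    idx += [0] * (maxlen - len(idx))
    return E[np.array(idx)]          # shape (maxlen, dim)
```

In the real pipeline the tokens come from jieba segmentation and E from word2vec training on the microblog corpus.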
Step 2, training the model
And (2-1) training the speech recognition model Y by taking the speech training set X divided in the step 1 as input.
The speech recognition model Y adopts Baidu's open-source DeepSpeech2 model, which is based on Baidu's PaddlePaddle framework and is powerful yet simple to use. The structure of the model is shown in fig. 5 and consists of three parts. The first part is 2D invariant convolution. The second part is the Gated Recurrent Unit (GRU), a variant of the conventional RNN that, like the LSTM, can effectively capture semantic associations across long sequences and alleviate gradient vanishing or explosion, while being simpler in structure and computation than the LSTM. The last part is a fully connected layer used to shape the output; the resulting logits are used to compute the CTC loss function and to decode. Batch normalization is applied to each layer's input in the model to reduce the distribution gap between input and output, increase the model's generalization capability, and accelerate training. The input of the model is the spectrogram of a power-normalized audio clip, and the output is simplified Chinese text.
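The GRU part of the model can be made concrete with a single pure-NumPy GRU step. The gate equations are the standard GRU formulation; the weight shapes and initialization here are illustrative assumptions, not DeepSpeech2's actual parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """One timestep of a GRU: update gate z, reset gate r, candidate state."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        w = lambda *shape: rng.normal(scale=0.1, size=shape)
        self.Wz, self.Uz, self.bz = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)
        self.Wr, self.Ur, self.br = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)
        self.Wh, self.Uh, self.bh = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)          # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
        return (1 - z) * h + z * h_tilde                          # new hidden state
```

The two gates are what let the GRU carry information across long sequences with fewer parameters than an LSTM, as noted in the text.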
The loss function employs the Connectionist Temporal Classification (CTC) algorithm. Its main advantage is that unaligned data can be aligned automatically, so it is mainly used for training on serialized data without prior alignment, such as speech. The CTC loss CTCLoss can be interpreted as the negative log of the sum of the probabilities of outputting the correct label given the sample. CTCLoss is defined as follows:
CTCLoss(f(x),T)=-logP(T|f(x))
where y = f(x) is the probability distribution over output characters, and T is the corresponding text.
And (2-2) training the text emotion classification model M by taking the training set N divided in the step 1 as input.
The structure of the text emotion classification model M is shown in fig. 6. The two-dimensional convolution uses 3×3 kernels with stride 1. To prevent overfitting and improve convergence speed, batch normalization is added after the convolution layer and the max-pooling layer, and the normalized output is fed into a ReLU activation function. After features are extracted by the two-dimensional separable convolutions, they are input into a GRU layer, then into a fully connected layer, and classified with a softmax classifier. The classification results are set to four categories, namely the four emotions pleasure, calm, anxiety, and impatience. The loss function of the model takes the cross-entropy form:

loss = -∑_{c=1}^{M} y_c·log(p_c)

where M is the number of classes, y_c is an indicator variable that equals 1 if class c is the true class of the sample and 0 otherwise, and p_c is the predicted probability that the observed sample belongs to class c.
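The cross-entropy formula reduces to a one-liner for a one-hot label; this minimal sketch (with a made-up four-class probability vector standing in for the softmax output) shows it directly:

```python
import math

def cross_entropy(y_onehot, p):
    """loss = -sum_c y_c * log(p_c); with a one-hot y this is just
    -log of the probability assigned to the true class."""
    return -sum(y * math.log(pc) for y, pc in zip(y_onehot, p) if y)
```

If the true class is the second of the four emotion categories and the model assigns it probability 0.7, the loss is -log(0.7) ≈ 0.357; a perfect prediction gives a loss of 0.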
Step 3, testing the model
The test set Q and the test set P divided in step 1 are input into the trained speech recognition model Y and the trained text emotion classification model M respectively, and the classification accuracy of each model is obtained from its output on the test set.
Step 4, pupil size calculation
The pupil size calculation requires some digital image processing techniques; OpenCV is adopted to process the eye images. OpenCV is a cross-platform computer vision and machine learning software library released under the BSD (open source) license, and the vision processing algorithms it provides are very rich.
As shown in fig. 4, the pupil size detection process first converts the video into frame-by-frame pictures with OpenCV, then extracts the face in each picture with the dlib tool, detects the key points of the human eyes, and segments the eyes. Median filtering is then applied to the eye picture with a 7×7 filter template to remove normally distributed noise. Next, the picture is thresholded: a threshold is chosen to binarize the picture into black and white, using the one-dimensional maximum-entropy threshold segmentation method. Edge detection is then performed; its purpose is to identify points of obvious brightness change in the digital picture. The Prewitt operator is adopted, which uses the gray-level differences between a pixel and its upper, lower, left, and right neighbors to reach an extremum at edges. Freeman chain code coding is a method that describes a curve or boundary by the coordinates of the curve's starting point and the direction codes of its boundary points; it is often used to represent curves and region boundaries in image processing, computer graphics, and pattern recognition. The adopted 8-connected chain code considers the eight neighbors of the center point: up, upper right, right, lower right, down, lower left, left, and upper left. The 8-connected chain code is consistent with the actual pixel layout and can accurately describe the information of the center pixel and its neighbors.
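The Prewitt step can be sketched directly: two 3×3 kernels estimate the horizontal and vertical gray-level gradients, and the gradient magnitude peaks at edges. This is a plain-NumPy illustration (in practice OpenCV's filtering functions would be used on the real eye image):

```python
import numpy as np

# Prewitt kernels: horizontal (x) and vertical (y) gradient estimates.
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]])
PREWITT_Y = PREWITT_X.T

def prewitt_edges(img):
    """Gradient-magnitude map; border pixels are left at zero."""
    h, w = img.shape
    mag = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx = np.sum(patch * PREWITT_X)
            gy = np.sum(patch * PREWITT_Y)
            mag[i, j] = np.hypot(gx, gy)
    return mag
```

On a binarized eye image, as produced by the thresholding step, the magnitude is large exactly along the pupil boundary and zero in flat regions.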
After the edges in the image are extracted by Freeman chain code coding, the pupil boundary can be identified according to the edge characteristics. Once the pupil boundary is identified, pupil fitting and size calculation are performed with the Hough circle fitting method. The principle of the standard Hough transform is to convert the image space into a parameter space, then detect the circle center, and deduce the circle's radius from the center. This completes the detection of the pupil size.
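The image-space-to-parameter-space idea can be shown with a minimal Hough accumulator: each edge point votes for every center that would place it on a circle of the given radius, and the true center collects the most votes. This is a simplified, fixed-radius sketch (a real implementation, e.g. OpenCV's Hough circle detection, also searches over radii):

```python
import numpy as np

def hough_circle_center(edge_points, shape, radius):
    """Vote in (a, b) parameter space: each edge point (x, y) votes for all
    candidate centers at distance `radius` from it; return the argmax cell."""
    acc = np.zeros(shape, dtype=int)
    thetas = np.linspace(0, 2 * np.pi, 180, endpoint=False)
    for (x, y) in edge_points:
        a = np.round(x - radius * np.cos(thetas)).astype(int)
        b = np.round(y - radius * np.sin(thetas)).astype(int)
        ok = (a >= 0) & (a < shape[0]) & (b >= 0) & (b < shape[1])
        np.add.at(acc, (a[ok], b[ok]), 1)
    return np.unravel_index(np.argmax(acc), acc.shape)
```

Running the search over a small range of radii and taking the radius whose peak vote is highest recovers the pupil radius as well, which is the size measurement the method needs.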
Step 5, judging whether the video is real or fake
The method is aimed at detecting Deepfake videos, so the sample under test is a video. moviepy is used to extract the audio from the video; after processing, the audio is input into the voice recognition model Y, which outputs the corresponding text. The output text is processed and input into the text emotion classification model M to obtain the emotion corresponding to the text. The size of the pupils of the human eyes in the video is then detected and, with reference to Table 1, checked against the pupil size expected for that emotion; if the pupil size does not match the emotion, the video is judged to be a fake video.
TABLE 1

Emotional state | Pupil size (unit: mm)
---|---
Pleasure | 5.34 ± 1.41
Calm | 3.50 ± 1.25
Anxiety | 3.17 ± 0.86
Impatience | 4.91 ± 1.81
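The matching rule of step 5 can be sketched as a simple range check against Table 1. The ± values are read here as tolerance intervals, the lowercase state keys are simplified names ("calm" for the quiet state, "anxiety" for the anxious state), and `judge_video` is a hypothetical helper, not from the patent:

```python
# Table 1: expected pupil size (mm) per emotional state, as (mean, tolerance)
PUPIL_RANGES = {
    "pleasure":   (5.34, 1.41),
    "calm":       (3.50, 1.25),
    "anxiety":    (3.17, 0.86),
    "impatience": (4.91, 1.81),
}

def is_consistent(emotion, pupil_mm):
    """True if the measured pupil size falls inside the emotion's range."""
    mean, tol = PUPIL_RANGES[emotion]
    return mean - tol <= pupil_mm <= mean + tol

def judge_video(emotion, pupil_mm):
    """Step 5: a mismatch between text emotion and pupil size => fake."""
    return "real" if is_consistent(emotion, pupil_mm) else "fake"

print(judge_video("calm", 3.4))      # real: 3.4 lies within 3.50 +/- 1.25
print(judge_video("pleasure", 2.0))  # fake: 2.0 lies outside 5.34 +/- 1.41
```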
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.
Claims (9)
1. A Deepfake detection method based on emotion recognition and pupil size calculation is characterized by comprising the following steps:
(1) selecting a corpus of voice data, dividing the voice data into a voice training set X and a voice test set Q, then performing voice data processing, and training and testing a voice recognition model Y;
(2) selecting a corpus of text data, dividing the text data into a text training set N and a text test set P, then performing text data processing, and training and testing a text emotion classification model M;
(3) extracting audio from the Deepfake video to be detected, performing data processing on the audio and inputting it into the voice recognition model Y, which outputs the corresponding text; performing data processing on the output text and inputting it into the text emotion classification model M to obtain the emotion corresponding to the text;
(4) converting the Deepfake video to be detected into picture frames, extracting the face part in the picture frames, and detecting the size of the pupil of the human eye;
(5) matching the detected pupil size of the human eye against the emotion obtained by the text emotion classification model M; if they do not match, the video is judged to be a fake video; if they match, the video is judged to be real.
2. The method for Deepfake detection based on emotion recognition and pupil size calculation according to claim 1, wherein in step (1), the corpus of voice data adopts the CASIA Chinese emotion corpus, and the voice data processing comprises:
filtering the voice training set X and the voice test set Q to remove noise, and then extracting the MFCC speech feature parameters.
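The MFCC extraction of claim 2 would normally use a library such as librosa or python_speech_features; the numpy sketch below walks through the standard steps (framing, Hamming window, power spectrum, mel filterbank, log, DCT-II) under simplified, assumed parameters.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Simplified MFCC: frame -> Hamming window -> power spectrum
    -> mel filterbank -> log -> DCT-II. Returns (n_frames, n_mfcc)."""
    # frame the signal into overlapping windows
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop : i*hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the filterbank energies; keep the first n_mfcc
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = mfcc(sig)
print(feats.shape)  # (97, 13)
```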
3. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 2, wherein in step (1), the voice recognition model Y is the DeepSpeech2 model open-sourced by Baidu, and the loss function used for training is the connectionist temporal classification (CTC) loss, defined as follows:
CTCLoss(f(x), T) = -log P(T | f(x))
where y = f(x) is the probability distribution over the output characters, and T is the corresponding text.
4. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 1, wherein in step (2), the corpus of text data adopts the NLPCC2013 Chinese microblog data set, and the text data processing comprises:
converting the corpus of the text training set N and the text test set P into word vectors: the word vectors are trained with Google's word2vec, which after training establishes a mapping from words to word vectors, and the text is word-vector encoded through the Embedding function of keras.
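The word-to-vector mapping of this claim (word2vec training followed by a keras Embedding lookup) boils down to an index lookup into an embedding matrix. A minimal numpy stand-in, with an assumed toy vocabulary and random vectors in place of trained word2vec weights:

```python
import numpy as np

# toy vocabulary; in the patent the vectors come from word2vec training
vocab = {"<pad>": 0, "today": 1, "very": 2, "happy": 3}
dim = 8
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), dim))  # stand-in weights

def encode(tokens, max_len=6):
    """Map tokens to indices (0 for padding / unknown) and look up their
    vectors, mimicking keras' Embedding layer on a padded sequence."""
    ids = [vocab.get(t, 0) for t in tokens][:max_len]
    ids += [0] * (max_len - len(ids))
    return embedding_matrix[ids]  # shape (max_len, dim)

vecs = encode(["today", "very", "happy"])
print(vecs.shape)  # (6, 8)
```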
5. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 4, wherein in step (2), the text emotion classification model M adopts a convolutional network with a 3×3 convolution kernel and a stride of 1; batch normalization is added after the convolution layer and the max pooling layer, and the normalized results are fed into a ReLU activation function; after features are extracted by the two-dimensional separable convolutions, they are input into a GRU layer, then into a fully connected layer, and classification is finally performed by a softmax classifier.
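The GRU layer named in this claim updates a hidden state through update and reset gates. A single GRU cell step in numpy, with randomly initialised weights and one of the two standard gating conventions, purely to illustrate the shapes and the recurrence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step. W: (3, dim_h, dim_x), U: (3, dim_h, dim_h),
    b: (3, dim_h); rows ordered as update gate, reset gate, candidate."""
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1 - z) * h + z * h_tilde                     # new hidden state

dim_x, dim_h = 4, 5
rng = np.random.default_rng(1)
W = rng.normal(size=(3, dim_h, dim_x))
U = rng.normal(size=(3, dim_h, dim_h))
b = np.zeros((3, dim_h))
h = np.zeros(dim_h)
for x in rng.normal(size=(7, dim_x)):  # a length-7 feature sequence
    h = gru_step(x, h, W, U, b)
print(h.shape)  # (5,)
```

The final hidden state `h` is what would be passed on to the fully connected layer and softmax classifier described in the claim.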
6. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 5, wherein the loss function loss used to train the text emotion classification model M is the cross entropy, with the formula:
loss = -Σ_{c=1}^{M} y_c log(p_c)
where M is the number of classes, y_c is a binary indicator that equals 1 if class c is the same as the class of the sample and 0 otherwise, and p_c is the predicted probability that the observed sample belongs to class c.
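The cross-entropy formula of this claim, evaluated directly in numpy on assumed toy values:

```python
import numpy as np

def cross_entropy(y, p):
    """loss = -sum_c y_c * log(p_c), with y a one-hot class indicator
    and p the predicted class probabilities."""
    return -np.sum(y * np.log(p))

y = np.array([0.0, 1.0, 0.0])   # sample belongs to class 1
p = np.array([0.2, 0.7, 0.1])   # softmax output of the classifier
print(round(cross_entropy(y, p), 4))  # -log(0.7) ~ 0.3567
```

With a one-hot `y` the sum reduces to the negative log-probability assigned to the true class, which is exactly what the softmax classifier of claim 5 is trained to maximise.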
7. The method for detecting Deepfake based on emotion recognition and pupil size calculation as claimed in claim 1, wherein the specific process of step (4) is:
converting the Deepfake video to be detected into a frame-by-frame picture by using OpenCV;
extracting the human face in the picture by using a dlib tool, detecting key points of human eyes, and segmenting the human eyes;
performing median filtering on the human eye picture, using a 7×7 filtering template to remove normally distributed noise; performing threshold processing on the picture to obtain a high-contrast black-and-white picture; and then performing edge detection on the picture;
performing Freeman chain code coding on the boundary information obtained by the edge detection to extract the edges in the image, and identifying the pupil boundary from the edge features;
after the pupil boundary is identified, the size of the pupil is calculated.
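The edge-detection step of claim 7 uses, per the description, the Prewitt operator, which takes horizontal and vertical gray-level differences around each pixel. A numpy sketch on a synthetic step edge (the naive convolution is for illustration only):

```python
import numpy as np

# Prewitt kernels: gray-level differences of the neighbours around a pixel
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
PREWITT_Y = PREWITT_X.T

def convolve2d(img, kernel):
    """Naive 'valid' 2-D correlation, enough for this illustration."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def prewitt_magnitude(img):
    gx = convolve2d(img, PREWITT_X)  # horizontal gradient
    gy = convolve2d(img, PREWITT_Y)  # vertical gradient
    return np.hypot(gx, gy)

# synthetic image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 255.0
mag = prewitt_magnitude(img)
print(np.unravel_index(mag.argmax(), mag.shape)[1])  # column of the edge peak
```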
8. The method of claim 7, wherein the Freeman chain code is an 8-connected chain code.
9. The method for Deepfake detection based on emotion recognition and pupil size calculation as claimed in claim 7, wherein the fitting and size calculation of the pupil are performed using a Hough circle fitting method, specifically: the image space is converted into a parameter space, the circle centre is detected, and the circle radius is deduced from the centre, thereby completing the detection of the pupil size.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011532434.8A CN112699236B (en) | 2020-12-22 | 2020-12-22 | Deepfake detection method based on emotion recognition and pupil size calculation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112699236A true CN112699236A (en) | 2021-04-23 |
CN112699236B CN112699236B (en) | 2022-07-01 |
Family
ID=75510687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011532434.8A Active CN112699236B (en) | 2020-12-22 | 2020-12-22 | Deepfake detection method based on emotion recognition and pupil size calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112699236B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190392700A1 (en) * | 2016-11-14 | 2019-12-26 | Instant Care, Inc. | Methods of and devices for filtering out false alarms to the call centers using a non-gui based user interface for a user to input a control command |
CN110969106A (en) * | 2019-11-25 | 2020-04-07 | 东南大学 | Multi-mode lie detection method based on expression, voice and eye movement characteristics |
CN111160286A (en) * | 2019-12-31 | 2020-05-15 | 中国电子科技集团公司信息科学研究院 | Video authenticity identification method |
CN111738199A (en) * | 2020-06-30 | 2020-10-02 | 中国工商银行股份有限公司 | Image information verification method, image information verification device, image information verification computing device and medium |
US20200349337A1 (en) * | 2019-05-01 | 2020-11-05 | Accenture Global Solutions Limited | Emotion sensing artificial intelligence |
Non-Patent Citations (1)

Title |
---|
XU, Tianyu et al., "Application and Research of Facial Emotion Analysis Based on Feature Point Detection" (in Chinese), Computer and Information Technology (《电脑与信息技术》), vol. 28, no. 3, 30 June 2020, pages 13-16 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117059131A (en) * | 2023-10-13 | 2023-11-14 | 南京龙垣信息科技有限公司 | False audio detection method based on emotion recognition |
CN117059131B (en) * | 2023-10-13 | 2024-03-29 | 南京龙垣信息科技有限公司 | False audio detection method based on emotion recognition |
Also Published As
Publication number | Publication date |
---|---|
CN112699236B (en) | 2022-07-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||