CN111368764B - False video detection method based on computer vision and deep learning algorithm - Google Patents

False video detection method based on computer vision and deep learning algorithm

Info

Publication number
CN111368764B
CN111368764B (application CN202010158340.2A)
Authority
CN
China
Prior art keywords
model
false
face
video
feature extraction
Prior art date
Legal status
Active
Application number
CN202010158340.2A
Other languages
Chinese (zh)
Other versions
CN111368764A (en)
Inventor
姚一鸣 (Yao Yiming)
Current Assignee
Zero Rank Technology Shenzhen Co ltd
Original Assignee
Zero Rank Technology Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Zero Rank Technology Shenzhen Co ltd
Priority to CN202010158340.2A
Publication of CN111368764A
Application granted
Publication of CN111368764B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and deep learning, and discloses a false video detection method based on computer vision and deep learning algorithms. Three feature extraction models are trained in advance, namely a generative adversarial network (GAN) discrimination model, a front/side face comparison model and an expression-action classification model, and the three trained models are stored separately. The training set is then input into each of the three models for feature extraction, and the extracted features are linearly fitted, the fitting target being the ground-truth labels of the training set. Optimization is performed with an Adam optimizer, using the binary cross-entropy loss function as the criterion. The parameters whose final loss falls below 1e-6 are stored as the final model; the training set is then traversed again, and the optimal point on the ROC curve, selected by the "elbow rule", is taken as the classification threshold. In the algorithm application stage, videos are classified using the pre-computed model and threshold. The invention facilitates false video detection.

Description

False video detection method based on computer vision and deep learning algorithm
Technical Field
The invention belongs to the field of computer vision and deep learning, and particularly relates to a false video detection method based on computer vision and deep learning algorithms, which is used for quickly identifying falsely generated video files.
Background
Deep learning technology can be used to replace the face region in a video or picture, and large quantities of face-swapped false videos can thus be generated automatically. Because open-source deep learning algorithm models are used, such false videos are easy to produce; an ordinary household desktop computer is sufficient to generate them. The technology was originally created for the convenience of film and animation production, where it was intended to save the relevant staff some labor. However, some people have begun to use it to generate false videos that harm the interests of others, so many large companies (***, Baidu) have begun to invest resources in researching fast and accurate false video screening methods.
For example, Chinese patent publication No. CN110188706A discloses a generative-adversarial-network-based neural network training method and detection method for character expressions in video. The method first reads in a video v_real whose main content is one expression of a character; a feature function f is then computed through convolutional neural network processing; next, a candidate expression y_i is matched with the preceding feature function f through a deconvolutional neural network; finally, the computer-generated character video and v_real are passed through a 3D convolutional neural network to obtain their matching degree s_i. Different values of s_i are obtained by varying y_i, and the y_i corresponding to the largest s_i is taken as the decision output. That feature extraction model is mainly used for neural network training on character expressions, and cannot solve the problem of screening false videos.
Disclosure of Invention
The invention aims to provide a detection method for effectively screening false videos; by training effective feature extraction models, the detection method can improve the efficiency and accuracy of false video detection.
To solve the above problems, the false video detection method based on computer vision and deep learning algorithms specifically comprises the following steps:
S1: downloading an algorithm for false video generation in advance, and generating false videos and unmodified videos using its own data, one part of which is used as a test set and the other part as a training set;
S2: pre-training at least one feature extraction model, and storing the trained feature extraction model;
S3: inputting the training set into each feature extraction model respectively for feature extraction, and performing linear fitting on the extracted features, the fitting target being the ground-truth labels of the training set, wherein the test set is used to quantify the model training quality during training;
S4: the feature extraction model comprises a loss function; the loss function is defined, and an optimizer performs optimization using the defined loss function as the criterion;
S5: saving the feature extraction model with the optimized loss function as the final model; traversing the training set again, selecting a classification threshold, and entering the algorithm application stage;
S6: in the algorithm application stage, inputting a video generated from new data into the final model for feature extraction;
S7: feeding the extracted features into a linear regression model and extracting the output of the final activation layer; if the average value of the results is greater than the classification threshold, the video is judged to be a false video.
Further, as to the algorithm for false video generation: in S1, the algorithm for false video generation is a DeepFake-related open-source algorithm.
Further, as to the construction of the test set and training set: in S1, the duration of the false videos and the unmodified videos is 1-3 minutes, the number of false videos is 3000-3500, and the number of unmodified videos is 2000-2500; 15%-20% of all videos are extracted as the test set, and the remaining 80%-85% serve as the training set.
Further, as to the feature extraction models: in S2, three feature extraction models, namely a "generative adversarial network (GAN) discrimination model", a "front/side face comparison model" and an "expression-action classification model", are trained in advance and stored separately.
A generative adversarial network (GAN) is adopted to integrate the target face picture with the video to be modified and generate new video frame images, which are then spliced into a complete video. The problem with this is that, under a limited number of iterations, the trained network model cannot fit the target face picture onto the face in the original video with one hundred percent fidelity. The first entry point of the system is therefore to judge whether a video image has been modified by exploiting the pattern of color changes around the face region.
Further, as to the training of the "generative adversarial network discrimination model": the model is trained by contrastive learning on samples extracted from real videos and samples extracted from output generated by feeding random parameters into a DeepFake model. The samples are face regions; the adversarial (discriminator) part of the GAN identifies unnatural color changes and splicing around false face regions and classifies real and false videos, thereby defining a loss function. The adversarial model in the GAN is the discrimination model.
When standard models of algorithms such as DeepFake are pre-trained, most are trained on frontal face images, so the side-face images in a video cannot be generated well. Video frames in which the horizontal rotation angle of the face exceeds a certain value are therefore extracted from the video under test for secondary screening, and the captured side-face images are compared with several frontal face images in the video (horizontal rotation angle of the face close to 0 degrees). If the similarity of the two face images falls below a certain threshold, the side-face image was not successfully generated from the target portrait; that is, the frontal and side-face images of a given person in the video belong to different people. The second entry point of the system is thus to judge whether the video has been modified by means of face feature comparison.
Further, as to the training of the "front/side face comparison model": false video data is fed into an algorithm for face detection and angle judgment, and frontal face samples and side-face samples are extracted. An average face sample is extracted from the frontal face samples and fed into a first face recognition model for face feature extraction, while the side-face samples are fed into a second face recognition model for face feature extraction. The "front/side face comparison model" is a face feature comparison model: the face features extracted by the first face recognition model and those extracted by the second face recognition model are fed into it, and real and false videos are trained and classified.
Further, as to the face feature comparison: the comparison relies on judging the angle of the face; the face angle is determined with the aid of the 68 face landmark positions extracted from the face region, and the angle is calculated from the affine transformation matrix between a standard face and the detected face.
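As a minimal sketch of this angle calculation, assuming OpenCV and two 68-point landmark arrays are available: a 2-D affine fit between the standard face and the detected face recovers an in-plane rotation angle, while estimating the horizontal (yaw) rotation used to separate frontal from side faces would in practice need a 3-D pose solver, so the function below is illustrative only:

```python
import numpy as np
import cv2

def face_rotation_angle(detected_pts, standard_pts):
    """Estimate the rotation between a detected 68-point face and a standard
    (frontal) face template from the affine transformation matrix fitted
    between the two landmark sets (arrays of shape (68, 2))."""
    M, _ = cv2.estimateAffinePartial2D(standard_pts.astype(np.float32),
                                       detected_pts.astype(np.float32))
    # M[:, :2] is a scaled 2x2 rotation; its first column gives the angle.
    return float(np.degrees(np.arctan2(M[1, 0], M[0, 0])))
```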
Because human facial expressions are varied, DeepFake requires several pictures of the target person when making a false video, and these pictures should cover as many expressions and angles as possible. In practice, however, enough suitable pictures of the target person often cannot be obtained, so the trained DeepFake model is prone to overfitting; that is, unnatural states such as "stiff expressions" appear in the generated video images, whereas the people in the source video show no such artifacts. Exploiting this, an expression recognition algorithm can be combined with an LSTM long short-term memory network, which is sensitive to changes over the time domain, and the output is trained for binary classification with a Binary Cross-Entropy loss function. The data set consists of false videos generated with DeepFake and unmodified, non-repeating videos. Whether the character's expression in the input video is "stiff" is judged, and a weighted average of the two outputs determines whether the video has been modified by a DeepFake algorithm.
Further, as to the training of the "expression-action classification model": false video data is fed into a long short-term memory network to extract expression features, and real and false videos are classified by the "expression-action classification model", which is an expression-falseness classification model.
Further, as to the expression feature extraction: the long short-term memory network captures expression-change features in the time domain, so that video segments with unnatural expression changes are identified by the expression-falseness classification model, achieving the purpose of detecting false videos.
Further, as to the loss function defined in the "expression-action classification model": the loss function is a Binary Cross-Entropy loss function, which is used to perform binary classification training of the "expression-action classification model".
Further, as to the definition of the loss function: in S4, the loss function is the binary cross-entropy loss function.
Further, as to the optimizer: in S4, the optimizer is an Adam optimizer.
Further, as to the optimization of the loss function: the final loss value of the optimized loss function is below 1e-6.
Further, as to the classification threshold: the classification threshold is the optimal point on the ROC curve, selected using the "elbow rule".
Further, as to the activation layer function: the activation layer function is a Sigmoid activation function. A minimal sketch of the resulting decision step follows.
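By way of illustration only, the linear-regression-plus-Sigmoid decision of S6-S7 can be sketched as follows; the feature dimension, parameter names, and the example threshold of 0.5 are assumptions made for the sketch, not values disclosed by the invention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_video(features, weights, bias, threshold):
    """Score per-segment feature vectors with the fitted linear model, squash
    each score through the Sigmoid activation, and average; a mean above the
    classification threshold marks the video as false."""
    scores = sigmoid(features @ weights + bias)   # one score per segment
    return float(scores.mean()) > threshold

# Illustrative usage with random stand-in features:
rng = np.random.default_rng(0)
feats = rng.normal(size=(30, 128))    # 30 segments, 128-d fused features
w, b = rng.normal(size=128), 0.0      # parameters from the linear fit (S3-S4)
print("false video" if classify_video(feats, w, b, threshold=0.5) else "real video")
```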
The invention provides a method that uses computer vision processing technology similar to DeepFake to quickly judge whether an input video file has been transformed by DeepFake-like algorithms. It can help screen out false videos, and the use of three feature extraction models improves the accuracy of false video detection.
Drawings
Fig. 1 is a general flow diagram of the present invention.
FIG. 2 is a schematic diagram of the training of the generative adversarial network feature extraction model.
FIG. 3 is a schematic diagram of training a front/side face contrast feature extraction model.
Fig. 4 is a schematic diagram of the expression and motion classification feature extraction model training.
Detailed Description
Embodiments of the present invention are described with reference to Figs. 1-4. The method is divided into two stages: a model training stage and an algorithm application stage.
In the training stage, a DeepFake-related open-source algorithm is downloaded in advance, and 3500 false videos with a duration of 3 minutes plus 2000 unmodified videos are generated using the applicant's own data; 15% of the 5500 videos are extracted as the test set, and the remaining 85% serve as the training set.
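By way of illustration, this 15%/85% split could be produced as follows (scikit-learn; the file names are placeholders, and stratification is an assumption made here to preserve the class ratio):

```python
from sklearn.model_selection import train_test_split

# 3500 DeepFake clips (label 1) and 2000 unmodified clips (label 0).
paths = [f"fake_{i}.mp4" for i in range(3500)] + \
        [f"real_{i}.mp4" for i in range(2000)]
labels = [1] * 3500 + [0] * 2000

# Hold out 15% of the 5500 videos as the test set.
train_x, test_x, train_y, test_y = train_test_split(
    paths, labels, test_size=0.15, stratify=labels, random_state=42)
```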
As shown in fig. 1, three feature extraction models, namely the "generative adversarial network discrimination model", the "front/side face comparison model" and the "expression-action classification model", are trained in advance, and the three trained models are then stored. The training set is then input into each of the three models for feature extraction, and the extracted features are linearly fitted, the fitting target being the ground-truth labels of the training set; the test set is used to quantify the model training quality during training.
As shown in fig. 2, the "generative adversarial network discrimination model" is trained by contrastive learning on samples extracted from real videos and samples extracted from output generated by feeding random parameters into a DeepFake model. The samples are face regions; the adversarial (discriminator) part of the GAN identifies unnatural color changes and splicing around false face regions and classifies real and false videos, thereby defining a loss function. The adversarial model in the GAN is the discrimination model.
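For illustration, one contrastive training step of such a discrimination model might look like the following PyTorch sketch; the CNN layer sizes, the 128x128 face-crop resolution, and the learning rate are assumptions, since the patent does not disclose a concrete architecture:

```python
import torch
import torch.nn as nn

# A small CNN discriminator over aligned face crops (3 x 128 x 128).
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(128 * 16 * 16, 1),  # one logit: real vs. generated face region
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_faces, fake_faces):
    """One contrastive step: face crops from real videos against crops produced
    by feeding random parameters through a DeepFake generator."""
    optimizer.zero_grad()
    logits = discriminator(torch.cat([real_faces, fake_faces]))
    labels = torch.cat([torch.ones(len(real_faces), 1),
                        torch.zeros(len(fake_faces), 1)])
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```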
As shown in fig. 3, the "front/side face comparison model" is trained by feeding the false video data into an algorithm for face detection and angle judgment and extracting frontal face samples and side-face samples. An average face sample is extracted from the frontal face samples and fed into the first face recognition model for face feature extraction, while the side-face samples are fed into the second face recognition model for face feature extraction. The "front/side face comparison model" is a face feature comparison model: the face features extracted by the first face recognition model and those extracted by the second face recognition model are fed into it, and real and false videos are trained and classified.
As shown in fig. 4, the "expression-action classification model" is trained by feeding the false video data into a long short-term memory network for expression feature extraction, and real and false videos are classified by the "expression-action classification model", which is an expression-falseness classification model.
The expression recognition algorithm is combined with an LSTM long short-term memory network sensitive to changes over the time domain, and the output is trained for binary classification with a Binary Cross-Entropy loss function. The data set for training the "expression-action classification model" consists of 3500 one-minute false videos generated with DeepFake and 2000 unmodified, non-repeating videos. Whether the character's expression in the input video is "stiff" is judged, and a weighted average of the two outputs determines whether the video has been modified by a DeepFake algorithm.
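A minimal PyTorch sketch of the LSTM-based binary classifier described here; the per-frame expression feature dimension and the hidden size are assumptions, and the expression features themselves are presumed to come from the upstream expression recognition algorithm:

```python
import torch
import torch.nn as nn

class ExpressionLSTM(nn.Module):
    """Binary classifier over per-frame expression features: the LSTM captures
    how expressions evolve in the time domain, so 'stiff', unnatural expression
    changes in generated video can be separated from natural ones."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):          # x: (batch, frames, feat_dim)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # one logit per clip

model = ExpressionLSTM()
criterion = nn.BCEWithLogitsLoss()  # the Binary Cross-Entropy loss in the text
optimizer = torch.optim.Adam(model.parameters())
```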
After the three feature extraction models are trained, optimization is performed with an Adam optimizer, using the binary cross-entropy loss function as the criterion. The feature extraction model whose final loss falls below 1e-6 is stored as the final model; the training set is then traversed again, the optimal point on the ROC curve is selected as the classification threshold using the "elbow rule", and the algorithm application stage begins.
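The threshold selection can be illustrated as follows; reading the "elbow rule" as the ROC operating point closest to the ideal corner (FPR = 0, TPR = 1) is an assumption, since the patent does not define the rule precisely:

```python
import numpy as np
from sklearn.metrics import roc_curve

def elbow_threshold(y_true, y_score):
    """Traverse the training-set scores once and return the threshold whose
    ROC point lies closest to the ideal corner (FPR = 0, TPR = 1)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    distances = np.hypot(fpr, 1.0 - tpr)  # distance of each point to (0, 1)
    return thresholds[np.argmin(distances)]
```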
In the application stage, a video to be classified is cut into several video segments each no longer than 3 minutes, which are classified separately. The segments are first input into the three pre-trained models for feature extraction; the extracted features are then fed into the linear regression model, and the output of the final activation layer, a Sigmoid activation function, is extracted. If the average value of the results is greater than the classification threshold, the video is judged to be a false video.
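A schematic of this application stage is sketched below; `feature_models` stands for the three pre-trained extractors and `score` for the fitted linear-regression/Sigmoid head, neither name coming from the patent:

```python
def detect_false_video(frames, fps, feature_models, score, threshold,
                       max_minutes=3):
    """Cut the video into segments of at most `max_minutes`, extract features
    from each segment with the three pre-trained models, score them with the
    linear-regression + Sigmoid head, and average the per-segment results;
    a mean above the classification threshold marks the video as false."""
    seg_len = int(max_minutes * 60 * fps)
    scores = []
    for start in range(0, len(frames), seg_len):
        segment = frames[start:start + seg_len]
        feats = [m(segment) for m in feature_models]  # GAN / face / expression
        scores.append(score(feats))                   # Sigmoid output in [0, 1]
    return sum(scores) / len(scores) > threshold
```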
The invention provides a method that uses computer vision processing technology similar to DeepFake to quickly judge whether an input video file has been transformed by DeepFake-like algorithms. It can help screen out false videos, and the use of three feature extraction models improves the accuracy of false video detection.

Claims (5)

1. A false video detection method based on computer vision and deep learning algorithms, characterized by comprising the following steps:
S1: downloading an algorithm for false video generation in advance, and generating false videos and unmodified videos using its own data, one part of which is used as a test set and the other part as a training set;
S2: pre-training at least one feature extraction model, and storing the trained feature extraction model;
S3: inputting the training set into each feature extraction model respectively for feature extraction, and performing linear fitting on the extracted features, the fitting target being the ground-truth labels of the training set, wherein the test set is used to quantify the training quality of the model during training;
S4: the feature extraction model comprises a loss function; the loss function is defined, and an optimizer performs optimization using the defined loss function as the criterion;
S5: saving the feature extraction model with the optimized loss function as the final model; traversing the training set again, selecting a classification threshold, and entering the algorithm application stage;
S6: in the algorithm application stage, inputting a video generated from new data into the final model for feature extraction;
S7: feeding the extracted features into a linear regression model and extracting the output of the final activation layer; if the average value of the results is greater than the classification threshold, the video is judged to be a false video;
in S2, three feature extraction models, namely a "generative adversarial network (GAN) discrimination model", a "front/side face comparison model" and an "expression-action classification model", are trained in advance and then stored separately;
the "generative adversarial network discrimination model" is trained by contrastive learning on samples extracted from real videos and samples extracted from output generated by feeding random parameters into a DeepFake model; the samples are face regions; the adversarial model in the GAN identifies unnatural color changes and splicing around false face regions and classifies real and false videos, thereby defining a loss function; the adversarial model in the GAN is the discrimination model;
the "front/side face comparison model" is trained by feeding false video data into an algorithm for face detection and angle judgment and extracting frontal face samples and side-face samples; an average face sample is extracted from the frontal face samples and fed into a first face recognition model for face feature extraction, while the side-face samples are fed into a second face recognition model for face feature extraction; the "front/side face comparison model" is a face feature comparison model, into which the face features extracted by the first face recognition model and those extracted by the second face recognition model are fed, and real and false videos are trained and classified;
the "expression-action classification model" is trained by feeding false video data into a long short-term memory network for expression feature extraction, and real and false videos are classified by the "expression-action classification model", which is an expression-falseness classification model;
the long short-term memory network extracts expression features by capturing expression-change features in the time domain, so that video segments with unnatural expression changes are identified by the expression-falseness classification model, achieving the purpose of detecting false videos.
2. The method for detecting false videos based on computer vision and deep learning algorithms according to claim 1, wherein, in S1, the algorithm for false video generation is a DeepFake-related open-source algorithm.
3. The method for detecting false videos based on computer vision and deep learning algorithms according to claim 1, wherein, in S1, the duration of the false videos and the unmodified videos is 1-3 minutes, the number of false videos is 3000-3500, the number of unmodified videos is 2000-2500, 15%-20% of all videos are extracted as the test set, and the remaining 80%-85% serve as the training set.
4. The false video detection method based on computer vision and deep learning algorithms according to claim 1, wherein the face feature comparison relies on face angle determination; the face angle is determined with the aid of the 68 face landmark positions extracted from the face region, and the angle is calculated from the affine transformation matrix between a standard face and the detected face.
5. The false video detection method based on computer vision and deep learning algorithms according to claim 1, wherein the loss function defined by the "expression-action classification model" is a Binary Cross-Entropy loss function, which is used to perform binary classification training of the "expression-action classification model".
CN202010158340.2A 2020-03-09 2020-03-09 False video detection method based on computer vision and deep learning algorithm Active CN111368764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010158340.2A CN111368764B (en) 2020-03-09 2020-03-09 False video detection method based on computer vision and deep learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010158340.2A CN111368764B (en) 2020-03-09 2020-03-09 False video detection method based on computer vision and deep learning algorithm

Publications (2)

Publication Number Publication Date
CN111368764A (en) 2020-07-03
CN111368764B (en) 2023-02-21

Family

ID=71208643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010158340.2A Active CN111368764B (en) 2020-03-09 2020-03-09 False video detection method based on computer vision and deep learning algorithm

Country Status (1)

Country Link
CN (1) CN111368764B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950497B (en) * 2020-08-20 2022-07-01 重庆邮电大学 AI face-changing video detection method based on multitask learning model
CN112200001A (en) * 2020-09-11 2021-01-08 南京星耀智能科技有限公司 Depth-forged video identification method in specified scene
CN112580521B (en) * 2020-12-22 2024-02-20 浙江工业大学 Multi-feature true and false video detection method based on MAML (maximum likelihood markup language) element learning algorithm
CN112613480A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Face recognition method, face recognition system, electronic equipment and storage medium
CN112733733A (en) * 2021-01-11 2021-04-30 中国科学技术大学 Counterfeit video detection method, electronic device and storage medium
CN112861671B (en) * 2021-01-27 2022-10-21 电子科技大学 Method for identifying deeply forged face image and video
CN113628754B (en) * 2021-08-12 2022-04-08 武剑 Cerebrovascular disease dynamic prediction model construction method and system based on artificial intelligence
CN115937994B (en) * 2023-01-06 2023-05-30 南昌大学 Data detection method based on deep learning detection model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017024963A1 (en) * 2015-08-11 2017-02-16 阿里巴巴集团控股有限公司 Image recognition method, measure learning method and image source recognition method and device
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
WO2019114580A1 (en) * 2017-12-13 2019-06-20 深圳励飞科技有限公司 Living body detection method, computer apparatus and computer-readable storage medium
CN110647659A (en) * 2019-09-27 2020-01-03 上海依图网络科技有限公司 Imaging system and video processing method


Also Published As

Publication number Publication date
CN111368764A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111368764B (en) False video detection method based on computer vision and deep learning algorithm
Matern et al. Exploiting visual artifacts to expose deepfakes and face manipulations
US11195037B2 (en) Living body detection method and system, computer-readable storage medium
CN111950497B (en) AI face-changing video detection method based on multitask learning model
CN108171158B (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
CN111126069A (en) Social media short text named entity identification method based on visual object guidance
KR102132407B1 (en) Method and apparatus for estimating human emotion based on adaptive image recognition using incremental deep learning
CN109918987A (en) A kind of video caption keyword recognition method and device
CN110414367B (en) Time sequence behavior detection method based on GAN and SSN
CN106780464A (en) A kind of fabric defect detection method based on improvement Threshold segmentation
Zhang et al. Face spoofing video detection using spatio-temporal statistical binary pattern
Arafah et al. Face recognition system using Viola Jones, histograms of oriented gradients and multi-class support vector machine
CN117975577A (en) Deep forgery detection method and system based on facial dynamic integration
CN113658108A (en) Glass defect detection method based on deep learning
JP2011170890A (en) Face detecting method, face detection device, and program
CN108154116A (en) A kind of image-recognizing method and system
CN111191584A (en) Face recognition method and device
KR20200124887A (en) Method and Apparatus for Creating Labeling Model with Data Programming
KR102342495B1 (en) Method and Apparatus for Creating Labeling Model with Data Programming
CN115457620A (en) User expression recognition method and device, computer equipment and storage medium
Tang et al. Fast background subtraction using improved GMM and graph cut
CN114022938A (en) Method, device, equipment and storage medium for visual element identification
Nallapati et al. Identification of Deepfakes using Strategic Models and Architectures
Jaiswal et al. Saliency based automatic image cropping using support vector machine classifier
CN111353353A (en) Cross-posture face recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant