CN112488013B - Depth-forged video detection method and system based on time sequence inconsistency - Google Patents

Depth-forged video detection method and system based on time sequence inconsistency

Info

Publication number
CN112488013B
CN112488013B (application CN202011417127.5A)
Authority
CN
China
Prior art keywords
video
network
lstm
time sequence
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011417127.5A
Other languages
Chinese (zh)
Other versions
CN112488013A (en)
Inventor
陈龙
陈函
邱林坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011417127.5A priority Critical patent/CN112488013B/en
Publication of CN112488013A publication Critical patent/CN112488013A/en
Application granted granted Critical
Publication of CN112488013B publication Critical patent/CN112488013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a depth-forged (deepfake) video detection method and system based on time sequence inconsistency, and belongs to the field of video detection. The method comprises the following steps: S1, acquiring a video data set, preprocessing the data, and obtaining face images from the video frames; S2, inputting the video frames into a fine-tuned Xception network augmented with a convolutional block attention module (CBAM) for training, and extracting frame-level features; S3, extracting features of consecutive video frames with the trained Xception network, and inputting the extracted features into a bidirectional long short-term memory network + conditional random field model for training; and S4, performing forgery detection on the video to be tested by using the trained model. The invention exploits the inter-frame time sequence inconsistency caused by forgery techniques and combines the bidirectional long short-term memory network with the conditional random field algorithm, improving deepfake video detection to a certain extent.

Description

Depth-forged video detection method and system based on time sequence inconsistency
Technical Field
The invention belongs to the field of video detection, and relates to a depth forgery video detection method and system based on time sequence inconsistency.
Background
With social development and technological progress, more and more people share their lives on social media by posting photos and videos. However, as video editing and counterfeiting tools become more diverse (Adobe Premiere, Adobe Photoshop, Lightworks), forging videos becomes more convenient, and some lawbreakers profit from forged photos and videos. Meanwhile, with the rise of machine learning, deep learning has been combined with video counterfeiting technology: faces are forged by training codecs (encoder-decoder networks), making counterfeit videos even harder to distinguish from real ones. For example, the face-swapping application ZAO needs only one picture to replace the face in a video with the face in that picture. These counterfeiting techniques challenge the integrity, authenticity and reliability of video, which can have serious consequences for both individuals and society.
Research on deepfake video focuses mainly on face-swap videos, and detection methods fall into two categories: detection within video frames and detection across video frames. Current mainstream methods concentrate on features within individual frames: when faces are forged by deep learning, local defects such as jitter appear on the face because of inconsistent resolution, inconsistent illumination and similar causes, so the authenticity of the video is analyzed from the features of each frame, and deep learning methods are used to automatically learn and capture these defects. However, because forgery is performed frame by frame, time sequence inconsistencies in expression, illumination and the like arise between a previous frame and the next frame, and such inter-frame inconsistency cannot be captured by intra-frame detection methods alone.
Disclosure of Invention
In view of the above, the present invention provides a method and a system for detecting a depth-forged video based on timing inconsistency.
In order to achieve the purpose, the invention provides the following technical scheme:
A depth-forged video detection method based on time sequence inconsistency comprises the following steps:
S1, acquiring an experimental data set, dividing the data set into a training set, a verification set and a test set, processing each video into video frames, extracting the faces in the video frames, and keeping only video frames containing faces;
S2, inputting the processed video frames into a fine-tuned Xception network, wherein a convolutional block attention module (CBAM) is added to the Xception network; training the network by combining spatial and channel attention, and saving the parameters when the model achieves the best effect;
S3, extracting features of each consecutive K-frame sequence of the video by using the trained Xception model, taking the features of every K frames as a group as the input of a bidirectional long short-term memory network Bilstm for training, adding a conditional random field CRF to adjust the prediction result of the Bilstm, and saving the parameters when the model achieves the best effect;
and S4, detecting the video to be tested by using the trained bidirectional long short-term memory network Bilstm, and evaluating the performance of the model by outputting test accuracy and other metrics.
Optionally, the step S1 specifically includes:
S11, dividing the videos into training, verification and test sets according to a certain proportion, labeling the real videos and the forged videos, and then sampling a certain proportion of frames from each video according to the video frame rate;
and S12, detecting the face region of each acquired frame with a face detector, aligning the face region by facial landmarks, and then normalizing the image to a fixed pixel size.
Optionally, the step S2 specifically includes:
S21, introducing a convolutional block attention module (CBAM) before the global pooling layer of the Xception network, combining the channel attention and spatial attention sub-modules in sequence to infer their respective attention weights, multiplying the attention weights by the feature map extracted by block-14 of the Xception network to automatically adjust the features to a certain degree, and finally fine-tuning the Xception + CBAM network;
and S22, inputting the labeled video frames into the Xception + CBAM network to train video frame feature extraction, and saving the parameters when the model achieves the best effect.
Optionally, the step S3 specifically includes:
S31, extracting the features of N consecutive frames of the video by using the trained Xception + CBAM network; a fully connected layer follows the global pooling layer of the Xception network and outputs a 512-dimensional feature map, which is used as the input of the bidirectional long short-term memory network Bilstm;
S32, the Bilstm consists of a forward lstm and a backward lstm; the extracted frame features of the video sequence are input into the forward lstm and the backward lstm respectively for time sequence analysis, and the feature vectors generated by the forward and backward lstm are concatenated, in combination with the context information of the video frames, before classification and prediction;
S33, the lstm controls the forgetting and memorizing of information through a forget gate, an input gate and an output gate, and passes on information useful for subsequent time sequence analysis, so that the current frame is predicted in combination with the context information of the video frame features; the three lstm gates are computed as follows:
forget gate: determines by calculation which unimportant information is forgotten and which important information is retained; the formula is:
f_t = σ(b_f·[h_{t-1}, x_t] + k_f)
wherein f_t takes a value between 0 and 1 and represents how much of the network state at the previous moment is retained;
input gate: adds new information and updates the state by calculation; the formulas are:
i_t = σ(b_i·[h_{t-1}, x_t] + k_i)
C̃_t = tanh(b_c·[h_{t-1}, x_t] + k_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein i_t indicates the value to be updated, C̃_t represents the new candidate network information, and C_t represents the updated network state;
output gate: judges the state features to output by combining the updated information; the formulas are:
O_t = σ(b_o·[h_{t-1}, x_t] + k_o)
h_t = O_t * tanh(C_t)
wherein O_t is the judgment condition of the output and h_t is the final output;
in the formulas, b and k denote the weight matrix and the bias respectively, σ is the sigmoid function, and [h_{t-1}, x_t] denotes the concatenation of the previous output h_{t-1} with the current input x_t;
S34, after the above calculation, concatenating the respective time sequence outputs h_t of the forward lstm and the backward lstm, and inputting the resulting probabilities of each label into a conditional random field (CRF); by learning constraints on its own, the CRF layer selects the optimal time sequence output for classifying forged videos; the prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction sequence;
and S35, inputting the features extracted in S2 into the Bilstm + CRF for training, and saving the parameters when the model achieves the best effect.
Optionally, the step S4 specifically includes:
S41, inputting the test set into the trained model, and taking the evaluation results of consecutive K frames to classify the video;
and S42, calculating Accuracy, Precision, Recall and F1 to evaluate the performance of the detection method.
The depth-forged video detection system based on time sequence inconsistency comprises the following units: a data preprocessing module, a video frame feature extraction module, a video frame time sequence analysis module and a fake video classification module;
the data preprocessing module is used for dividing the data set into a training set, a verification set and a test set, framing the videos according to the frame rate, extracting the face after alignment by facial landmarks, and normalizing the obtained face picture;
the video frame feature extraction module introduces a convolutional block attention module (CBAM) to learn better video frame level features;
the video frame time sequence analysis module: considering that an lstm in a single direction cannot take future information into account, a bidirectional long short-term memory network Bilstm analyzes the time sequence consistency of the input sequence in combination with the context information of the feature sequence, and the prediction result of the Bilstm is finally optimized through a conditional random field CRF;
and the counterfeit video classification module inputs the test set into the whole network for detection, and evaluates the performance of the system by calculating the Accuracy index.
The invention has the beneficial effects that:
the method and the system for detecting the deep forged video based on the time sequence inconsistency provided by the invention can extract more detailed characteristics for the subsequent analysis by improving the Xconcept network, and further capture the inconsistency among frames by inputting the extracted characteristics into the bidirectional long-short term memory network (Bilstm) time sequence analysis, thereby effectively overcoming the misjudgment probability caused by the intra-frame detection and fully utilizing the context information. And then optimizing the output of the Bilstm through a Conditional Random Field (CRF) to obtain the optimal test result, and greatly improving the detection precision of the video.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the depth-forged video detection method based on time sequence inconsistency according to the present invention;
FIG. 2 is a diagram of the Xception + CBAM network structure in step S2;
FIG. 3 is a schematic diagram of the time sequence features analyzed by the Bilstm + CRF in step S3;
FIG. 4 is a diagram of the calculation process of the three LSTM gates in step S3;
FIG. 5 is a schematic structural diagram of the depth-forged video detection system based on time sequence inconsistency according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; for a better explanation of the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and their descriptions may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
The invention relates to a method and a system for detecting deepfake videos through the time sequence consistency between video frames. By combining the advantages of the Xception network and CBAM, a convolutional block attention module (CBAM) is added to the Xception network; the output feature maps of the channel and spatial attention sub-modules are connected and multiplied with the feature map extracted by the Xception network to achieve an optimization effect, so that the Xception network captures more information useful for later time sequence analysis when extracting the feature sequence.
On the other hand, when analyzing the video time sequence, a unidirectional long short-term memory network (lstm) only memorizes past information and ignores future time sequence information; therefore, a bidirectional long short-term memory network (Bilstm) is adopted for time sequence analysis, and the information extracted by the forward and backward lstm is concatenated and input to a conditional random field (CRF) for optimization to obtain the final detection result. By improving the Xception network and combining the advantages of the Bilstm in time sequence analysis, the method not only reduces the training parameters but also improves the detection accuracy for forged videos.
The specific implementation mode is as follows:
first embodiment
Fig. 1 is a schematic flowchart of a depth-forged video detection method based on time sequence inconsistency according to an exemplary embodiment; the method includes the following steps:
Step S1: The obtained forged video set is divided into a training set, a verification set and a test set according to a certain proportion, with equal numbers of real and forged videos; the real videos and forged videos are labeled, real videos as 0 and forged videos as 1. A certain number of frames is taken from each video according to its frame rate using ffmpeg. The face region of each obtained frame is detected by an mtcnn face detector, the face region is aligned by facial landmarks, the face image is saved, and normalization is performed to 240 × 240 pixels.
Step S2: and inputting the processed video frame of S1 into an Xcaption network for training of feature extraction. Some channel and spatial information is lost to the output due to the global pooling layer of the Xception network. Therefore, an attention mechanism module CBAM of the convolution module is introduced before the global pooling layer of the Xconcentration network, and the attention of the channel and the attention of the space module are combined together to deduce the respective attention weight. The main operations in CBAM are:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
where F is the feature map extracted by the Xception network, Mc is the processing of the channel attention module, Ms is the processing of the spatial attention module, and ⊗ denotes element-wise multiplication.
As shown in the network structure diagram of Xception + CBAM in fig. 2, firstly, the features extracted from the Xception network are input into the channel attention module, and are respectively subjected to global average pooling and global maximum pooling, and then are respectively input into the neural network for processing, and the calculation formula is as follows:
Mc(F) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))
where Mc is the channel attention map, W_1 and W_0 are the weights of the multilayer perceptron, F_avg^c and F_max^c are the results of global average pooling and global max pooling respectively, and σ is the sigmoid function.
The new feature F' obtained from the channel attention module is input into the spatial attention module, where global max pooling and global average pooling are performed simultaneously, followed by a convolution; the result is finally multiplied with the input feature to obtain the new feature. The calculation formula is:
Ms(F') = σ(f^{7×7}([F'_avg^s ; F'_max^s]))
where Ms is the spatial attention map, f^{7×7} denotes a 7 × 7 convolution, [F'_avg^s ; F'_max^s] denotes the concatenation of the globally average-pooled and globally max-pooled features, and σ is the sigmoid function.
Finally, the new feature F'' obtained from the CBAM module is input into the global average pooling layer of the Xception network, and the parameters of the Xception + CBAM network are then fine-tuned. The labeled video frames are input into the Xception + CBAM network for end-to-end training of video frame feature extraction, and the parameters when the model achieves the best effect are saved.
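As an illustration of the CBAM block described in this step, here is a minimal PyTorch sketch (channel attention followed by spatial attention, matching the formulas above); it is not the authors' implementation, and the reduction ratio of the shared MLP is an assumed value. In the described network such a module would sit after block-14 of Xception and before the global average pooling layer.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention, then spatial attention."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP W1(W0(.)) applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # 7x7 convolution over the concatenated [avg; max] spatial maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # Channel attention: Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: Ms(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')]))
        avg_s = torch.mean(x, dim=1, keepdim=True)
        max_s, _ = torch.max(x, dim=1, keepdim=True)
        x = x * torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return x


# Shape check with a dummy feature map (2048 channels, as output by Xception's last block).
print(CBAM(2048)(torch.randn(2, 2048, 8, 8)).shape)  # torch.Size([2, 2048, 8, 8])
```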
Step S3: the process of analyzing the timing characteristics by combining the bidirectional long and short term network (Bilstm) and the Conditional Random Field (CRF) is shown in FIG. 3, and the main steps are as follows:
extracting the characteristics of continuous 25 frames of video by using a trained Xcaption + CBAM network, obtaining an output 512-dimensional time sequence characteristic diagram after passing through a Global Average Pooling layer of the Xcaption network, and carrying out the operation
The feature map is used as an input of a bidirectional long-short term network (Bilstm), the input format of the Bilstm is (samples, times, dim), samples represents the total number of samples, times represents the time sequence length of processing, and dim represents the dimension number of the input feature. The blstm is composed of a forward lstm and a backward lstm. The method comprises the steps of respectively inputting the characteristics of video sequence frames extracted by an Xscene + CBAM network into a forward lstm and a backward lstm for time sequence analysis, splicing characteristic vectors generated by the forward lstm and the backward lstm by combining context information of video frames, and then carrying out classified prediction. The lstm controls the forgetting and memorizing information through a forgetting gate, an input gate and an output gate, and transmits useful information for subsequent time sequence analysis, so that the useful information is combined with the context information of the video frame characteristics to predict the current frame. The calculation to obtain the three lstm gates as shown in FIG. 4 is as follows:
forget the door: and determining which unimportant information is forgotten by calculation, and retaining which important information. The formula is (wherein f) t Is a value from 0 to 1 indicating the retention of the network state at the previous time):
f t =σ(b f [h t-1 ,x t ]+k f )
an input gate: the new information is added and the information is updated through calculation, and the formula is (i) t Indicates a value to be updated; c t Representing new candidate cell information, Ct representing updating the network state):
i t =σ(b i [h t-1 ,x t ]+k i )
Figure BDA0002819006970000071
Figure BDA0002819006970000072
an output gate: the updated information is combined to judge the state characteristics of the output network, and the formula is (O) t A judgment condition indicating an output; h is t Representing the final output):
O t =σ(b o [h t-1 ,x t ]+k o )
h t =O t *tanh(C t )
wherein b and k in the formula represent weight matrix and bias respectively, and sigma is sigmoid function, [ h ] t-1 ,x t ]Output h indicating the last state to be t-1 Input x with current state t And (6) splicing. The respective timings of the forward lstm and backward lstm are output as h through the above calculation t And splicing to obtain the probability of each label of each video frame.
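A minimal PyTorch sketch of the bidirectional LSTM head described above follows; the (samples, timesteps, dim) = (batch, 25, 512) input shape matches this section, while the hidden size, the number of labels and the final linear layer that produces per-frame label scores are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BiLstmHead(nn.Module):
    """Bidirectional LSTM over per-frame Xception+CBAM features.

    Input shape follows the (samples, timesteps, dim) convention of this section,
    e.g. (batch, 25, 512). Hidden size and label count are illustrative choices.
    """

    def __init__(self, feat_dim=512, hidden=256, num_labels=2):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Per-frame emission scores (forward and backward outputs concatenated).
        self.emit = nn.Linear(2 * hidden, num_labels)

    def forward(self, x):                 # x: (batch, 25, 512)
        h, _ = self.bilstm(x)             # h: (batch, 25, 2*hidden)
        return self.emit(h)               # (batch, 25, num_labels)


# Shape check with random features standing in for the extracted frame sequence.
feats = torch.randn(4, 25, 512)
print(BiLstmHead()(feats).shape)          # torch.Size([4, 25, 2])
```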
These probability sequences are input into a conditional random field (CRF), which, by learning constraints on its own, selects the best time sequence output for classifying forged videos. The prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction. The CRF scores a sequence as follows:
S(X, Y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i}
wherein A is the label transition score automatically learned by the CRF during training, X is the input feature sequence, Y is the corresponding output label sequence, and P is the prediction probability of the Bilstm; the loss function is computed as:
Loss = -log( e^{S(X, Y)} / Σ_{Y'} e^{S(X, Y')} ) = log Σ_{Y'} e^{S(X, Y')} - S(X, Y)
where Y' ranges over all possible label sequences.
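The following toy sketch illustrates how the score S(X, Y) above combines the Bilstm emission scores with the learned transition matrix A; it is a simplified illustration rather than a full CRF layer (in practice a library such as pytorch-crf could compute the logsumexp term of the loss via the forward algorithm).

```python
# Hypothetical helper: score one candidate label sequence under the CRF.
import torch


def crf_score(emissions, tags, transitions):
    """S(X, Y): emissions is (T, num_labels), tags is (T,) long, transitions is (L, L)."""
    T = tags.shape[0]
    emit = emissions[torch.arange(T), tags].sum()    # sum of P_{i, y_i}
    trans = transitions[tags[:-1], tags[1:]].sum()   # sum of A_{y_i, y_{i+1}}
    return emit + trans


emissions = torch.randn(25, 2)            # per-frame real/fake scores from the Bilstm
tags = torch.randint(0, 2, (25,))         # one candidate label sequence
A = torch.randn(2, 2)                     # transition matrix learned during training
print(crf_score(emissions, tags, A))
```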
according to the above steps, the features extracted in S2 are input into the blstm + CRF for training, and the parameters at which the model achieves the best effect are saved.
Step S4: and inputting the test set into the trained model to obtain an evaluation result of continuous K frames, and performing true and false classification on the video to be detected according to the obtained probability. And finally, evaluating the performance of the invention by calculating Accuracy (Accuracy), Precision (Precision), Recall (Recall), F1 and the like.
Second embodiment
Referring to fig. 5, a depth-forged video detection system based on inter-frame time sequence consistency comprises the following units: a data preprocessing module, a video frame feature extraction module, a video frame time sequence analysis module and a fake video classification module.
The data preprocessing module is used for processing the experimental data and mainly comprises three units: data set division, frame extraction and face extraction. When dividing the data set, it is split into a training set, a verification set and a test set according to a certain proportion; each video is then framed; finally the face is extracted and the obtained face pictures are normalized to uniform pixels. When extracting the face, a face detector first frames the face region, and the face is then extracted after alignment by facial landmarks, which improves the face detection rate.
The video frame feature extraction module mainly comprises two units: processing by the Xception network and the convolutional block attention module (CBAM). Considering that the output of the Xception network after the global pooling layer loses channel and spatial information to a certain extent, the convolutional block attention module (CBAM) is introduced to compute channel and spatial importance and extract more frame-level semantic features. In the CBAM part, the video frame first passes through the first 14 blocks of the Xception network to extract the feature map, and the extracted feature map is then input into the channel attention module for processing before entering the spatial attention module.
The video frame time sequence analysis module mainly comprises time sequence analysis of a bidirectional long-short term memory network (Bilstm) and result optimization of a Conditional Random Field (CRF). Considering that the unidirectional lstm cannot consider future information, the input sequence is subjected to time sequence consistency analysis through a bidirectional long-short term memory network (Bilstm) in combination with the context information of the feature sequence, and the feature sequence of the current input is comprehensively analyzed by splicing the calculated results of the forward lstm and the backward lstm. And finally, optimizing the prediction result of the Bilstm by a Conditional Random Field (CRF).
And the counterfeit video classification module is used for inputting the test set into the whole network for detection and classifying the videos according to the final output probability.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (4)

1. A depth-forged video detection method based on time sequence inconsistency, characterized by comprising the following steps:
S1, acquiring an experimental data set, dividing the data set into a training set, a verification set and a test set, processing each video into video frames, extracting the faces in the video frames, and keeping only video frames containing faces;
S2, inputting the processed video frames into an Xception network, wherein a convolutional block attention module CBAM is added to the Xception network; training the network by combining spatial and channel attention, and saving the parameters when the model achieves the best effect;
S3, extracting features of each consecutive K-frame sequence of the video by using the trained Xception model, taking the features of every K frames as a group as the input of a bidirectional long short-term memory network (Bilstm) for training, wherein a conditional random field (CRF) is added to adjust the prediction result of the Bilstm, and saving the parameters when the model achieves the best effect;
S4, detecting the video to be tested by using the trained bidirectional long short-term memory network Bilstm, and evaluating the performance of the model by outputting the test accuracy;
the step S2 specifically includes:
S21, introducing a convolutional block attention module CBAM before the global pooling layer of the Xception network, combining the channel attention and spatial attention sub-modules in sequence to infer their respective attention weights, multiplying the attention weights by the feature map extracted by block-14 of the Xception network to automatically adjust the features, and finally adjusting the Xception + CBAM network;
S22, inputting the labeled video frames into the Xception + CBAM network to train video frame feature extraction, and saving the parameters when the model achieves the best effect;
the step S3 specifically includes:
S31, extracting the features of N consecutive frames of the video by using the trained Xception + CBAM network; a fully connected layer is connected behind the global pooling layer of the Xception network and outputs a 512-dimensional feature map, which is used as the input of the bidirectional long short-term memory network Bilstm;
S32, the Bilstm consists of a forward lstm and a backward lstm; the extracted frame features of the video sequence are input into the forward lstm and the backward lstm respectively for time sequence analysis, and the feature vectors generated by the forward and backward lstm are concatenated, in combination with the context information of the video frames, before classification and prediction;
S33, the lstm controls the forgetting and memorizing of information through a forget gate, an input gate and an output gate, and passes on information useful for subsequent time sequence analysis, so that the current frame is predicted in combination with the context information of the video frame features; the three lstm gates are computed as follows:
forget gate: determines by calculation which unimportant information is forgotten and which important information is retained; the formula is:
f_t = σ(b_f·[h_{t-1}, x_t] + k_f)
wherein f_t takes a value between 0 and 1 and represents how much of the network state at the previous moment is retained;
input gate: adds new information and updates the state by calculation; the formulas are:
i_t = σ(b_i·[h_{t-1}, x_t] + k_i)
C̃_t = tanh(b_c·[h_{t-1}, x_t] + k_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein i_t indicates the value to be updated, C̃_t represents the new candidate network information, and C_t represents the updated network state;
output gate: judges the state features to output by combining the updated information; the formulas are:
O_t = σ(b_o·[h_{t-1}, x_t] + k_o)
h_t = O_t * tanh(C_t)
wherein O_t is the judgment condition of the output and h_t is the final output;
in the formulas, b and k denote the weight matrix and the bias respectively, σ is the sigmoid function, and [h_{t-1}, x_t] denotes the concatenation of the previous output h_{t-1} with the current input x_t;
S34, after the above calculation, concatenating the respective time sequence outputs h_t of the forward lstm and the backward lstm, and inputting the resulting probabilities of each label into a conditional random field (CRF); by learning constraints on its own, the CRF layer selects the optimal time sequence output for classifying forged videos; the prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction sequence;
and S35, inputting the features extracted in S2 into the Bilstm + CRF for training, and saving the parameters when the model achieves the best effect.
2. The method for detecting depth-forged video based on time sequence inconsistency according to claim 1, wherein the step S1 specifically includes:
S11, dividing the videos into training, verification and test sets according to a certain proportion, labeling the real videos and the forged videos, and then sampling a certain proportion of frames from each video according to the video frame rate;
and S12, detecting the face region of each acquired frame with a face detector, aligning the face region by facial landmarks, and then normalizing the image to a fixed pixel size.
3. The method for detecting depth-forged video based on time sequence inconsistency according to claim 1, wherein the step S4 specifically includes:
S41, inputting the test set into the trained model, and taking the evaluation results of consecutive K frames to classify the video;
and S42, calculating Accuracy, Precision, Recall and F1 to evaluate the performance of the detection method.
4. A depth-forged video detection system based on time sequence inconsistency, characterized by comprising the following units: a data preprocessing module, a video frame feature extraction module, a video frame time sequence analysis module and a fake video classification module;
the data preprocessing module is used for dividing the data set into a training set, a verification set and a test set, framing the videos according to the frame rate, extracting the face after alignment by facial landmarks, and normalizing the obtained face picture;
the video frame feature extraction module introduces a convolutional block attention module CBAM to learn better video frame level features, and specifically includes:
S21, introducing a convolutional block attention module CBAM before the global pooling layer of the Xception network, combining the channel attention and spatial attention sub-modules in sequence to infer their respective attention weights, multiplying the attention weights by the feature map extracted by block-14 of the Xception network to automatically adjust the features, and finally adjusting the Xception + CBAM network;
S22, inputting the labeled video frames into the Xception + CBAM network to train video frame feature extraction, and saving the parameters when the model achieves the best effect;
the video frame time sequence analysis module, considering that a unidirectional lstm cannot take future information into account, performs time sequence consistency analysis on the input sequence with a bidirectional long short-term memory network (Bilstm) in combination with the context information of the feature sequence, and finally optimizes the prediction result of the Bilstm through a conditional random field (CRF);
it specifically includes:
S31, extracting the features of N consecutive frames of the video by using the trained Xception + CBAM network; a fully connected layer follows the global pooling layer of the Xception network and outputs a 512-dimensional feature map, which is used as the input of the bidirectional long short-term memory network Bilstm;
S32, the Bilstm consists of a forward lstm and a backward lstm; the extracted frame features of the video sequence are input into the forward lstm and the backward lstm respectively for time sequence analysis, and the feature vectors generated by the forward and backward lstm are concatenated, in combination with the context information of the video frames, before classification and prediction;
S33, the lstm controls the forgetting and memorizing of information through a forget gate, an input gate and an output gate, and passes on information useful for subsequent time sequence analysis, so that the current frame is predicted in combination with the context information of the video frame features; the three lstm gates are computed as follows:
forget gate: determines by calculation which unimportant information is forgotten and which important information is retained; the formula is:
f_t = σ(b_f·[h_{t-1}, x_t] + k_f)
wherein f_t takes a value between 0 and 1 and represents how much of the network state at the previous moment is retained;
input gate: adds new information and updates the state by calculation; the formulas are:
i_t = σ(b_i·[h_{t-1}, x_t] + k_i)
C̃_t = tanh(b_c·[h_{t-1}, x_t] + k_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein i_t indicates the value to be updated, C̃_t represents the new candidate network information, and C_t represents the updated network state;
output gate: judges the state features to output by combining the updated information; the formulas are:
O_t = σ(b_o·[h_{t-1}, x_t] + k_o)
h_t = O_t * tanh(C_t)
wherein O_t is the judgment condition of the output and h_t is the final output;
in the formulas, b and k denote the weight matrix and the bias respectively, σ is the sigmoid function, and [h_{t-1}, x_t] denotes the concatenation of the previous output h_{t-1} with the current input x_t;
S34, after the above calculation, concatenating the respective time sequence outputs h_t of the forward lstm and the backward lstm, and inputting the resulting probabilities of each label into a conditional random field (CRF); by learning constraints on its own, the CRF layer selects the optimal time sequence output for classifying forged videos; the prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction sequence;
S35, inputting the features extracted in S2 into the Bilstm + CRF for training, and saving the parameters when the model achieves the best effect;
and the counterfeit video classification module inputs the test set into the whole network for detection, and evaluates the performance of the system by calculating an Accuracy index.
CN202011417127.5A 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency Active CN112488013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011417127.5A CN112488013B (en) 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011417127.5A CN112488013B (en) 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency

Publications (2)

Publication Number Publication Date
CN112488013A CN112488013A (en) 2021-03-12
CN112488013B true CN112488013B (en) 2022-09-02

Family

ID=74940255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011417127.5A Active CN112488013B (en) 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency

Country Status (1)

Country Link
CN (1) CN112488013B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205044B (en) * 2021-04-30 2022-09-30 湖南大学 Deep fake video detection method based on characterization contrast prediction learning
CN113326400B (en) * 2021-06-29 2024-01-12 合肥高维数据技术有限公司 Evaluation method and system of model based on depth fake video detection
CN113570564B (en) * 2021-07-21 2024-02-27 同济大学 Multi-definition fake face video detection method based on multi-path convolution network
CN113537110B (en) * 2021-07-26 2024-04-26 北京计算机技术及应用研究所 False video detection method fusing intra-frame differences
CN113989713B (en) * 2021-10-28 2023-05-12 杭州中科睿鉴科技有限公司 Depth forgery detection method based on video frame sequence prediction
CN114550268A (en) * 2022-03-01 2022-05-27 北京赛思信安技术股份有限公司 Depth-forged video detection method utilizing space-time characteristics
CN115273186A (en) * 2022-07-18 2022-11-01 中国人民警察大学 Depth-forged face video detection method and system based on image feature fusion
CN115049969B (en) * 2022-08-15 2022-12-13 山东百盟信息技术有限公司 Bad video detection method for improving YOLOv3 and BiConvLSTM
CN116486464B (en) * 2023-06-20 2023-09-01 齐鲁工业大学(山东省科学院) Attention mechanism-based face counterfeiting detection method for convolution countermeasure network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108462886A (en) * 2018-05-09 2018-08-28 国网浙江省电力有限公司 Forgery recognition methods based on time frequency analysis
CN109862350A (en) * 2019-02-27 2019-06-07 江南大学 No-reference video quality evaluating method based on time-space domain feature extraction
WO2019134987A1 (en) * 2018-01-05 2019-07-11 Deepmind Technologies Limited Parallel video processing systems
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110287875A (en) * 2019-06-25 2019-09-27 腾讯科技(深圳)有限公司 Detection method, device, electronic equipment and the storage medium of video object
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN111353395A (en) * 2020-02-19 2020-06-30 南京信息工程大学 Face changing video detection method based on long-term and short-term memory network
CN111709408A (en) * 2020-08-18 2020-09-25 腾讯科技(深圳)有限公司 Image authenticity detection method and device
CN111914613A (en) * 2020-05-21 2020-11-10 淮阴工学院 Multi-target tracking and facial feature information identification method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network
CN111050023A (en) * 2019-12-17 2020-04-21 深圳追一科技有限公司 Video detection method and device, terminal equipment and storage medium
CN111967344B (en) * 2020-07-28 2023-06-20 南京信息工程大学 Face fake video detection oriented refinement feature fusion method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019134987A1 (en) * 2018-01-05 2019-07-11 Deepmind Technologies Limited Parallel video processing systems
CN108462886A (en) * 2018-05-09 2018-08-28 国网浙江省电力有限公司 Forgery recognition methods based on time frequency analysis
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN109862350A (en) * 2019-02-27 2019-06-07 江南大学 No-reference video quality evaluating method based on time-space domain feature extraction
CN110287875A (en) * 2019-06-25 2019-09-27 腾讯科技(深圳)有限公司 Detection method, device, electronic equipment and the storage medium of video object
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN111353395A (en) * 2020-02-19 2020-06-30 南京信息工程大学 Face changing video detection method based on long-term and short-term memory network
CN111914613A (en) * 2020-05-21 2020-11-10 淮阴工学院 Multi-target tracking and facial feature information identification method
CN111709408A (en) * 2020-08-18 2020-09-25 腾讯科技(深圳)有限公司 Image authenticity detection method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ahmed F. Hagar et al. Emotion Recognition In Videos For Low-Memory Systems Using Deep-Learning. 2019 14th International Conference on Computer Engineering and Systems (ICCES). 2020. *
Zhang Xueli. Research on inter-frame tampering forensics of digital video based on content continuity. China Master's Theses Full-text Database, Information Science and Technology, 2019(09). *
Gao Yifei et al. Performance analysis and comparison of five popular fake-face video detection networks. Journal of Applied Sciences, 2019, Vol. 37(5). *

Also Published As

Publication number Publication date
CN112488013A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112488013B (en) Depth-forged video detection method and system based on time sequence inconsistency
CN108090902B (en) Non-reference image quality objective evaluation method based on multi-scale generation countermeasure network
CN111400547B (en) Human-computer cooperation video anomaly detection method
CN111079640B (en) Vehicle type identification method and system based on automatic amplification sample
CN110880172A (en) Video face tampering detection method and system based on cyclic convolution neural network
CN112800876A (en) Method and system for embedding hypersphere features for re-identification
CN111539351B (en) Multi-task cascading face frame selection comparison method
CN114333070A (en) Examinee abnormal behavior detection method based on deep learning
US9378406B2 (en) System for estimating gender from fingerprints
CN110909657A (en) Method for identifying apparent tunnel disease image
CN116206327A (en) Image classification method based on online knowledge distillation
CN111539456A (en) Target identification method and device
CN111144462A (en) Unknown individual identification method and device for radar signals
CN108154199B (en) High-precision rapid single-class target detection method based on deep learning
CN113011399A (en) Video abnormal event detection method and system based on generation cooperative judgment network
CN117218680A (en) Scenic spot abnormity monitoring data confirmation method and system
CN113962999B (en) Noise label segmentation method based on Gaussian mixture model and label correction model
CN115984639A (en) Intelligent detection method for fatigue state of part
CN115331135A (en) Method for detecting Deepfake video based on multi-domain characteristic region standard score difference
CN115205743A (en) Electrical equipment integrity monitoring method based on TSN and attention LSTM network model
CN115393802A (en) Railway scene unusual invasion target identification method based on small sample learning
CN113989742A (en) Nuclear power station plant pedestrian detection method based on multi-scale feature fusion
CN110312103A (en) A kind of high-speed equipment anti-thefting monitoring method for processing video frequency based on cloud computing platform
CN113658112B (en) Bow net anomaly detection method based on template matching and neural network algorithm
CN116935494B (en) Multi-person sitting posture identification method based on lightweight network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant