CN112488013B - Depth-forged video detection method and system based on time sequence inconsistency - Google Patents

Depth-forged video detection method and system based on time sequence inconsistency

Info

Publication number
CN112488013B
CN112488013B (application CN202011417127.5A)
Authority
CN
China
Prior art keywords
video
network
lstm
time sequence
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011417127.5A
Other languages
Chinese (zh)
Other versions
CN112488013A (en)
Inventor
陈龙
陈函
邱林坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011417127.5A priority Critical patent/CN112488013B/en
Publication of CN112488013A publication Critical patent/CN112488013A/en
Application granted granted Critical
Publication of CN112488013B publication Critical patent/CN112488013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a depth-forged (deepfake) video detection method and system based on time sequence inconsistency, and belongs to the field of video detection. The method comprises the following steps: S1, acquiring a video data set, preprocessing the data, and obtaining face images from the video frames; S2, inputting the video frames into a fine-tuned Xception network augmented with a convolutional block attention module (CBAM) for training, and extracting frame-level features; S3, extracting features of consecutive video frames with the trained Xception network, and inputting the extracted features into a bidirectional long short-term memory network + conditional random field model for training; and S4, performing forgery detection on the video to be tested by using the trained model. The invention exploits the inter-frame time sequence inconsistency caused by forgery techniques and combines the bidirectional long short-term memory network with the conditional random field algorithm, improving deepfake video detection to a certain extent.

Description

Depth-forged video detection method and system based on time sequence inconsistency
Technical Field
The invention belongs to the field of video detection, and relates to a depth forgery video detection method and system based on time sequence inconsistency.
Background
With social development and technological progress, more and more people share their lives on social media by posting photos and videos. However, as video editing and counterfeiting tools become more diverse (Adobe Premiere, Adobe Photoshop, Lightworks), forging videos becomes more convenient, and some lawbreakers profit from forged photos and videos. Meanwhile, with the rise of machine learning, deep learning has been combined with video counterfeiting technology: faces are forged by training codecs (encoder-decoder networks), making counterfeit videos even harder to distinguish from real ones. For example, the face-swapping application ZAO needs only one picture to replace the face in a video with the face in that picture. These counterfeiting techniques challenge the integrity, authenticity and reliability of video, which can have serious consequences for both individuals and society.
Research on deepfake video focuses mainly on face-swap videos, and detection methods fall into two categories: detection within video frames and detection across video frames. Current mainstream methods concentrate on features within individual frames: when faces are forged by deep learning, local defects such as jitter appear on the face because of inconsistent resolution, inconsistent illumination and similar causes, so the authenticity of the video is analyzed from the features of each frame, and deep learning methods are used to automatically learn and capture these defects. However, because forgery is performed frame by frame, time sequence inconsistencies in expression, illumination and the like arise between a previous frame and the next frame, and such inter-frame inconsistency cannot be captured by intra-frame detection methods alone.
Disclosure of Invention
In view of the above, the present invention provides a method and a system for detecting a depth-forged video based on timing inconsistency.
In order to achieve the purpose, the invention provides the following technical scheme:
A depth-forged video detection method based on time sequence inconsistency comprises the following steps:
S1, acquiring an experimental data set, dividing the data set into a training set, a verification set and a test set, processing each video into video frames, extracting the faces in the video frames, and keeping only video frames containing faces;
S2, inputting the processed video frames into a fine-tuned Xception network, wherein a convolutional block attention module (CBAM) is added to the Xception network; training the network by combining spatial and channel attention, and saving the parameters when the model achieves the best effect;
S3, extracting features of each consecutive K-frame sequence of the video by using the trained Xception model, taking the features of every K frames as a group as the input of a bidirectional long short-term memory network Bilstm for training, adding a conditional random field CRF to adjust the prediction result of the Bilstm, and saving the parameters when the model achieves the best effect;
and S4, detecting the video to be tested by using the trained bidirectional long short-term memory network Bilstm, and evaluating the performance of the model by outputting test accuracy and other metrics.
Optionally, the step S1 specifically includes:
S11, dividing the videos into training, verification and test sets according to a certain proportion, labeling the real videos and the forged videos, and then sampling a certain proportion of frames from each video according to the video frame rate;
and S12, detecting the face region of each acquired frame with a face detector, aligning the face region by facial landmarks, and then normalizing the image to a fixed pixel size.
Optionally, the step S2 specifically includes:
S21, introducing a convolutional block attention module (CBAM) before the global pooling layer of the Xception network, combining the channel attention and spatial attention sub-modules in sequence to infer their respective attention weights, multiplying the attention weights by the feature map extracted by block-14 of the Xception network to automatically adjust the features to a certain degree, and finally fine-tuning the Xception + CBAM network;
and S22, inputting the labeled video frames into the Xception + CBAM network to train video frame feature extraction, and saving the parameters when the model achieves the best effect.
Optionally, the step S3 specifically includes:
S31, extracting the features of N consecutive frames of the video by using the trained Xception + CBAM network; a fully connected layer follows the global pooling layer of the Xception network and outputs a 512-dimensional feature map, which is used as the input of the bidirectional long short-term memory network Bilstm;
S32, the Bilstm consists of a forward lstm and a backward lstm; the extracted frame features of the video sequence are input into the forward lstm and the backward lstm respectively for time sequence analysis, and the feature vectors generated by the forward and backward lstm are concatenated, in combination with the context information of the video frames, before classification and prediction;
S33, the lstm controls the forgetting and memorizing of information through a forget gate, an input gate and an output gate, and passes on information useful for subsequent time sequence analysis, so that the current frame is predicted in combination with the context information of the video frame features; the three lstm gates are computed as follows:
forget gate: determines by calculation which unimportant information is forgotten and which important information is retained; the formula is:
f_t = σ(b_f·[h_{t-1}, x_t] + k_f)
wherein f_t takes a value between 0 and 1 and represents how much of the network state at the previous moment is retained;
input gate: adds new information and updates the state by calculation; the formulas are:
i_t = σ(b_i·[h_{t-1}, x_t] + k_i)
C̃_t = tanh(b_c·[h_{t-1}, x_t] + k_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein i_t indicates the value to be updated, C̃_t represents the new candidate network information, and C_t represents the updated network state;
output gate: judges the state features to output by combining the updated information; the formulas are:
O_t = σ(b_o·[h_{t-1}, x_t] + k_o)
h_t = O_t * tanh(C_t)
wherein O_t is the judgment condition of the output and h_t is the final output;
in the formulas, b and k denote the weight matrix and the bias respectively, σ is the sigmoid function, and [h_{t-1}, x_t] denotes the concatenation of the previous output h_{t-1} with the current input x_t;
S34, after the above calculation, concatenating the respective time sequence outputs h_t of the forward lstm and the backward lstm, and inputting the resulting probabilities of each label into a conditional random field (CRF); by learning constraints on its own, the CRF layer selects the optimal time sequence output for classifying forged videos; the prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction sequence;
and S35, inputting the features extracted in S2 into the Bilstm + CRF for training, and saving the parameters when the model achieves the best effect.
Optionally, the step S4 specifically includes:
S41, inputting the test set into the trained model, and taking the evaluation results of consecutive K frames to classify the video;
and S42, calculating Accuracy, Precision, Recall and F1 to evaluate the performance of the detection method.
The depth-forged video detection system based on time sequence inconsistency comprises the following units: a data preprocessing module, a video frame feature extraction module, a video frame time sequence analysis module and a fake video classification module;
the data preprocessing module is used for dividing the data set into a training set, a verification set and a test set, framing the videos according to the frame rate, extracting the face after alignment by facial landmarks, and normalizing the obtained face picture;
the video frame feature extraction module introduces a convolutional block attention module (CBAM) to learn better video frame level features;
the video frame time sequence analysis module: considering that an lstm in a single direction cannot take future information into account, a bidirectional long short-term memory network Bilstm analyzes the time sequence consistency of the input sequence in combination with the context information of the feature sequence, and the prediction result of the Bilstm is finally optimized through a conditional random field CRF;
and the counterfeit video classification module inputs the test set into the whole network for detection, and evaluates the performance of the system by calculating the Accuracy index.
The invention has the beneficial effects that:
the method and the system for detecting the deep forged video based on the time sequence inconsistency provided by the invention can extract more detailed characteristics for the subsequent analysis by improving the Xconcept network, and further capture the inconsistency among frames by inputting the extracted characteristics into the bidirectional long-short term memory network (Bilstm) time sequence analysis, thereby effectively overcoming the misjudgment probability caused by the intra-frame detection and fully utilizing the context information. And then optimizing the output of the Bilstm through a Conditional Random Field (CRF) to obtain the optimal test result, and greatly improving the detection precision of the video.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the depth-forged video detection method based on time sequence inconsistency according to the present invention;
FIG. 2 is a diagram of the Xception + CBAM network structure in step S2;
FIG. 3 is a schematic diagram of the time sequence features analyzed by the Bilstm + CRF in step S3;
FIG. 4 is a diagram of the calculation process of the three LSTM gates in step S3;
FIG. 5 is a schematic structural diagram of the depth-forged video detection system based on time sequence inconsistency according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; for a better explanation of the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and their descriptions may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
The invention relates to a method and a system for detecting deepfake videos through the time sequence consistency between video frames. By combining the advantages of the Xception network and CBAM, a convolutional block attention module (CBAM) is added to the Xception network; the output feature maps of the channel and spatial attention sub-modules are connected and multiplied with the feature map extracted by the Xception network to achieve an optimization effect, so that the Xception network captures more information useful for later time sequence analysis when extracting the feature sequence.
On the other hand, when analyzing the video time sequence, a unidirectional long short-term memory network (lstm) only memorizes past information and ignores future time sequence information; therefore, a bidirectional long short-term memory network (Bilstm) is adopted for time sequence analysis, and the information extracted by the forward and backward lstm is concatenated and input to a conditional random field (CRF) for optimization to obtain the final detection result. By improving the Xception network and combining the advantages of the Bilstm in time sequence analysis, the method not only reduces the training parameters but also improves the detection accuracy for forged videos.
The specific implementation mode is as follows:
first embodiment
Fig. 1 is a schematic flowchart of a depth-forged video detection method based on time sequence inconsistency according to an exemplary embodiment; the method includes the following steps:
Step S1: The obtained forged video set is divided into a training set, a verification set and a test set according to a certain proportion, with equal numbers of real and forged videos; the real videos and forged videos are labeled, real videos as 0 and forged videos as 1. A certain number of frames is taken from each video according to its frame rate using ffmpeg. The face region of each obtained frame is detected by an mtcnn face detector, the face region is aligned by facial landmarks, the face image is saved, and normalization is performed to 240 × 240 pixels.
Step S2: and inputting the processed video frame of S1 into an Xcaption network for training of feature extraction. Some channel and spatial information is lost to the output due to the global pooling layer of the Xception network. Therefore, an attention mechanism module CBAM of the convolution module is introduced before the global pooling layer of the Xconcentration network, and the attention of the channel and the attention of the space module are combined together to deduce the respective attention weight. The main operations in CBAM are:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
where F is the feature map extracted by the Xception network, Mc is the processing of the channel attention module, Ms is the processing of the spatial attention module, and ⊗ denotes element-wise multiplication.
As shown in the network structure diagram of Xception + CBAM in fig. 2, firstly, the features extracted from the Xception network are input into the channel attention module, and are respectively subjected to global average pooling and global maximum pooling, and then are respectively input into the neural network for processing, and the calculation formula is as follows:
Mc(F) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))
where Mc is the channel attention map, W_1 and W_0 are the weights of the multilayer perceptron, F_avg^c and F_max^c are the results of global average pooling and global max pooling respectively, and σ is the sigmoid function.
The new feature F' obtained from the channel attention module is input into the spatial attention module, where global max pooling and global average pooling are performed simultaneously, followed by a convolution; the result is finally multiplied with the input feature to obtain the new feature. The calculation formula is:
Ms(F') = σ(f^{7×7}([F'_avg^s ; F'_max^s]))
where Ms is the spatial attention map, f^{7×7} denotes a 7 × 7 convolution, [F'_avg^s ; F'_max^s] denotes the concatenation of the globally average-pooled and globally max-pooled features, and σ is the sigmoid function.
Finally, the new feature F'' obtained from the CBAM module is input into the global average pooling layer of the Xception network, and the parameters of the Xception + CBAM network are then fine-tuned. The labeled video frames are input into the Xception + CBAM network for end-to-end training of video frame feature extraction, and the parameters when the model achieves the best effect are saved.
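As an illustration of the CBAM block described in this step, here is a minimal PyTorch sketch (channel attention followed by spatial attention, matching the formulas above); it is not the authors' implementation, and the reduction ratio of the shared MLP is an assumed value. In the described network such a module would sit after block-14 of Xception and before the global average pooling layer.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention, then spatial attention."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP W1(W0(.)) applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # 7x7 convolution over the concatenated [avg; max] spatial maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # Channel attention: Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: Ms(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')]))
        avg_s = torch.mean(x, dim=1, keepdim=True)
        max_s, _ = torch.max(x, dim=1, keepdim=True)
        x = x * torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return x


# Shape check with a dummy feature map (2048 channels, as output by Xception's last block).
print(CBAM(2048)(torch.randn(2, 2048, 8, 8)).shape)  # torch.Size([2, 2048, 8, 8])
```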
Step S3: the process of analyzing the timing characteristics by combining the bidirectional long and short term network (Bilstm) and the Conditional Random Field (CRF) is shown in FIG. 3, and the main steps are as follows:
extracting the characteristics of continuous 25 frames of video by using a trained Xcaption + CBAM network, obtaining an output 512-dimensional time sequence characteristic diagram after passing through a Global Average Pooling layer of the Xcaption network, and carrying out the operation
The feature map is used as an input of a bidirectional long-short term network (Bilstm), the input format of the Bilstm is (samples, times, dim), samples represents the total number of samples, times represents the time sequence length of processing, and dim represents the dimension number of the input feature. The blstm is composed of a forward lstm and a backward lstm. The method comprises the steps of respectively inputting the characteristics of video sequence frames extracted by an Xscene + CBAM network into a forward lstm and a backward lstm for time sequence analysis, splicing characteristic vectors generated by the forward lstm and the backward lstm by combining context information of video frames, and then carrying out classified prediction. The lstm controls the forgetting and memorizing information through a forgetting gate, an input gate and an output gate, and transmits useful information for subsequent time sequence analysis, so that the useful information is combined with the context information of the video frame characteristics to predict the current frame. The calculation to obtain the three lstm gates as shown in FIG. 4 is as follows:
forget the door: and determining which unimportant information is forgotten by calculation, and retaining which important information. The formula is (wherein f) t Is a value from 0 to 1 indicating the retention of the network state at the previous time):
f t =σ(b f [h t-1 ,x t ]+k f )
an input gate: the new information is added and the information is updated through calculation, and the formula is (i) t Indicates a value to be updated; c t Representing new candidate cell information, Ct representing updating the network state):
i t =σ(b i [h t-1 ,x t ]+k i )
Figure BDA0002819006970000071
Figure BDA0002819006970000072
an output gate: the updated information is combined to judge the state characteristics of the output network, and the formula is (O) t A judgment condition indicating an output; h is t Representing the final output):
O t =σ(b o [h t-1 ,x t ]+k o )
h t =O t *tanh(C t )
wherein b and k in the formula represent weight matrix and bias respectively, and sigma is sigmoid function, [ h ] t-1 ,x t ]Output h indicating the last state to be t-1 Input x with current state t And (6) splicing. The respective timings of the forward lstm and backward lstm are output as h through the above calculation t And splicing to obtain the probability of each label of each video frame.
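A minimal PyTorch sketch of the bidirectional LSTM head described above follows; the (samples, timesteps, dim) = (batch, 25, 512) input shape matches this section, while the hidden size, the number of labels and the final linear layer that produces per-frame label scores are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BiLstmHead(nn.Module):
    """Bidirectional LSTM over per-frame Xception+CBAM features.

    Input shape follows the (samples, timesteps, dim) convention of this section,
    e.g. (batch, 25, 512). Hidden size and label count are illustrative choices.
    """

    def __init__(self, feat_dim=512, hidden=256, num_labels=2):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Per-frame emission scores (forward and backward outputs concatenated).
        self.emit = nn.Linear(2 * hidden, num_labels)

    def forward(self, x):                 # x: (batch, 25, 512)
        h, _ = self.bilstm(x)             # h: (batch, 25, 2*hidden)
        return self.emit(h)               # (batch, 25, num_labels)


# Shape check with random features standing in for the extracted frame sequence.
feats = torch.randn(4, 25, 512)
print(BiLstmHead()(feats).shape)          # torch.Size([4, 25, 2])
```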
These probability sequences are input into a conditional random field (CRF), which, by learning constraints on its own, selects the best time sequence output for classifying forged videos. The prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction. The CRF scores a sequence as follows:
S(X, Y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i}
wherein A is the label transition score automatically learned by the CRF during training, X is the input feature sequence, Y is the corresponding output label sequence, and P is the prediction probability of the Bilstm; the loss function is computed as:
Loss = -log( e^{S(X, Y)} / Σ_{Y'} e^{S(X, Y')} ) = log Σ_{Y'} e^{S(X, Y')} - S(X, Y)
where Y' ranges over all possible label sequences.
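The following toy sketch illustrates how the score S(X, Y) above combines the Bilstm emission scores with the learned transition matrix A; it is a simplified illustration rather than a full CRF layer (in practice a library such as pytorch-crf could compute the logsumexp term of the loss via the forward algorithm).

```python
# Hypothetical helper: score one candidate label sequence under the CRF.
import torch


def crf_score(emissions, tags, transitions):
    """S(X, Y): emissions is (T, num_labels), tags is (T,) long, transitions is (L, L)."""
    T = tags.shape[0]
    emit = emissions[torch.arange(T), tags].sum()    # sum of P_{i, y_i}
    trans = transitions[tags[:-1], tags[1:]].sum()   # sum of A_{y_i, y_{i+1}}
    return emit + trans


emissions = torch.randn(25, 2)            # per-frame real/fake scores from the Bilstm
tags = torch.randint(0, 2, (25,))         # one candidate label sequence
A = torch.randn(2, 2)                     # transition matrix learned during training
print(crf_score(emissions, tags, A))
```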
according to the above steps, the features extracted in S2 are input into the blstm + CRF for training, and the parameters at which the model achieves the best effect are saved.
Step S4: and inputting the test set into the trained model to obtain an evaluation result of continuous K frames, and performing true and false classification on the video to be detected according to the obtained probability. And finally, evaluating the performance of the invention by calculating Accuracy (Accuracy), Precision (Precision), Recall (Recall), F1 and the like.
Second embodiment
Referring to fig. 5, a depth-forged video detection system based on inter-frame time sequence consistency comprises the following units: a data preprocessing module, a video frame feature extraction module, a video frame time sequence analysis module and a fake video classification module.
The data preprocessing module is used for processing the experimental data and mainly comprises three units: data set division, frame extraction and face extraction. When dividing the data set, it is split into a training set, a verification set and a test set according to a certain proportion; each video is then framed; finally the face is extracted and the obtained face pictures are normalized to uniform pixels. When extracting the face, a face detector first frames the face region, and the face is then extracted after alignment by facial landmarks, which improves the face detection rate.
The video frame feature extraction module mainly comprises two units: processing by the Xception network and the convolutional block attention module (CBAM). Considering that the output of the Xception network after the global pooling layer loses channel and spatial information to a certain extent, the convolutional block attention module (CBAM) is introduced to compute channel and spatial importance and extract more frame-level semantic features. In the CBAM part, the video frame first passes through the first 14 blocks of the Xception network to extract the feature map, and the extracted feature map is then input into the channel attention module for processing before entering the spatial attention module.
The video frame time sequence analysis module mainly comprises time sequence analysis of a bidirectional long-short term memory network (Bilstm) and result optimization of a Conditional Random Field (CRF). Considering that the unidirectional lstm cannot consider future information, the input sequence is subjected to time sequence consistency analysis through a bidirectional long-short term memory network (Bilstm) in combination with the context information of the feature sequence, and the feature sequence of the current input is comprehensively analyzed by splicing the calculated results of the forward lstm and the backward lstm. And finally, optimizing the prediction result of the Bilstm by a Conditional Random Field (CRF).
And the counterfeit video classification module is used for inputting the test set into the whole network for detection and classifying the videos according to the final output probability.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (4)

1. A depth-forged video detection method based on time sequence inconsistency, characterized by comprising the following steps:
S1, acquiring an experimental data set, dividing the data set into a training set, a verification set and a test set, processing each video into video frames, extracting the faces in the video frames, and keeping only video frames containing faces;
S2, inputting the processed video frames into an Xception network, wherein a convolutional block attention module CBAM is added to the Xception network; training the network by combining spatial and channel attention, and saving the parameters when the model achieves the best effect;
S3, extracting features of each consecutive K-frame sequence of the video by using the trained Xception model, taking the features of every K frames as a group as the input of a bidirectional long short-term memory network (Bilstm) for training, wherein a conditional random field (CRF) is added to adjust the prediction result of the Bilstm, and saving the parameters when the model achieves the best effect;
S4, detecting the video to be tested by using the trained bidirectional long short-term memory network Bilstm, and evaluating the performance of the model by outputting the test accuracy;
the step S2 specifically includes:
S21, introducing a convolutional block attention module CBAM before the global pooling layer of the Xception network, combining the channel attention and spatial attention sub-modules in sequence to infer their respective attention weights, multiplying the attention weights by the feature map extracted by block-14 of the Xception network to automatically adjust the features, and finally adjusting the Xception + CBAM network;
S22, inputting the labeled video frames into the Xception + CBAM network to train video frame feature extraction, and saving the parameters when the model achieves the best effect;
the step S3 specifically includes:
S31, extracting the features of N consecutive frames of the video by using the trained Xception + CBAM network; a fully connected layer is connected behind the global pooling layer of the Xception network and outputs a 512-dimensional feature map, which is used as the input of the bidirectional long short-term memory network Bilstm;
S32, the Bilstm consists of a forward lstm and a backward lstm; the extracted frame features of the video sequence are input into the forward lstm and the backward lstm respectively for time sequence analysis, and the feature vectors generated by the forward and backward lstm are concatenated, in combination with the context information of the video frames, before classification and prediction;
S33, the lstm controls the forgetting and memorizing of information through a forget gate, an input gate and an output gate, and passes on information useful for subsequent time sequence analysis, so that the current frame is predicted in combination with the context information of the video frame features; the three lstm gates are computed as follows:
forget gate: determines by calculation which unimportant information is forgotten and which important information is retained; the formula is:
f_t = σ(b_f·[h_{t-1}, x_t] + k_f)
wherein f_t takes a value between 0 and 1 and represents how much of the network state at the previous moment is retained;
input gate: adds new information and updates the state by calculation; the formulas are:
i_t = σ(b_i·[h_{t-1}, x_t] + k_i)
C̃_t = tanh(b_c·[h_{t-1}, x_t] + k_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein i_t indicates the value to be updated, C̃_t represents the new candidate network information, and C_t represents the updated network state;
output gate: judges the state features to output by combining the updated information; the formulas are:
O_t = σ(b_o·[h_{t-1}, x_t] + k_o)
h_t = O_t * tanh(C_t)
wherein O_t is the judgment condition of the output and h_t is the final output;
in the formulas, b and k denote the weight matrix and the bias respectively, σ is the sigmoid function, and [h_{t-1}, x_t] denotes the concatenation of the previous output h_{t-1} with the current input x_t;
S34, after the above calculation, concatenating the respective time sequence outputs h_t of the forward lstm and the backward lstm, and inputting the resulting probabilities of each label into a conditional random field (CRF); by learning constraints on its own, the CRF layer selects the optimal time sequence output for classifying forged videos; the prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction sequence;
and S35, inputting the features extracted in S2 into the Bilstm + CRF for training, and saving the parameters when the model achieves the best effect.
2. The method for detecting depth-forged video based on time sequence inconsistency according to claim 1, wherein the step S1 specifically includes:
S11, dividing the videos into training, verification and test sets according to a certain proportion, labeling the real videos and the forged videos, and then sampling a certain proportion of frames from each video according to the video frame rate;
and S12, detecting the face region of each acquired frame with a face detector, aligning the face region by facial landmarks, and then normalizing the image to a fixed pixel size.
3. The method for detecting depth-forged video based on time sequence inconsistency according to claim 1, wherein the step S4 specifically includes:
S41, inputting the test set into the trained model, and taking the evaluation results of consecutive K frames to classify the video;
and S42, calculating Accuracy, Precision, Recall and F1 to evaluate the performance of the detection method.
4. A depth-forged video detection system based on time sequence inconsistency, characterized by comprising the following units: a data preprocessing module, a video frame feature extraction module, a video frame time sequence analysis module and a fake video classification module;
the data preprocessing module is used for dividing the data set into a training set, a verification set and a test set, framing the videos according to the frame rate, extracting the face after alignment by facial landmarks, and normalizing the obtained face picture;
the video frame feature extraction module introduces a convolutional block attention module CBAM to learn better video frame level features, and specifically includes:
S21, introducing a convolutional block attention module CBAM before the global pooling layer of the Xception network, combining the channel attention and spatial attention sub-modules in sequence to infer their respective attention weights, multiplying the attention weights by the feature map extracted by block-14 of the Xception network to automatically adjust the features, and finally adjusting the Xception + CBAM network;
S22, inputting the labeled video frames into the Xception + CBAM network to train video frame feature extraction, and saving the parameters when the model achieves the best effect;
the video frame time sequence analysis module, considering that a unidirectional lstm cannot take future information into account, performs time sequence consistency analysis on the input sequence with a bidirectional long short-term memory network (Bilstm) in combination with the context information of the feature sequence, and finally optimizes the prediction result of the Bilstm through a conditional random field (CRF);
it specifically includes:
S31, extracting the features of N consecutive frames of the video by using the trained Xception + CBAM network; a fully connected layer follows the global pooling layer of the Xception network and outputs a 512-dimensional feature map, which is used as the input of the bidirectional long short-term memory network Bilstm;
S32, the Bilstm consists of a forward lstm and a backward lstm; the extracted frame features of the video sequence are input into the forward lstm and the backward lstm respectively for time sequence analysis, and the feature vectors generated by the forward and backward lstm are concatenated, in combination with the context information of the video frames, before classification and prediction;
S33, the lstm controls the forgetting and memorizing of information through a forget gate, an input gate and an output gate, and passes on information useful for subsequent time sequence analysis, so that the current frame is predicted in combination with the context information of the video frame features; the three lstm gates are computed as follows:
forget gate: determines by calculation which unimportant information is forgotten and which important information is retained; the formula is:
f_t = σ(b_f·[h_{t-1}, x_t] + k_f)
wherein f_t takes a value between 0 and 1 and represents how much of the network state at the previous moment is retained;
input gate: adds new information and updates the state by calculation; the formulas are:
i_t = σ(b_i·[h_{t-1}, x_t] + k_i)
C̃_t = tanh(b_c·[h_{t-1}, x_t] + k_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein i_t indicates the value to be updated, C̃_t represents the new candidate network information, and C_t represents the updated network state;
output gate: judges the state features to output by combining the updated information; the formulas are:
O_t = σ(b_o·[h_{t-1}, x_t] + k_o)
h_t = O_t * tanh(C_t)
wherein O_t is the judgment condition of the output and h_t is the final output;
in the formulas, b and k denote the weight matrix and the bias respectively, σ is the sigmoid function, and [h_{t-1}, x_t] denotes the concatenation of the previous output h_{t-1} with the current input x_t;
S34, after the above calculation, concatenating the respective time sequence outputs h_t of the forward lstm and the backward lstm, and inputting the resulting probabilities of each label into a conditional random field (CRF); by learning constraints on its own, the CRF layer selects the optimal time sequence output for classifying forged videos; the prediction results are scored through the transition matrix and loss function of the CRF, and the highest-scoring sequence is finally selected as the final prediction sequence;
S35, inputting the features extracted in S2 into the Bilstm + CRF for training, and saving the parameters when the model achieves the best effect;
and the counterfeit video classification module inputs the test set into the whole network for detection, and evaluates the performance of the system by calculating an Accuracy index.
CN202011417127.5A 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency Active CN112488013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011417127.5A CN112488013B (en) 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011417127.5A CN112488013B (en) 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency

Publications (2)

Publication Number Publication Date
CN112488013A CN112488013A (en) 2021-03-12
CN112488013B true CN112488013B (en) 2022-09-02

Family

ID=74940255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011417127.5A Active CN112488013B (en) 2020-12-04 2020-12-04 Depth-forged video detection method and system based on time sequence inconsistency

Country Status (1)

Country Link
CN (1) CN112488013B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205044B (en) * 2021-04-30 2022-09-30 湖南大学 Deep fake video detection method based on characterization contrast prediction learning
CN113326400B (en) * 2021-06-29 2024-01-12 合肥高维数据技术有限公司 Evaluation method and system of model based on depth fake video detection
CN113570564B (en) * 2021-07-21 2024-02-27 同济大学 Multi-definition fake face video detection method based on multi-path convolution network
CN113537110B (en) * 2021-07-26 2024-04-26 北京计算机技术及应用研究所 False video detection method fusing intra-frame differences
CN113989713B (en) * 2021-10-28 2023-05-12 杭州中科睿鉴科技有限公司 Depth forgery detection method based on video frame sequence prediction
CN114550268A (en) * 2022-03-01 2022-05-27 北京赛思信安技术股份有限公司 Depth-forged video detection method utilizing space-time characteristics
CN115273186A (en) * 2022-07-18 2022-11-01 中国人民警察大学 Depth-forged face video detection method and system based on image feature fusion
CN115049969B (en) * 2022-08-15 2022-12-13 山东百盟信息技术有限公司 Bad video detection method for improving YOLOv3 and BiConvLSTM
CN116486464B (en) * 2023-06-20 2023-09-01 齐鲁工业大学(山东省科学院) Attention mechanism-based face counterfeiting detection method for convolution countermeasure network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108462886A (en) * 2018-05-09 2018-08-28 国网浙江省电力有限公司 Forgery recognition methods based on time frequency analysis
CN109862350A (en) * 2019-02-27 2019-06-07 江南大学 No-reference video quality evaluating method based on time-space domain feature extraction
WO2019134987A1 (en) * 2018-01-05 2019-07-11 Deepmind Technologies Limited Parallel video processing systems
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110287875A (en) * 2019-06-25 2019-09-27 腾讯科技(深圳)有限公司 Detection method, device, electronic equipment and the storage medium of video object
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN111353395A (en) * 2020-02-19 2020-06-30 南京信息工程大学 Face changing video detection method based on long-term and short-term memory network
CN111709408A (en) * 2020-08-18 2020-09-25 腾讯科技(深圳)有限公司 Image authenticity detection method and device
CN111914613A (en) * 2020-05-21 2020-11-10 淮阴工学院 Multi-target tracking and facial feature information identification method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network
CN111050023A (en) * 2019-12-17 2020-04-21 深圳追一科技有限公司 Video detection method and device, terminal equipment and storage medium
CN111967344B (en) * 2020-07-28 2023-06-20 南京信息工程大学 Face fake video detection oriented refinement feature fusion method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019134987A1 (en) * 2018-01-05 2019-07-11 Deepmind Technologies Limited Parallel video processing systems
CN108462886A (en) * 2018-05-09 2018-08-28 国网浙江省电力有限公司 Forgery recognition methods based on time frequency analysis
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN109862350A (en) * 2019-02-27 2019-06-07 江南大学 No-reference video quality evaluating method based on time-space domain feature extraction
CN110287875A (en) * 2019-06-25 2019-09-27 腾讯科技(深圳)有限公司 Detection method, device, electronic equipment and the storage medium of video object
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN111353395A (en) * 2020-02-19 2020-06-30 南京信息工程大学 Face changing video detection method based on long-term and short-term memory network
CN111914613A (en) * 2020-05-21 2020-11-10 淮阴工学院 Multi-target tracking and facial feature information identification method
CN111709408A (en) * 2020-08-18 2020-09-25 腾讯科技(深圳)有限公司 Image authenticity detection method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ahmed F. Hagar et al. Emotion Recognition In Videos For Low-Memory Systems Using Deep-Learning. 2019 14th International Conference on Computer Engineering and Systems (ICCES). 2020. *
Zhang Xueli. Research on inter-frame tampering forensics of digital video based on content continuity. China Master's Theses Full-text Database, Information Science and Technology, 2019(09). *
Gao Yifei et al. Performance analysis and comparison of five popular fake-face video detection networks. Journal of Applied Sciences, 2019, Vol. 37(5). *

Also Published As

Publication number Publication date
CN112488013A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112488013B (en) Depth-forged video detection method and system based on time sequence inconsistency
CN108090902B (en) Non-reference image quality objective evaluation method based on multi-scale generation countermeasure network
CN111400547B (en) Human-computer cooperation video anomaly detection method
CN111079640B (en) Vehicle type identification method and system based on automatic amplification sample
CN110880172A (en) Video face tampering detection method and system based on cyclic convolution neural network
CN112800876A (en) Method and system for embedding hypersphere features for re-identification
CN111539351B (en) Multi-task cascading face frame selection comparison method
CN114333070A (en) Examinee abnormal behavior detection method based on deep learning
US9378406B2 (en) System for estimating gender from fingerprints
CN110909657A (en) Method for identifying apparent tunnel disease image
CN116206327A (en) Image classification method based on online knowledge distillation
CN111539456A (en) Target identification method and device
CN111144462A (en) Unknown individual identification method and device for radar signals
CN108154199B (en) High-precision rapid single-class target detection method based on deep learning
CN113011399A (en) Video abnormal event detection method and system based on generation cooperative judgment network
CN117218680A (en) Scenic spot abnormity monitoring data confirmation method and system
CN113962999B (en) Noise label segmentation method based on Gaussian mixture model and label correction model
CN115984639A (en) Intelligent detection method for fatigue state of part
CN115331135A (en) Method for detecting Deepfake video based on multi-domain characteristic region standard score difference
CN115205743A (en) Electrical equipment integrity monitoring method based on TSN and attention LSTM network model
CN115393802A (en) Railway scene unusual invasion target identification method based on small sample learning
CN113989742A (en) Nuclear power station plant pedestrian detection method based on multi-scale feature fusion
CN110312103A (en) A kind of high-speed equipment anti-thefting monitoring method for processing video frequency based on cloud computing platform
CN113658112B (en) Bow net anomaly detection method based on template matching and neural network algorithm
CN116935494B (en) Multi-person sitting posture identification method based on lightweight network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant