CN114067381A - Deep forgery identification method and device based on multi-feature fusion - Google Patents


Info

Publication number
CN114067381A
CN114067381A
Authority
CN
China
Prior art keywords
srm
learnable
input stream
video
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110473432.4A
Other languages
Chinese (zh)
Inventor
操晓春
韩冰
韩晓光
张华
李京知
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Shenzhen Research Institute of Big Data SRIBD
Original Assignee
Institute of Information Engineering of CAS
Shenzhen Research Institute of Big Data SRIBD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, Shenzhen Research Institute of Big Data SRIBD filed Critical Institute of Information Engineering of CAS
Priority to CN202110473432.4A
Publication of CN114067381A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep forgery identification method and device based on multi-feature fusion. The method mainly comprises the following steps: (1) performing segmented frame extraction on a video and face-alignment preprocessing; (2) processing the video frames with an RGB input stream and a learnable SRM input stream; (3) extracting features from the video frames with the RGB input stream and performing inter-frame fusion; (4) removing the non-differentiable parts of the classical SRM algorithm from the learnable SRM input stream, replacing the hyper-parameter q with thirty learnable 5×5 matrices, and initializing them; (5) converting the 30 preset SRM filters of the classical SRM algorithm into learnable SRM convolution kernels and inserting them into the identification network of step (3) to form a learnable SRM network; finally, fusing the outputs of the RGB stream and the learnable SRM stream to obtain the final recognition result. The invention effectively improves deep forgery identification on low-definition video.

Description

Deep forgery identification method and device based on multi-feature fusion
Technical Field
The invention belongs to the field of computer-vision deep forgery identification, and particularly relates to a deep forgery identification method and device based on multi-feature fusion.
Background
The term "deep forgery" derives from the face-swapping software Deepfakes and has since been extended to refer to all AI face-swapping technologies realized by computer graphics or deep learning. The abuse of deep forgery technology has brought many negative impacts to society in recent years, making effective identification of deep-forgery videos important. The general flow of deep forgery identification is to first perform face detection on a given video, then extract features from the detected faces, and finally judge from the extracted features whether the given video is deep-forged.
Currently common deep forgery recognition algorithms include FWA (Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Computer Vision Foundation/IEEE, 2019, pp. 46-52) and Xception (F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251-1258). FWA mainly detects the splicing traces produced when the real face is replaced by the fake face in the last step of deep-forgery video generation, while Xception detects the forgery traces produced throughout the deep-forgery generation process.
One difficulty of deep forgery identification is that when the definition of a forged video is low, forgery traces such as splicing traces at face edges, inconsistencies between video frames, and generation traces of the forged face become much harder to find, which greatly increases the difficulty of identification. Current deep forgery identification methods cannot achieve good results on low-definition video.
Disclosure of Invention
The invention mainly solves the technical problem of providing a deep forgery identification method and device, overcoming the poor performance of existing identification methods on low-definition deep-forgery video.
In order to solve the above technical problem, the invention provides a deep forgery identification method based on multi-feature fusion, comprising the following steps:
evenly dividing an input video into a plurality of video segments, randomly sampling a plurality of video frames from each segment, and performing face detection and face alignment on each selected frame to obtain the input video frames;
processing the input video frames with an RGB input stream and a learnable SRM input stream respectively, wherein the RGB input stream extracts semantic features of suspected forged regions in the video frames and obtains a deep forgery prediction from the semantic features, and the learnable SRM input stream fits noise features of the suspected forged regions and obtains a deep forgery prediction from the noise features;
and fusing the prediction of the RGB input stream with the prediction of the learnable SRM input stream to obtain the final deep forgery identification result.
Further, a plurality of video frames are randomly sampled from each video segment, and the PNG format is used where possible when extracting frames, so as to reduce the influence of image compression on tampering traces.
Further, the RGB input stream extracting semantic features of suspected forged regions in the video frames and obtaining a deep forgery prediction from the semantic features comprises:
for the RGB input stream, performing feature extraction on each face-aligned video frame with an Xception network, which extracts the semantic features of suspected forged regions in each frame; finally averaging all extracted features and activating them with a Softmax function to obtain the output of the RGB input stream, the Xception network sharing parameters throughout the process.
Further, the learnable SRM input stream fitting the noise features of suspected forged regions in the video frames and obtaining a deep forgery prediction from the noise features comprises:
for the learnable SRM input stream, first removing the non-differentiable parts of the classical SRM algorithm, namely the round function and the truncate function, then replacing the hyper-parameter q with thirty learnable 5×5 matrices corresponding to the 30 SRM filters of the classical SRM algorithm, each matrix being initialized with all elements equal to the maximum absolute value of the elements in its corresponding SRM filter;
dividing the 30 SRM filters element-wise by their corresponding learnable matrices to obtain a learnable matrix of dimension 30×5×5, expanding it into SRM convolution kernels of dimension 30×3×5×5, inserting these kernels into the original Xception architecture as the first layer of the neural network, and fine-tuning the first layer of the original Xception network to form the learnable SRM network;
for the learnable SRM input stream, performing feature extraction on each face-aligned video frame with the learnable SRM network, which fits and analyzes the noise features of suspected tampered regions in the K×T frames; finally averaging all extracted features and activating them with a Softmax function to obtain the output of the learnable SRM input stream, the learnable SRM network sharing parameters throughout the process.
Further, within each of the RGB input stream and the learnable SRM input stream the network shares parameters, while the networks of the two streams are trained independently.
Furthermore, the learnable matrix is set so that network training does not destroy the essential characteristics of the SRM filters: the learnable matrix preserves the zero-valued elements of the 30 preset SRM filters, thereby preserving their noise-extraction character; and initializing the learnable matrix as above guarantees that all parameters of the SRM convolution kernels are initialized to values in [-1, 1].
Further, the 30×5×5 learnable matrix is expanded into 30×3×5×5 convolution kernels by replicating each 5×5 matrix 3 times along the second (channel) dimension.
Based on the same inventive concept, the invention also provides a deep forgery identification device based on multi-feature fusion using the above method, comprising:
a preprocessing module for evenly dividing the input video into a plurality of video segments, randomly sampling a plurality of video frames from each segment, and performing face detection and face alignment on each selected frame to obtain the input video frames;
a dual-input-stream processing module for processing the input video frames with an RGB input stream and a learnable SRM input stream respectively, wherein the RGB input stream extracts semantic features of suspected forged regions in the video frames and obtains a deep forgery prediction from the semantic features, and the learnable SRM input stream fits noise features of the suspected forged regions and obtains a deep forgery prediction from the noise features;
and a fusion module for fusing the prediction of the RGB input stream with the prediction of the learnable SRM input stream to obtain the final deep forgery identification result.
The invention has the characteristics and beneficial effects that:
the invention adopts a network based on multi-feature fusion to identify the depth-forged video, can simultaneously fit forged traces of the input video on semantic features and noise features, and effectively improves the effect of the existing depth-forged identification method on the low-definition video.
Drawings
FIG. 1: network framework architecture diagram.
FIG. 2: Visualization results for several SRM-stream computation modes.
Detailed Description
The invention provides a deep forgery identification method based on multi-feature fusion, addressing the unsatisfactory performance of existing deep forgery identification algorithms on low-definition deep-forgery video; the overall framework is shown in FIG. 1. Experiments illustrating the effectiveness of the invention are described below.
The experimental data use the lowest-definition version of the FaceForensics++ deep-forgery dataset, which contains 1000 real videos; each real video has 3 corresponding forged videos generated by the Deepfakes, FaceSwap, and Face2Face methods respectively, i.e., 3000 forged videos in total.
The experimental procedure was as follows:
(1) First, the input video V is evenly divided into K segments {v_1, v_2, …, v_K}, and T frames are randomly sampled from each segment v_i. Finally, face detection and face alignment are performed on the selected K×T frames using Dlib (A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," 2018):
I_k = A(v_k)
where I_k denotes an input fed to the identification network; k ∈ [1, K] indexes the K video segments, each containing T frames; and A denotes the face alignment operation.
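By way of illustration, a minimal sketch of this segmentation-and-sampling step, assuming OpenCV for frame decoding; the `align_face` helper is a hypothetical stand-in for the Dlib-based face detection and alignment A:

```python
import random

import cv2  # assumed: OpenCV for frame decoding


def sample_aligned_frames(video_path, K=4, T=4, align_face=None):
    """Evenly split the video into K segments {v_1, ..., v_K}, randomly
    sample T frames from each, and return the K*T face-aligned frames
    I_k = A(v_k). Assumes each segment holds at least T frames."""
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for k in range(K):
        lo, hi = k * n // K, (k + 1) * n // K          # segment v_k
        for idx in sorted(random.sample(range(lo, hi), T)):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                # A(.): face detection + alignment (Dlib in the text above)
                frames.append(align_face(frame) if align_face else frame)
    cap.release()
    return frames
```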
(2) The input video frames are processed separately by an RGB input stream and a learnable SRM input stream based on the classical SRM algorithm (J. Fridrich and J. Kodovsky, "Rich models for steganalysis of digital images," IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868-882, 2012). The RGB input stream takes the aligned faces as input and aims to extract semantic features from the face video frames, while the learnable SRM input stream takes as input the noise maps obtained by applying the SRM filters to the faces, and focuses on fitting the noise features of the face video frames:
I_k^R = I_k
I_k^S = S(I_k)
where I_k^R is the input of the k-th video segment in the RGB stream; I_k^S is the input of the k-th video segment in the learnable SRM stream; and S denotes the learnable SRM filter operation.
(3) For the RGB input stream, features are extracted from each of the K×T face-aligned frames by an Xception network, which extracts the semantic features of suspected forged regions; all extracted features are then averaged and activated by a Softmax function to obtain the output of the RGB stream, i.e., the segment fusion of the RGB stream. The Xception network shares parameters throughout the process:
F_R = Avg(W_R * I^R)
P_R = σ(F_R)
where F_R is the feature of the RGB stream; Avg is the averaging operation; W_R denotes the network parameters of the RGB stream; * denotes the convolution operation; σ is the Softmax operation; and P_R is the prediction vector of the RGB stream.
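A schematic PyTorch rendering of this segment-fusion step; using `timm` as the source of an Xception backbone is an assumption (any Xception implementation would do), and the feature F_R is simplified here to the backbone's final logits:

```python
import timm  # assumed source of an Xception backbone
import torch
import torch.nn as nn


class RGBStream(nn.Module):
    """RGB stream: one Xception with shared parameters processes all
    K*T aligned frames; frame outputs are averaged (F_R) and passed
    through Softmax (P_R)."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = timm.create_model(
            "xception", pretrained=True, num_classes=num_classes)

    def forward(self, x):                  # x: (K*T, 3, H, W)
        logits = self.backbone(x)          # shared weights across frames
        f_r = logits.mean(dim=0)           # Avg over the K*T frames
        return torch.softmax(f_r, dim=0)   # P_R = sigma(F_R)
```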
(4) For the learnable SRM input stream, because the classical SRM algorithm cannot achieve good results on the deep forgery task, a learnable SRM filter is introduced to better fit the face data. To achieve learnability, the non-differentiable parts of the classical SRM algorithm, namely the round function and the truncate function, are first removed. The truncate function is used in the classical SRM algorithm mainly for computing the co-occurrence matrix, which this task does not need, and the round function is no longer important once learnability is introduced, so both non-differentiable parts are eliminated to implement a learnable SRM filter. In order to constrain the learning process of the SRM filter while introducing learnability, a learnable matrix Q is introduced to replace the hyper-parameter q, and the values of the 30 filters of the classical SRM algorithm are kept unchanged. The hyper-parameter q is replaced with thirty learnable 5×5 matrices Q corresponding to the 30 SRM filters, each initialized with all elements equal to the maximum absolute value of the elements in its corresponding SRM filter.
The purpose of the learnable matrix in step (4) is to ensure that network training does not destroy the essential characteristics of the SRM filters: the learnable matrix preserves the zero-valued elements of the 30 preset filters, thereby preserving their noise-extraction character.
This initialization is chosen to ensure that all parameters of the SRM convolution kernels are initialized to values in [-1, 1].
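A minimal sketch of this constrained initialization, where `srm_filters` is assumed to hold the 30 fixed 5×5 kernel value matrices W of the classical SRM algorithm (the constant values themselves are omitted):

```python
import torch
import torch.nn as nn


def init_q(srm_filters: torch.Tensor) -> nn.Parameter:
    """srm_filters: (30, 5, 5) tensor of the fixed classical SRM kernel
    values W. Each learnable matrix Q starts with all elements equal to
    max|W| of its own filter, so every initial weight W/Q lies in [-1, 1]
    and zero entries of W stay zero regardless of Q."""
    q0 = srm_filters.abs().amax(dim=(1, 2), keepdim=True)  # (30, 1, 1)
    return nn.Parameter(q0.expand(-1, 5, 5).clone())       # (30, 5, 5)
```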
(5) The 30 SRM filters are divided element-wise by their corresponding learnable matrices to obtain a learnable matrix of dimension 30×5×5. This matrix is then expanded into an SRM convolution kernel of dimension 30×3×5×5 and inserted into the original Xception architecture as the first layer of the neural network, forming the learnable SRM network:
R = W * X
where R is the output of the SRM filter; X is the input of the SRM input stream; * denotes the convolution operation; and W is the filter matrix of the classical SRM algorithm. The learnable SRM filter is implemented by converting the classical SRM algorithm into a convolution operation and removing the non-differentiable parts.
The 30×5×5 learnable matrix is expanded into the 30×3×5×5 convolution kernel by replicating each 5×5 matrix 3 times along the second (channel) dimension.
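Continuing the sketch above (same hypothetical `srm_filters` constants and the `init_q` helper), the W/Q kernel construction and its channel-wise replication might look as follows; wiring the 30-channel output into an Xception backbone additionally requires fine-tuning its first layer, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableSRMConv(nn.Module):
    """First layer of the learnable SRM network: weight = W / Q, with
    each 5x5 matrix replicated over the 3 input channels."""

    def __init__(self, srm_filters):             # (30, 5, 5) constants
        super().__init__()
        self.register_buffer("w", srm_filters)   # fixed SRM values W
        self.q = init_q(srm_filters)              # learnable matrices Q

    def forward(self, x):                         # x: (N, 3, H, W)
        # W/Q keeps zero-valued SRM elements at zero for any Q.
        kernel = (self.w / self.q).unsqueeze(1)   # (30, 1, 5, 5)
        kernel = kernel.expand(-1, 3, -1, -1).contiguous()  # (30, 3, 5, 5)
        return F.conv2d(x, kernel, padding=2)     # noise maps: (N, 30, H, W)
```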
(6) The learnable SRM input stream is processed analogously to step (3), with the original Xception network of (3) replaced by the learnable SRM network obtained in (5) to fit and analyze the noise features of the suspected forged regions in the K×T frames, yielding the prediction result P_S of the learnable SRM input stream, i.e., the segment fusion of the SRM stream. The learnable SRM network shares parameters throughout the process:
F_S = Avg(W_S * I^S)
P_S = σ(F_S)
where F_S is the feature of the learnable SRM stream; Avg is the averaging operation; W_S denotes the network parameters of the SRM stream; * denotes the convolution operation; σ is the Softmax operation; and P_S is the prediction vector of the learnable SRM stream.
(7) The outputs of (3) and (6) are fused by a learnable linear function to obtain the final prediction:
P = H(P_R, P_S)
where P is the final prediction (i.e., the deep forgery identification result of the video), P_R and P_S are the predictions of the RGB stream and the learnable SRM stream respectively, and H is a linear function.
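The fusion H can be read as a learnable linear map over the two prediction vectors; a minimal sketch (the concatenation-based form is an assumption, as the text specifies only that H is a learnable linear function):

```python
import torch
import torch.nn as nn


class LinearFusion(nn.Module):
    """P = H(P_R, P_S): a learnable linear function over the two
    stream prediction vectors."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.h = nn.Linear(2 * num_classes, num_classes)

    def forward(self, p_r, p_s):
        return self.h(torch.cat([p_r, p_s], dim=-1))
```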
The following tests were carried out, with identification accuracy as the evaluation metric:
Acc = (TP + TN) / (TP + TN + FP + FN)
where TP denotes true positives, FP false positives, FN false negatives, and TN true negatives.
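In code, the metric is simply:

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    # Acc = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)
```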
For the data, videos numbered 1-720 in FaceForensics++ (720 real videos and their corresponding 2160 forged videos) serve as the training set, videos numbered 721-960 as the validation set, and videos numbered 961-1000 as the test set.
Training was run for 120 epochs following the experimental procedure above, and the model performing best on the validation set was selected for testing. Evaluated on the test set, it achieved an accuracy of 90.36%.
To demonstrate the effectiveness of the method, two comparison experiments were conducted. The first verifies whether learnability in the SRM stream improves model accuracy. Since the method also imposes a constraint on the learning process of the SRM filter when introducing learnability, three configurations are compared: non-learnable, unconstrained learnable, and constrained learnable. This experiment uses Xception as the feature extraction network and reports the accuracy of the SRM stream alone. The results: the non-learnable SRM stream achieves 78.21% accuracy, the unconstrained learnable SRM stream 85.71%, and the constrained learnable SRM stream 90.00%.
The second comparison experiment verifies whether multi-feature fusion improves identification accuracy, and whether the fusion scheme is effective for various feature extraction networks. ResNet-101, LightCNN, and Xception are selected as feature extraction networks and tested with single RGB features and with multi-feature fusion. With single RGB features, ResNet-101 achieves 85.71% accuracy, LightCNN 86.43%, and Xception 87.86%; with multi-feature fusion, ResNet-101 achieves 88.21% (+2.50), LightCNN 87.86% (+1.43), and Xception 90.36% (+2.50).
The output of the learnable SRM stream can also be visualized. The visualization confirms that the noise maps generated by the learnable SRM filter accurately reflect the forged regions of the input video frames, and that the constrained learnable SRM stream proposed by the invention generates better noise maps than the non-learnable and unconstrained variants. The visualization results are shown in FIG. 2.
In other embodiments of the present invention, the Xception network selected in step (3) and step (5) may be replaced by other identification networks.
Based on the same inventive concept, another embodiment of the present invention provides a deep forgery identification apparatus based on multi-feature fusion using the above method, comprising:
a preprocessing module for evenly dividing the input video into a plurality of video segments, randomly sampling a plurality of video frames from each segment, and performing face detection and face alignment on each selected frame to obtain the input video frames;
a dual-input-stream processing module for processing the input video frames with an RGB input stream and a learnable SRM input stream respectively, wherein the RGB input stream extracts semantic features of suspected forged regions in the video frames and obtains a deep forgery prediction from the semantic features, and the learnable SRM input stream fits noise features of the suspected forged regions and obtains a deep forgery prediction from the noise features;
and a fusion module for fusing the prediction of the RGB input stream with the prediction of the learnable SRM input stream to obtain the final deep forgery identification result.
For the specific implementation of each module, refer to the description of the method above.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent structures or equivalent flow transformations made using the contents of this specification and drawings, whether applied directly or indirectly in other related arts, are included within the claimed scope of the present invention.

Claims (10)

1. A deep forgery identification method based on multi-feature fusion, characterized by comprising the following steps:
evenly dividing an input video into a plurality of video segments, randomly sampling a plurality of video frames from each segment, and performing face detection and face alignment on each selected frame to obtain the input video frames;
processing the input video frames with an RGB input stream and a learnable SRM input stream respectively, wherein the RGB input stream extracts semantic features of suspected forged regions in the video frames and obtains a deep forgery prediction from the semantic features, and the learnable SRM input stream fits noise features of the suspected forged regions and obtains a deep forgery prediction from the noise features;
and fusing the prediction of the RGB input stream with the prediction of the learnable SRM input stream to obtain the final deep forgery identification result.
2. The method according to claim 1, wherein a plurality of video frames are randomly sampled from each video segment, and the PNG format is used where possible when extracting the frames, so as to reduce the influence of image compression on tampering traces.
3. The deep forgery identification method based on multi-feature fusion of claim 1, wherein the RGB input stream extracting semantic features of suspected forged regions in the video frames and obtaining a deep forgery prediction from the semantic features comprises:
for the RGB input stream, performing feature extraction on each face-aligned video frame with an Xception network, which extracts the semantic features of suspected forged regions in each frame; finally averaging all extracted features and activating them with a Softmax function to obtain the output of the RGB input stream, the Xception network sharing parameters throughout the process.
4. The deep forgery identification method based on multi-feature fusion of claim 1, wherein the learnable SRM input stream fitting the noise features of suspected forged regions in the video frames and obtaining a deep forgery prediction from the noise features comprises:
for the learnable SRM input stream, first removing the non-differentiable parts of the classical SRM algorithm, namely the round function and the truncate function, then replacing the hyper-parameter q with thirty learnable 5×5 matrices corresponding to the 30 SRM filters of the classical SRM algorithm, each matrix being initialized with all elements equal to the maximum absolute value of the elements in its corresponding SRM filter;
dividing the 30 SRM filters element-wise by their corresponding learnable matrices to obtain a learnable matrix of dimension 30×5×5, expanding it into SRM convolution kernels of dimension 30×3×5×5, inserting these kernels into the original Xception architecture as the first layer of the neural network, and fine-tuning the first layer of the original Xception network to form the learnable SRM network; for the learnable SRM input stream, performing feature extraction on each face-aligned video frame with the learnable SRM network, which fits and analyzes the noise features of suspected tampered regions in the K×T frames; finally averaging all extracted features and activating them with a Softmax function to obtain the output of the learnable SRM input stream; the learnable SRM network shares parameters throughout the process.
5. The deep forgery identification method based on multi-feature fusion of claim 3 or 4, wherein within each of the RGB input stream and the learnable SRM input stream the network shares parameters, while the networks of the two streams are trained independently.
6. The deep forgery identification method based on multi-feature fusion of claim 4, wherein the learnable matrix is set so that network training does not destroy the essential characteristics of the SRM filters: the learnable matrix preserves the zero-valued elements of the 30 preset SRM filters, thereby preserving their noise-extraction character; and initializing the learnable matrix guarantees that all parameters of the SRM convolution kernels are initialized to values in [-1, 1].
7. The deep forgery identification method based on multi-feature fusion of claim 4, wherein the 30×5×5 learnable matrix is expanded into the 30×3×5×5 convolution kernels by replicating each 5×5 matrix 3 times along the second (channel) dimension.
8. A deep forgery identification device based on multi-feature fusion, using the method of any one of claims 1 to 7 and comprising:
a preprocessing module for evenly dividing the input video into a plurality of video segments, randomly sampling a plurality of video frames from each segment, and performing face detection and face alignment on each selected frame to obtain the input video frames;
a dual-input-stream processing module for processing the input video frames with an RGB input stream and a learnable SRM input stream respectively, wherein the RGB input stream extracts semantic features of suspected forged regions in the video frames and obtains a deep forgery prediction from the semantic features, and the learnable SRM input stream fits noise features of the suspected forged regions and obtains a deep forgery prediction from the noise features;
and a fusion module for fusing the prediction of the RGB input stream with the prediction of the learnable SRM input stream to obtain the final deep forgery identification result.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202110473432.4A 2021-04-29 2021-04-29 Deep forgery identification method and device based on multi-feature fusion Pending CN114067381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110473432.4A CN114067381A (en) 2021-04-29 2021-04-29 Deep forgery identification method and device based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN114067381A true CN114067381A (en) 2022-02-18

Family

ID=80233204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110473432.4A Pending CN114067381A (en) 2021-04-29 2021-04-29 Deep forgery identification method and device based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN114067381A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205763A (en) * 2022-09-09 2022-10-18 阿里巴巴(中国)有限公司 Video processing method and device
CN117014561A (en) * 2023-09-26 2023-11-07 荣耀终端有限公司 Information fusion method, training method of variable learning and electronic equipment
CN117014561B (en) * 2023-09-26 2023-12-15 荣耀终端有限公司 Information fusion method, training method of variable learning and electronic equipment

Similar Documents

Publication Publication Date Title
Li et al. Identification of deep network generated images using disparities in color components
CN111311563B (en) Image tampering detection method based on multi-domain feature fusion
Yuan et al. Fingerprint liveness detection using an improved CNN with image scale equalization
Kong et al. Detect and locate: Exposing face manipulation by semantic-and noise-level telltales
CN112907598B (en) Method for detecting falsification of document and certificate images based on attention CNN
CN114067381A (en) Deep forgery identification method and device based on multi-feature fusion
Zhu et al. Deepfake detection with clustering-based embedding regularization
Mazumdar et al. Universal image manipulation detection using deep siamese convolutional neural network
Fu et al. Robust GAN-face detection based on dual-channel CNN network
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
Velliangira et al. A novel forgery detection in image frames of the videos using enhanced convolutional neural network in face images
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
Le-Tien et al. Image forgery detection: A low computational-cost and effective data-driven model
Oraibi et al. Enhancement digital forensic approach for inter-frame video forgery detection using a deep learning technique
CN114842524A (en) Face false distinguishing method based on irregular significant pixel cluster
Jin et al. AMFNet: an adversarial network for median filtering detection
Singh et al. Performance analysis of ELA-CNN model for image forgery detection
CN111178204B (en) Video data editing and identifying method and device, intelligent terminal and storage medium
CN111127407B (en) Fourier transform-based style migration forged image detection device and method
Hammad et al. An secure and effective copy move detection based on pretrained model
Rahmati et al. Double JPEG compression detection and localization based on convolutional auto-encoder for image content removal
Liu et al. Adaptive Texture and Spectrum Clue Mining for Generalizable Face Forgery Detection
Chen et al. Identification of image global processing operator chain based on feature decoupling
CN113392786A (en) Cross-domain pedestrian re-identification method based on normalization and feature enhancement
Lebedev et al. Face detection algorithm based on a cascade of ensembles of decision trees

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination