AU2019101186A4 - A Method of Video Recognition Network of Face Tampering Based on Deep Learning

Info

Publication number: AU2019101186A4
Application number: AU2019101186A
Authority: AU
Inventors: Zhongliang Guo; Dian JIA; Zhaokai Wang; Jiahang WU; Yongqi Zhou
Original assignee: Zhou Yongqi Miss
Current assignee: Zhou Yongqi Miss
Priority date: 2019-10-02
Filing date: 2019-10-02
Publication date: 2020-01-23
Anticipated expiration: 2027-10-02

Abstract

This invention presents a model, based on a Convolutional Neural Network with Inception modules, for detecting face tampering such as Deepfake in videos. Since traditional forensic techniques perform poorly on videos, we propose a deep learning architecture trained on FaceForensics 100++. By adjusting the hyperparameters of the network and trying different optimization algorithms, the network achieves its best performance with an accuracy of 94.5% on the test set. This method can be applied to identify fake news videos, thus helping viewers avoid being deceived by them.

[Figure 1: Input 256x256x3 -> Inception Module 1, batch normalization, max pooling 2x2 (128x128x11) -> Inception Module 2, batch normalization, max pooling 2x2 (64x64x12) -> convolution 16x(5x5) + ReLU, batch normalization, max pooling 2x2 (32x32x16) -> convolution 16x(5x5) + ReLU, batch normalization, max pooling 4x4 (8x8x16) -> 1024 features, dropout 0.5 -> fully connected 16 + LeakyReLU (16 features), dropout 0.5 -> fully connected 1 -> classification result]

Description

A Method of Video Recognition Network of Face Tampering Based on Deep Learning

FIELD OF THE INVENTION

This invention is in the field of digital signal processing and serves as classification of real and Deepfake videos powered by deep learning.

BACKGROUND

With the rapid development of artificial intelligence, various techniques using deep learning have appeared. One technique, Deepfake, can alter the contents of a video by replacing the targeted face with another face. Because it is simple to operate and produces impressive results, it has become very popular among Internet users, leading to a large number of fake videos on the Internet. Some of them are difficult to distinguish from real ones, which can confuse people and lead to misunderstanding. There is therefore an urgent need for advanced video recognition techniques.

2019101186 23 Dec 2019

Currently most image forensics techniques are traditional and cannot be applied to videos because of video compression and degradation. Some researchers, however, have come up with image forensics techniques using deep learning, which can determine whether an image has been edited by software such as Photoshop. Their methods inspire us to extend them to video recognition.

In this invention, we use PyTorch as the deep learning framework to implement the model. The model is based on a Convolutional Neural Network with Inception modules. After data preprocessing, we feed the training set into the convolutional neural network in batches. Throughout the training process, the program optimizes the weights and biases across all layers of the neural network with the goal of minimizing the loss function. By repeatedly adjusting hyperparameters such as the initial learning rate and decay rate, the model reaches its best performance and recognizes the test set with 94.5% accuracy.

SUMMARY

In order to distinguish fake videos from real ones, this invention proposes a video recognition method for face tampering based on deep learning. Using multi-layer Convolutional Neural Networks with Inception modules and adopting the layer-by-layer initializing training mode, the method takes full advantage of binary classification in deep learning, which in this case decides whether a video has been edited by Deepfake. This invention significantly improves the test set accuracy and overcomes some of the technical difficulties of the training process, such as overfitting.

The framework of our deep learning video recognition method for face tampering includes: building image database, construction of convolutional neural networks, parameter optimization, and implementation of recognition.

To build the image database, we gather real and Deepfake videos from FaceForensics 100++ and transform them into approximately 20000 images in total. We then crop rectangular face images using the detector function from the dlib module. Next, we resize the images to 256x256 and label them “real” or “df” with the ImageFolder function, thus building the entire database, which is later divided into a train set, a valid set and a test set in proportions of 8:1:1.

Our convolutional neural network is a sequence of layers. Figure 1 displays the architecture of our network. The network begins with two Inception modules, each of which consists of four parallel convolutional branches, followed by two normal convolutional layers. These four layers are each activated by the ReLU function and followed by batch normalization and max pooling. Then we have two fully connected layers with a dropout rate of 50%. The last part of the network is a sigmoid activation function, converting the input into a result between 0 and 1, which can be interpreted as a probability for binary classification.

During training, we feed the data set into the network in batches. The ultimate goal is to minimize the loss function. As the optimization algorithm, we choose SGD together with a schedule that gradually decreases the learning rate. To improve recognition accuracy, we use batch normalization and dropout regularization to reduce overfitting. We try different hyperparameters to obtain the best result.

Finally, the classifier is able to identify fake videos and the result will be presented. We also design an application which can mark the real and fake faces in a given video.

DESCRIPTION OF DRAWING

Figure 1 shows the data flow of our convolutional neural network.

Figure 2 shows the data flow of Inception modules.

Figure 3 shows the procedure of our project.

Figure 4 shows the structure of layer 1.

Figure 5 shows the structure of layer 2.

Figure 6 shows the structures of layer 3 to layer 6.

Figure 7 shows the accuracy and loss of the optimal network.

Figure 8 shows the principle of convolution.

Figure 9 shows the graph of ReLU function.

Figure 10 shows the principle of Max Pooling.

Figure 11 shows the principle of drop-out algorithm.

DESCRIPTION OF PREFERRED EMBODIMENT

Network Design

The structure of our neural network is shown in Figure 1. We use an Inception network as our model. The network has two Inception layers followed by two normal convolutional layers.

Before we start, we have some definitions that need to be explained in advance.

1) Stride: Stride controls how the filter convolves around the input volume.

2) Zero-padding: pads the input volume with zeros around the border. It has two modes. “Valid” means no padding. “Same” means the output image is the same size as the input image. In this program we use “Same” mode.

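The general formula for the size of the convolution output image is

$$\left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right),$$

where n is the size of the input image, p is the zero-padding, f is the size of the filter and s is the step size (stride).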

3) Inception Layer: The original paper introducing and describing the Inception architecture is "Going Deeper with Convolutions". The key idea of this architecture is to deploy multiple convolutions with multiple convolution kernels and pooling layers simultaneously in parallel within the same layer (the inception layer). In this program we use Inception layers.

4) Batch normalization: Batch normalization is a technique for improving the speed, performance, and stability of artificial neural networks. Batch normalization was introduced in a 2015 paper. It is used to normalize the input layer by adjusting and scaling the activations.
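The output-size formula given under the zero-padding definition can be checked numerically. The following is a minimal sketch; the function names `conv_output_size` and `same_padding` are illustrative and do not come from the original program, and `same_padding` assumes an odd filter size with stride 1.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the size unchanged for an odd filter size at stride 1."""
    return (f - 1) // 2


# 1x1, 3x3 and 5x5 kernels with "Same" padding keep the spatial size.
print(conv_output_size(256, f=1, p=same_padding(1)))
print(conv_output_size(256, f=3, p=same_padding(3)))
print(conv_output_size(64, f=5, p=same_padding(5)))

# A 2x2 max pooling with stride 2 follows the same formula and halves the size.
print(conv_output_size(256, f=2, p=0, s=2))
```

The same function covers pooling layers because pooling slides a window of size f with stride s in the same way a convolution does.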

(1) Inception Layer 1

The input data of the first Inception layer is the original [256x256x3] image, which is convolved by 4 branches simultaneously in parallel.

The input image of the first branch is convolved by one [1x1x3] convolution kernel, which generates one new pixel at each position of the original image. The convolution kernel moves along both the x-axis and the y-axis of the original image with a step size of 1 pixel. Because the kernel size is 1x1, the generated image is still [256x256] pixels, but the number of channels changes from three to one. The goal is to reduce the number of channels.

In the second branch, the input image of the first convolution layer is convolved by four [1x1x3] convolution kernels. As in the first branch, each kernel generates [256x256x1] pixel data, and since there are 4 convolution kernels the depth is 4. The input of the second convolution layer is the output of the first convolution layer, which is a [256x256x4] pixel data set, and it is convolved by 4 [3x3x4] convolution kernels. We use the ‘same’ mode of zero-padding to keep the output image the same size as the input image, so the generated image is [256x256] pixels. There are 4 kernels, so the depth is 4.

The input image of the third branch is the same as the second branch. And the result will also be [256x256x4] pixels.

In the fourth branch, the input image of the first convolution layer is convolved by two [1x1x3] convolution kernels. The convolution process is the same as before. The generated image is again [256x256] pixels; there are 2 kernels, so the depth is 2. The input of the second convolution layer is the output of the first convolution layer, which is a [256x256x2] pixel data set, and it is convolved by 2 [3x3x2] convolution kernels. Again we use the ‘same’ mode of zero-padding to keep the output image the same size as the input image, so the generated image is [256x256] pixels. There are 2 kernels, so the depth is 2.

Finally we concatenate all the output results, so the depth is 1+4+4+2=11. We batch normalize the resulting pixel data and then put it into the first max-pooling layer.

The max-pooling layer reduces the spatial size of the data, which reduces the number of parameters and the amount of computation in the network. It also helps to reduce overfitting.

The final output is therefore [128x128x11] data.
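The branch arithmetic of Inception layer 1 can be written down as a small bookkeeping sketch. It only counts channels and weights and is not the network itself. Two assumptions are made that the text does not state: branch 3 is taken to have the same kernels as branch 2, and every kernel is taken to carry one bias term. All names are illustrative.

```python
# Each branch is a sequence of (number_of_kernels, kernel_size) convolutions.
# Assumption: branch 3 mirrors branch 2 kernel for kernel.
INCEPTION_1_BRANCHES = [
    [(1, 1)],
    [(4, 1), (4, 3)],
    [(4, 1), (4, 3)],
    [(2, 1), (2, 3)],
]


def branch_summary(branch, in_channels, with_bias=True):
    """Return (output_depth, parameter_count) for one branch."""
    params = 0
    channels = in_channels
    for kernels, size in branch:
        params += kernels * size * size * channels  # weights of this convolution
        if with_bias:
            params += kernels  # assumed: one bias per kernel
        channels = kernels  # the next convolution sees this many channels
    return channels, params


def inception_summary(branches, in_channels):
    """Depth after concatenating all branches, and total parameter count."""
    summaries = [branch_summary(b, in_channels) for b in branches]
    depth = sum(d for d, _ in summaries)
    params = sum(p for _, p in summaries)
    return depth, params


depth, params = inception_summary(INCEPTION_1_BRANCHES, in_channels=3)
print(depth, params)
```

Under these assumptions the concatenated depth matches the 1+4+4+2=11 stated in the text, and the layer is very small in terms of weights because the 1x1 convolutions reduce the channel count before each 3x3 convolution.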

(2) Inception Layer 2

The input of Inception layer 2 is the output of Inception layer 1, which is convolved by 4 branches simultaneously in parallel, as in Inception layer 1. The input of the first branch is convolved by two [1x1x11] convolution kernels, each of which generates one new pixel at each position of its input.

In the second branch, the input of the first convolution layer is convolved by four [1x1x11] convolution kernels. Each kernel generates [128x128x1] pixel data, and since there are 4 convolution kernels the depth is 4. The input of the second convolution layer is the output of the first convolution layer, which is a [128x128x4] pixel data set, and it is convolved by 4 [3x3x4] convolution kernels. There are again 4 convolution kernels, so the depth is 4.

The input image of the third branch is the same as the second branch. And the result will also be [128x128x4] pixels.

In the fourth branch, the input of the first convolution layer is convolved by two [1x1x11] convolution kernels. The input of the second convolution layer is the output of the first convolution layer, which is a [128x128x2] pixel data set. Again we use the ‘same’ mode of zero-padding to keep the output image the same size as the input image, so the generated image is [128x128] pixels. There are 2 kernels, so the depth is 2.

Finally we concatenate all the output results, so the depth is 2+4+4+2=12. We batch normalize the resulting pixel data and then put it into the second max-pooling layer. The final output is therefore [64x64x12] data.

(3) Convolutional Layer 1

The input data of the first convolutional layer is the output of the second Inception layer, which is [64x64x12] pixel data, and it is convolved by 16 [5x5x12] convolution kernels. There are 16 kernels, so the depth is 16. We use zero-padding to keep the output image the same size as the input image. The result is processed by the ReLU unit, and its size is still [64x64x16]. Then we batch normalize the data and put it into a 2x2 max-pooling layer. The final output is therefore [32x32x16] data.

(4) Convolutional Layer 2

The input data of the second convolutional layer is the preceding pooling layer's output, which is [32x32x16] pixel data. It is convolved by 16 [5x5x16] convolution kernels. Following the same calculation steps as Convolutional Layer 1, the result is processed by the ReLU unit, and its size is still [32x32x16]. Then we batch normalize the data and put it into a 4x4 max-pooling layer. The final output is therefore [8x8x16] data.
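The shapes stated for the four feature-extraction layers can be traced end to end with a few lines of arithmetic. This is a sketch of the shape bookkeeping only, assuming that every convolution uses "Same" padding and that each pooling window equals its stride; the helper names are illustrative.

```python
def conv_same(shape, out_channels):
    """A 'Same'-padded convolution keeps height and width, changes only depth."""
    height, width, _ = shape
    return (height, width, out_channels)


def max_pool(shape, k):
    """A k x k max pooling with stride k divides height and width by k."""
    height, width, channels = shape
    return (height // k, width // k, channels)


def trace_shapes(input_shape=(256, 256, 3)):
    """Return the output shape after each of the four layers."""
    stages = []
    x = max_pool(conv_same(input_shape, 1 + 4 + 4 + 2), 2)
    stages.append(("inception layer 1", x))
    x = max_pool(conv_same(x, 2 + 4 + 4 + 2), 2)
    stages.append(("inception layer 2", x))
    x = max_pool(conv_same(x, 16), 2)
    stages.append(("convolutional layer 1", x))
    x = max_pool(conv_same(x, 16), 4)
    stages.append(("convolutional layer 2", x))
    return stages


for name, shape in trace_shapes():
    print(name, shape)

height, width, channels = trace_shapes()[-1][1]
flattened = height * width * channels
print("flattened features:", flattened)
```

The last line gives the length of the vector that is passed to the fully connected part of the network.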

(5) Fully Connected Layer

After finishing convolution, we reshape the image matrix from [batch x 8 x 8 x 16] to [batch x 1024]. Then we drop out half of the units to reduce overfitting.

Fully connected layer 1 takes the 1024 features as input, and each of its nodes is fully connected to all activations in the input. Its output is the high-level feature of the input image (16 features in Figure 1). We then apply the LeakyReLU activation function and use dropout again to reduce overfitting. The data is then passed to the final fully connected layer, whose output, after the sigmoid activation, is a value between 0 and 1 that is interpreted as the probability used to classify the picture.
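The two operations used around the fully connected layers, dropout and LeakyReLU, can be illustrated on a plain list of numbers. This is a minimal sketch rather than the original code: the negative slope of 0.01 is an assumed value (the text does not give one), and the dropout shown is the "inverted" variant that rescales the kept units during training so that nothing needs to change at test time.

```python
import random


def leaky_relu(x, negative_slope=0.01):
    """Pass positive values through; scale negative values by a small slope."""
    return x if x >= 0 else negative_slope * x


def dropout(values, p, training, rng):
    """Zero each value with probability p during training and rescale the rest."""
    if not training or p == 0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
features = [1.0, -2.0, 3.0, -4.0]
activated = [leaky_relu(v) for v in features]
print(dropout(activated, p=0.5, training=True, rng=rng))
print(dropout(activated, p=0.5, training=False, rng=rng))
```

With a dropout rate of 0.5 the kept activations are doubled during training, which keeps their expected value the same as at test time, when all units are used.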

Procedure

Step 1: Data Acquisition

(1) videos to images

In this project, we set up the original database of 200 videos from FaceForensics, half of which are real footage while the other half are fake videos generated or falsified by Deepfake. The length of each video varies from a few seconds to a few minutes. In order to obtain image data for training, we use the VideoCapture function to capture frames from the videos and save them as jpg images. Finally, we build an image database consisting of approximately 10000 real pictures and 10000 fake ones.
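A frame-extraction step of this kind might look as follows. This is a sketch that assumes the VideoCapture function is the one from OpenCV (`cv2`); the helper names, the file naming scheme and the `every_n` sampling parameter are hypothetical, since the text does not say how many frames are kept per video. OpenCV is imported inside the function so that the naming helper can be used without it.

```python
from pathlib import Path


def frame_path(out_dir, video_stem, index):
    """Build the output file name for one frame (hypothetical naming scheme)."""
    return Path(out_dir) / f"{video_stem}_{index:06d}.jpg"


def extract_frames(video_path, out_dir, every_n=1):
    """Save every `every_n`-th frame of a video as a jpg; return the number saved."""
    import cv2  # assumed dependency: opencv-python

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(str(video_path))
    if not capture.isOpened():
        raise IOError(f"cannot open video: {video_path}")
    saved = 0
    index = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of the video or a read error
                break
            if index % every_n == 0:
                target = frame_path(out, Path(video_path).stem, index)
                cv2.imwrite(str(target), frame)
                saved += 1
            index += 1
    finally:
        capture.release()
    return saved
```

Keeping the video name in each file name makes it possible to tell later which frames came from the same video.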

(2) images to faces

To train our network directly with images of faces, we extract faces from the image database. Firstly, we read images from folders and transform them into grayscale images. Then we detect the faces using frontal_face_detector function from dlib module as the feature extractor, and cut the images based on the face parts detected by the function. Finally the original database is transformed into a face image database of 10000 real faces and 10000 Deepfake faces.
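The face-cropping step described above can be sketched as follows, assuming OpenCV for image reading and grayscale conversion and dlib's `get_frontal_face_detector` for detection. The function names are illustrative, and clamping the detected box to the image borders is an added safeguard that the text does not mention.

```python
def clamp_box(left, top, right, bottom, width, height):
    """Clip a detected rectangle so that it lies inside the image."""
    return max(0, left), max(0, top), min(width, right), min(height, bottom)


def crop_faces(image_path):
    """Return a list of face crops (as image arrays) found in one image."""
    import cv2   # assumed dependency: opencv-python
    import dlib  # assumed dependency: dlib

    image = cv2.imread(str(image_path))
    if image is None:
        raise IOError(f"cannot read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    detector = dlib.get_frontal_face_detector()
    height, width = gray.shape
    faces = []
    # The second argument upsamples the image once before detection.
    for rect in detector(gray, 1):
        left, top, right, bottom = clamp_box(
            rect.left(), rect.top(), rect.right(), rect.bottom(), width, height
        )
        if right > left and bottom > top:
            faces.append(image[top:bottom, left:right])
    return faces
```

Detection runs on the grayscale image, while the crop is taken from the original color image so that the network still receives three channels.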

(3) dividing the dataset

The dataset is divided into train set, valid set and test set in proportions of 8:1:1, each of which contains half real images and half fake ones which are saved in two separate folders.
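An 8:1:1 split of this kind can be sketched in a few lines. The function name and the fixed random seed are illustrative assumptions; applying the function separately to the real images and to the fake images keeps each subset half real and half fake.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle the items and split them into train, valid and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_valid = len(items) * ratios[1] // total
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


real_train, real_valid, real_test = split_dataset(range(10000))
fake_train, fake_valid, fake_test = split_dataset(range(10000), seed=1)
print(len(real_train), len(real_valid), len(real_test))
```

Whether the original split was made per image or per video is not stated. Splitting per video would avoid near-identical frames of one video appearing in both the train set and the test set.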

Step 2: Data Preprocessing

We use the transforms functions to preprocess and augment the image data. For machine learning, large images lead to long training times, while small images lead to a decrease in accuracy. Therefore, we resize the images to 256x256 and set the network structure accordingly. The images are then randomly cropped by RandomCrop with padding=4 in order to augment the data. Next, the pixel intensities between 0 and 255 are converted into values between 0 and 1. After normalization, we finally obtain processed data that can be fed into the network and trained on effectively. We set a medium batch size in the DataLoader, so that training is reasonably fast while GPU memory is used properly. The data is shuffled to avoid training successively on images from the same video.
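What a random crop with padding does can be shown on a small single-channel image stored as a list of rows. This is a pure-Python illustration of the idea, not the library call itself; it assumes the behavior of torchvision's RandomCrop, which pads the border with zeros and then cuts a window of the requested size at a random position. The function names are illustrative.

```python
import random


def random_crop_with_padding(image, size, padding, rng):
    """Zero-pad a 2-D image (list of rows) and cut a random size x size window."""
    width = len(image[0]) + 2 * padding
    blank = [0] * width
    padded = (
        [blank[:] for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [blank[:] for _ in range(padding)]
    )
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in padded[top:top + size]]


def to_unit_range(pixel):
    """Convert an intensity in 0..255 to a value in 0..1."""
    return pixel / 255.0


rng = random.Random(0)
image = [[255] * 8 for _ in range(8)]  # toy 8x8 image standing in for 256x256
crop = random_crop_with_padding(image, size=8, padding=4, rng=rng)
scaled = [[to_unit_range(p) for p in row] for row in crop]
print(len(scaled), len(scaled[0]))
```

The crop has the same size as the original image but is shifted by up to `padding` pixels in each direction, so the network sees slightly different versions of the same face in different epochs.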

Step 3: Training and Optimization

To optimize the parameter set, we feed the data into the network for training in batches. For each batch, we reset the gradients to zero and calculate the predicted value through forward propagation. We then evaluate the loss function between the predicted value and the ground truth, compute the gradients through backward propagation, and update the parameters. After each epoch of training, during which all training data has been used once, the network is validated with the valid set.
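The per-batch sequence described in this step can be sketched as one function. The sketch is written against the usual PyTorch interfaces (a model that can be called and has `train()`, an optimizer with `zero_grad()` and `step()`, and a loss value with `backward()` and `item()`), but it does not import PyTorch itself; the function name and the returned average loss are illustrative choices.

```python
def train_one_epoch(model, loader, criterion, optimizer):
    """Run one pass over the training data and return the mean batch loss."""
    model.train()  # enable training behavior of dropout and batch normalization
    total_loss = 0.0
    batches = 0
    for inputs, labels in loader:
        optimizer.zero_grad()              # reset gradients from the last batch
        outputs = model(inputs)            # forward propagation
        loss = criterion(outputs, labels)  # compare prediction and ground truth
        loss.backward()                    # backward propagation
        optimizer.step()                   # update weights and biases
        total_loss += loss.item()
        batches += 1
    return total_loss / max(batches, 1)
```

Resetting the gradients at the start of every batch matters because PyTorch accumulates gradients across calls to `backward()` by default.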

About loss function:

In the backward propagation of neural networks, the loss function measures the error between the ground truth and the predicted value. We choose the Cross Entropy Loss function for two reasons:

(1) Its derivative has a simpler form.

(2) It has a relatively high learning speed.
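The first reason can be illustrated for the case of a single sigmoid output, which is assumed here to be the binary form of the cross entropy loss used by the network. With prediction a = sigmoid(z) and label y, the derivative of the loss with respect to z reduces to a - y. The sketch below checks this against a numerical derivative; all names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y, a, eps=1e-12):
    """Loss for label y in {0, 1} and predicted probability a."""
    a = min(max(a, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))


def grad_wrt_logit(y, z):
    """Analytic derivative of the loss with respect to the pre-sigmoid value z."""
    return sigmoid(z) - y


z, y, h = 0.3, 1.0, 1e-6
numeric = (
    binary_cross_entropy(y, sigmoid(z + h)) - binary_cross_entropy(y, sigmoid(z - h))
) / (2 * h)
print(numeric, grad_wrt_logit(y, z))
```

Because the gradient is simply the prediction error, it stays large when the prediction is badly wrong, which is consistent with the second reason given above.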

About optimizing algorithms:

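In choosing the optimizing algorithm, we first try the prevailing Adam algorithm, which keeps exponentially weighted averages of not only the previous squared gradients v_t, but also the previous gradients m_t, similar to the momentum algorithm:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Its gradient descent method is as follows:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, m_t$$

where $g_t$ is the gradient at step $t$, $\beta_1$ and $\beta_2$ are the decay rates of the two averages, $\eta$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero.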
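The Adam update described above can be written for a single scalar parameter as follows. This sketch follows the averages and the update as they are given in this description and omits the bias-correction terms that the standard Adam algorithm also applies. The default values are the commonly used ones and are assumptions, since the text does not state which values were tried.

```python
import math


def adam_step(theta, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns the new (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # average of squared gradients
    theta = theta - lr * m / (math.sqrt(v) + eps)
    return theta, m, v


# Toy example: minimize J(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for _ in range(100):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v)
print(theta)
```

Dividing by the square root of v makes the step size depend on the recent magnitude of the gradient rather than on the learning rate alone.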

However, this algorithm does not perform well in training and testing. Due to variation in the data, the accuracy fluctuates greatly after several epochs and does not converge. Therefore, we decide to use the SGD algorithm with a learning rate decay strategy:

$$\theta = \theta - \eta \nabla_\theta J(\theta)$$

where $J$ is the loss function and $\eta$ is the learning rate.
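A step-wise learning rate decay combined with the SGD update can be sketched as follows. The example uses the Group 2 settings from Table 1 (initial learning rate 0.03, decay step size 6, decay rate 0.2). Interpreting these settings as "multiply the learning rate by the decay rate every `step_size` epochs" is an assumption, as are the function names.

```python
def step_decay_lr(initial_lr, epoch, step_size, decay_rate):
    """Learning rate after `epoch` epochs with a decay every `step_size` epochs."""
    return initial_lr * decay_rate ** (epoch // step_size)


def sgd_step(theta, grad, lr):
    """Plain gradient descent update for a scalar parameter."""
    return theta - lr * grad


for epoch in (0, 5, 6, 11, 12):
    print(epoch, step_decay_lr(0.03, epoch, step_size=6, decay_rate=0.2))

# One update on the toy loss J(theta) = theta^2 at the initial learning rate.
theta = sgd_step(1.0, grad=2.0, lr=step_decay_lr(0.03, 0, 6, 0.2))
print(theta)
```

Under this interpretation the learning rate stays at 0.03 for the first six epochs, then drops to about 0.006 and later to about 0.0012, so that large early steps are followed by finer adjustments.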

Step 4: Validation

After each epoch of training, we use the valid set to validate the network. The procedure is similar to the first part of training, but without backward propagation and parameter updates. Each batch of the valid set goes through forward propagation to calculate the loss function and accuracy, which help us to adjust the hyperparameters of the network.

Step 5: Iteration

Repeating Step 3 and Step 4 over a large amount of data while decreasing the learning rate every several epochs, we finally obtain an optimal network with high accuracy.

Step 6: Data Visualization and Saving Network Model

We use tensorboard to visualize the loss function and accuracy of each epoch, so as to see the results more directly and adjust the hyperparameters. Parameters and gradients of the network are saved with torch.save function so that we can load and test the network model when it achieves optimal accuracy.

Step 7: Testing and Application

The optimal network is tested with the test set to give a final result. Our network shows a test set accuracy of 94.5% (Group 2 in Table 1), which satisfies our requirement. We apply our network to videos on the Internet to determine whether they are forged by Deepfake. We also create an application which can mark the real and fake faces in the video.

Table 1 Recognition Result

Group	Initial learning rate (LR)	Random Crop Padding	LR decay step size	LR decay rate	Loss	Accuracy
1	0.1	4	6	0.2	0.23	91.3%
2	0.03	4	6	0.2	0.15	94.5%
3	0.1	NO CROPPING	6	0.2	0.24	91.1%
4	0.01	8	6	0.2	0.22	91.2%
5	0.03	8	8	0.2	0.19	93.0%

Claims

1. A Method of Video Recognition Network of Face Tampering Based on Deep Learning, wherein said Method is based on deep learning, it consists of convolutional layers, fully connected layers and inception modules, inception module significantly reduces the number of layers, thus avoiding overfitting and obtaining better results.

2. A Method as claim 1 said, wherein makes full use of the data from FaceForensics 100++ to implement the deep learning: run iterations on train set, valid set and test set divided from the original data set.

3. A Method as claim 1 said, wherein introduces batch normalization and drop-out to the network, so that overfitting of data can be avoided, batch normalization can add some noise to each hidden layer’s activation, so it has a slight regularization effect, dropout helps to avoid dependence between neurons, making the network more robust, by constantly adjusting the parameters, our invention reaches higher recognition accuracy.

4. A Method as claim 1 said, wherein during the training process, search the optimal parameters by going through different sets of hyperparameters, such as the initial learning rate, learning rate decay and drop-out rate, covering most of the possible conditions, also try different optimizing algorithms like Adam and SGD.