CN113807214A - Small target face recognition method based on DeiT auxiliary network knowledge distillation - Google Patents

Small target face recognition method based on DeiT auxiliary network knowledge distillation Download PDF

Info

Publication number
CN113807214A
Authority
CN
China
Prior art keywords
network
teacher
feature
loss function
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111015756.XA
Other languages
Chinese (zh)
Other versions
CN113807214B (en)
Inventor
宋尧哲
孟方舟
舒子婷
吴萌萌
童官军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202111015756.XA priority Critical patent/CN113807214B/en
Publication of CN113807214A publication Critical patent/CN113807214A/en
Application granted granted Critical
Publication of CN113807214B publication Critical patent/CN113807214B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a small target face recognition method based on DeiT auxiliary network knowledge distillation, which comprises the following steps: building a DeiT network as a student network, building a teacher network, adding a residual connection module behind the teacher network, and training the student network on high-pixel face images by using the teacher network; inputting a small target face image into the trained student network to obtain a second classification feature and a second distillation feature; inputting an image with the same identity but without down-sampling into the teacher network to obtain a second teacher feature; constructing a third loss function from the second classification feature and the true label, constructing a fourth loss function from the second distillation feature and the second teacher feature, and adding the third and fourth loss functions to obtain a second total loss; and, under the second total loss, performing secondary training on the trained DeiT network. The invention can effectively recognize small target face images.

Description

Small target face recognition method based on DeiT auxiliary network knowledge distillation
Technical Field
The invention relates to the technical field of computer vision, in particular to a small target face recognition method based on DeiT auxiliary network knowledge distillation.
Background
With the continual development of deep learning algorithms and of the corresponding large-scale data sets, face recognition has advanced greatly. Under the conditions of a fixed face pose (frontal face), clear images, and a closed-set environment (no unknown categories), face recognition accuracy can exceed 99%.
However, in a surveillance environment, because of practical problems such as low camera resolution, long distance to the face target, and relative motion blur, the small target face images actually captured often exhibit multiple poses (such as side faces or raised heads), low resolution (below 32 × 32 pixels), and noise interference. Meanwhile, because not every detected face can be matched to an identity in the database in a field surveillance environment, small target face recognition also becomes an open-set problem.
For these reasons, face recognition algorithms that perform excellently on fixed-pose, clear images in a closed-set environment often degrade sharply in real environments. The degradation appears not only when an algorithm trained on high-pixel face images is tested directly on small target face images from a surveillance environment, but also when the algorithm is both trained and tested on such small target face images. If training is performed on a high-pixel data set and testing on small target face images, the differing data distributions cause a domain-shift problem and hence overfitting; if training is performed directly on small target face images, features are hard to extract because the resolution is too low (below 32 × 32 pixels), and since no existing public data set provides large-scale low-pixel face recognition data from real environments, it is difficult to obtain a network with discriminative ability.
To address the difficulty of small target face recognition in real environments, the two best-performing current algorithms adopt a CNN-based knowledge distillation method, as follows. The teacher network is a CNN model pre-trained on high-pixel face images; its parameters are frozen during training so that it serves only as a feature extractor. The student network has the same architecture as the teacher and is trained. During training, a high-pixel face image is input into the teacher network and the small target face image obtained by down-sampling the same high-pixel image is input into the student network; a loss function is designed to push the student's penultimate-layer features toward the corresponding teacher layer, so that through knowledge distillation the student model receives the feature information the teacher extracts from the high-pixel image, while the student also learns the small target face image through its own classification loss. When designing the knowledge distillation loss, the traditional algorithm feeds the output features of the high-pixel and low-pixel face images directly into the loss function, which harms high-pixel recognition accuracy; improved methods therefore add a parallel feature layer as input to the loss. Because the teacher's feature layer is highly discriminative on high-pixel face images, designing the loss over the teacher and student feature layers lets the student network obtain the same discriminative features on down-sampled face images of the same ID, improving its discriminative ability on low-pixel face images.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a small target face recognition method based on DeiT auxiliary network knowledge distillation that can effectively recognize small target face images.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a small target face recognition method based on DeiT auxiliary network knowledge distillation is provided, comprising the following steps:
step (1): constructing a DeiT network as the student network, preprocessing the selected training set, and inputting it into the student network to obtain a first classification feature and a first distillation feature;
step (2): selecting a teacher network pre-trained on the data set, and inputting the preprocessed training set into the teacher network to obtain a first teacher feature;
step (3): adding a residual connection module after the last discrimination layer of the teacher network, wherein the residual connection module participates in training;
step (4): constructing a first loss function from the first classification feature and the true label, constructing a second loss function from the first distillation feature and the first teacher feature, and adding the first loss function and the second loss function to obtain a first total loss;
under the first total loss, training the student network on first face images by using the teacher network;
step (5): inputting a second face image into the trained student network to obtain a second classification feature and a second distillation feature;
the pixel resolution of the first face image is higher than that of the second face image;
step (6): inputting a face image with the same identity as the student network's input but without down-sampling into the teacher network to obtain a second teacher feature;
step (7): constructing a third loss function from the second classification feature and the true label, constructing a fourth loss function from the second distillation feature and the second teacher feature, and adding the third loss function and the fourth loss function to obtain a second total loss;
under the second total loss, performing secondary training on the trained student network;
step (8): recognizing a newly input second face image with the secondarily trained student network.
In step (1), the selected training set is preprocessed and then input into the student network, specifically: the size of each image in the training set is adjusted to 224 × 224 by interpolation, 14 × 14 image blocks are cut out at a size of 16 × 16, and the cut image blocks are input into the student network.
In step (1), Vggface2 high-pixel face images are used as the training set.
Step (2) is specifically: an SE+ResNet network pre-trained on the data set is selected as the teacher network, the SE+ResNet parameters are fixed so that it serves as a feature extractor, and the preprocessed training set is input into the teacher network to obtain the first teacher feature.
In step (5), inputting a second face image into the trained student network is specifically: the image is down-sampled to 16 × 16, then enlarged to 224 × 224 by interpolation, and input into the trained student network.
Advantageous effects
By adopting the above technical scheme, the invention has the following advantages and positive effects compared with the prior art. The student network adopts a non-CNN transformer structure as its model framework and uses the transformer's non-local attention mechanism to combine each pixel of the input image with the information of all the remaining pixels, learning global image features; after pre-training, the model's performance loss on low-pixel images is therefore far smaller than that of a CNN framework, and the performance loss and overfitting caused by interpolating the down-sampled image back up to the dimensions of the high-pixel image are avoided. By adding the auxiliary residual connection module to the teacher network, the invention parameterizes the knowledge the teacher should teach, avoids the model-capacity-gap problem, turns the knowledge distillation method into combined online-offline knowledge distillation, adaptively obtains a stable and easily converged model, and lets the student network absorb good information from the teacher network. Through the DeiT-based auxiliary-network knowledge distillation, the invention reaches 71.1% accuracy on the test set of the native low-pixel face data set Tinyface, the highest among end-to-end face recognition algorithms that do not enhance the test set.
Drawings
FIG. 1 is a schematic diagram of a DeiT network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a residual connection module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the overall architecture of a teacher network according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a small target face recognition method based on DeiT auxiliary network knowledge distillation, which specifically comprises the following steps:
1. constructing a deit network as a student network (detailed in figure 1), selecting Vggface2 high-pixel face images as a training set, adjusting the size of the images to 224 × 224 by an interpolation method, splicing and cutting 14 × 14 image blocks according to the size of 16 × 16 according to the images, and inputting the image blocks into the deit network to obtain a first classification characteristic and a first distillation characteristic.
In FIG. 1, the patch tokens are 768-dimensional features obtained by cutting the image into 16 × 16 blocks and encoding them through a linear layer; the class token and the distillation token are learnable embedding vectors with the same dimension as the patch tokens. The class token generates the discrimination layer whose loss is finally computed against the true label, and the distillation token generates the discrimination layer whose loss is finally computed against the teacher network's output.
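A simplified sketch of this token layout, assuming a standard DeiT-style embedding (the class name and zero initialization are illustrative):

```python
import torch
import torch.nn as nn

class DeiTEmbedding(nn.Module):
    """Simplified token layout of the DeiT student in FIG. 1: 196 patch
    tokens from a linear embedding plus learnable class and distillation
    tokens of the same (768) dimension."""
    def __init__(self, patch_dim: int = 3 * 16 * 16, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)           # patch tokens
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, 196, patch_dim), e.g. from the unfold step above
        x = self.proj(patches)                                # (B, 196, 768)
        b = x.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        # the class token heads the loss against the true label; the
        # distillation token heads the loss against the teacher output
        return torch.cat([cls, dist, x], dim=1)               # (B, 198, 768)
```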
2. Selecting the SE+ResNet network pre-trained on the Vggface2 data set as the teacher network and fixing the SE+ResNet parameters so that it serves as a feature extractor, then inputting the same face images as in step 1 into the teacher network to obtain the first teacher feature.
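A minimal sketch of the frozen teacher, using torchvision's resnet50 as a stand-in for the SE+ResNet model (the actual teacher architecture and its Vggface2 weights are assumptions outside this sketch):

```python
import torch
from torchvision.models import resnet50

# resnet50 stands in for the patent's SE+ResNet teacher; in practice the
# weights would come from Vggface2 pre-training.
teacher = resnet50(weights=None)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False   # frozen: the teacher is only a feature extractor

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)     # dummy high-pixel face batch
    first_teacher_feature = teacher(batch)  # first teacher feature
```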
3. Constructing a first loss function from the first classification feature obtained in step 1 and the true label, constructing a second loss function from the first distillation feature obtained in step 1 and the first teacher feature obtained in step 2, and adding the first and second loss functions to obtain the first total loss; under the first total loss, the teacher network is used to train the DeiT network on the high-pixel face images. Here the true label is the person's identity ID before down-sampling, i.e., the true label of step 1.
Steps 1 to 3 pre-train the network with high-pixel face image information before it is trained with down-sampled face image information, so that the network learns basic facial features. This makes it easier to exploit the features learned from high-pixel face images when the network is subsequently trained on low-pixel face images, and avoids the convergence difficulty that arises when training directly on low-pixel face images, where the task is too complex.
4. Adding an auxiliary residual connection module after the last discrimination layer of the teacher network of step 2. The parameters of the last discrimination layer and of everything before it remain frozen, still serving as a feature extractor, while the newly added residual connection module participates in training. FIG. 3 shows the overall architecture of the teacher network with the residual connection module.
FIG. 2 is a schematic diagram of the auxiliary residual connection module. By introducing this module, the knowledge the teacher should teach is parameterized, which avoids the model-capacity-gap problem, turns the knowledge distillation method into combined online-offline knowledge distillation, adaptively yields a stable and easily converged model, and lets the student network absorb good information from the teacher network.
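One way such a module could look is the residual adapter sketched below; the two-layer MLP form and the layer sizes are assumptions for illustration, since the patent does not fix them here:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Sketch of an auxiliary residual connection module: a small trainable
    block appended after the frozen teacher's last discrimination layer,
    whose skip connection keeps the adapted output close to the original
    teacher output while parameterizing the knowledge to be taught."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        # teacher output plus a learned correction
        return z_t + self.block(z_t)

# only the adapter's parameters train; the teacher itself stays frozen
adapter = ResidualAdapter(dim=512)  # dim = teacher output dimension (assumed)
```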
5. Down-sampling the input images to 16 × 16, enlarging the down-sampled small target face images back to 224 × 224 by interpolation, and feeding them to the DeiT model trained in step 3 to obtain the second classification feature and the second distillation feature.
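The step 5 input pipeline can be sketched as follows (bilinear interpolation is assumed for both the down-sampling and the enlargement; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def make_small_target_input(high_res: torch.Tensor) -> torch.Tensor:
    """Down-sample a high-pixel face batch (B, C, H, W) to 16 x 16 to mimic
    a small target face, then interpolate back up to 224 x 224 so it matches
    the student's expected input size."""
    small = F.interpolate(high_res, size=(16, 16), mode="bilinear",
                          align_corners=False)
    return F.interpolate(small, size=(224, 224), mode="bilinear",
                         align_corners=False)
```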
6. Inputting the images with the same identities as those fed to the student network in step 5, but without down-sampling, into the teacher network of step 4 to obtain the second teacher feature.
7. Constructing a third loss function from the second classification feature obtained in step 5 and the true label, constructing a fourth loss function from the second distillation feature obtained in step 5 and the second teacher feature obtained in step 6, and adding the third and fourth loss functions to obtain the second total loss; under the second total loss, the trained DeiT network undergoes secondary training. The true label is the person's identity ID after down-sampling; since down-sampling does not change a person's identity, it is the same as the true label in steps 1 and 3.
Further, the first total loss of step 3 and the second total loss of step 7 can both be expressed as:

$$L_{global} = (1-\lambda)\,L_{CE}(\psi(Z_s), y) + \lambda\tau^{2}\,\mathrm{KL}\left(\psi(Z_s/\tau),\ \psi(Z_t/\tau)\right)$$

where λ is a weighting coefficient balancing the two terms of the first or second total loss, set to 0.5 in this embodiment; ψ(·) is the softmax function; $Z_s$ is the output of the trained DeiT network and $Z_t$ is the output of the teacher network; y is the true label, i.e., the identity ID of the person in the face image; and τ is the knowledge distillation temperature coefficient, set to 1.25 in this embodiment. Dividing $Z_s$ and $Z_t$ by the temperature τ softens the outputs of the teacher network and the student (DeiT) network so that knowledge distillation proceeds better; $\psi(Z_s/\tau)$ and $\psi(Z_t/\tau)$ are the softened student and teacher outputs after the softmax function.
$L_{CE}(\cdot)$ is the cross-entropy loss function, which for a one-hot true label y can be expressed as:

$$L_{CE}(\psi(Z_s), y) = -\sum_{i} y_i \log \psi(Z_s)_i$$
$\mathrm{KL}(\cdot)$ is the KL divergence, which for probability distributions p and q can be expressed as:

$$\mathrm{KL}(p \parallel q) = \sum_{i} p_i \log \frac{p_i}{q_i}$$
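Taken together, the total loss of steps 3 and 7 might be sketched in PyTorch as below; this is a sketch assuming both networks emit raw logits and that the KL term follows the usual distillation convention in which the softened teacher distribution is the reference, with λ = 0.5 and τ = 1.25 as in this embodiment:

```python
import torch
import torch.nn.functional as F

def total_loss(z_s: torch.Tensor, z_t: torch.Tensor, y: torch.Tensor,
               lam: float = 0.5, tau: float = 1.25) -> torch.Tensor:
    """Sketch of L_global = (1-lam)*CE + lam*tau^2*KL for raw student
    logits z_s, teacher logits z_t, and integer class labels y."""
    ce = F.cross_entropy(z_s, y)                    # cross-entropy term
    # kl_div(log_student, teacher) uses the teacher distribution as reference
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=1),  # softened student
                  F.softmax(z_t / tau, dim=1),      # softened teacher
                  reduction="batchmean")
    return (1 - lam) * ce + lam * (tau ** 2) * kl
```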
8. Recognizing the input small target face images with the secondarily trained DeiT network.
9. In tests on the public Tinyface data set, the method reaches 71.1% Rank-1 accuracy, the highest among current algorithms on this data set; specific results are shown in Table 1.
Table 1 Comparison of experimental results

Model           | Rank-1 | Rank-20 | mAP
DeepId2         | 17.4   | 25.2    | 12.1
SphereFace      | 22.3   | 35.5    | 16.2
VGGFace         | 30.4   | 40.4    | 23.1
CenterFace      | 32.1   | 44.5    | 24.6
CSRI            | 45.2   | 60.2    | 39.9
T-C             | 58.6   | 73.0    | 52.7
Shi             | 63.9   | /       | /
SafwanKhalid    | 70.4   | 82.2    | 63.2
This embodiment | 71.13  | 84.09   | 64.58
In summary, the student network of the invention adopts a non-CNN transformer structure as its model framework and uses the transformer's non-local attention mechanism to combine each pixel of the input image with the information of all the remaining pixels, learning global image features, so that after pre-training the model's performance loss on low-pixel images is far smaller than that of a CNN framework; and by adding the auxiliary residual connection module to the teacher network, the invention turns the knowledge distillation method into combined online-offline knowledge distillation, adaptively obtains a stable and easily converged model, and lets the student network absorb good information from the teacher network.

Claims (5)

1. A small target face recognition method based on DeiT auxiliary network knowledge distillation, characterized by comprising the following steps:
step (1): constructing a DeiT network as the student network, preprocessing the selected training set, and inputting it into the student network to obtain a first classification feature and a first distillation feature;
step (2): selecting a teacher network pre-trained on the data set, and inputting the preprocessed training set into the teacher network to obtain a first teacher feature;
step (3): adding a residual connection module after the last discrimination layer of the teacher network, wherein the residual connection module participates in training;
step (4): constructing a first loss function from the first classification feature and the true label, constructing a second loss function from the first distillation feature and the first teacher feature, and adding the first loss function and the second loss function to obtain a first total loss;
under the first total loss, training the student network on first face images by using the teacher network;
step (5): inputting a second face image into the trained student network to obtain a second classification feature and a second distillation feature;
the pixel resolution of the first face image being higher than that of the second face image;
step (6): inputting a face image with the same identity as the student network's input but without down-sampling into the teacher network to obtain a second teacher feature;
step (7): constructing a third loss function from the second classification feature and the true label, constructing a fourth loss function from the second distillation feature and the second teacher feature, and adding the third loss function and the fourth loss function to obtain a second total loss;
under the second total loss, performing secondary training on the trained student network;
step (8): recognizing a newly input second face image with the secondarily trained student network.
2. The small target face recognition method based on DeiT auxiliary network knowledge distillation of claim 1, wherein in step (1) the selected training set is preprocessed and then input into the student network, specifically: the size of each image in the training set is adjusted to 224 × 224 by interpolation, 14 × 14 image blocks are cut out at a size of 16 × 16, and the cut image blocks are input into the student network.
3. The small target face recognition method based on DeiT auxiliary network knowledge distillation of claim 1, wherein in step (1) Vggface2 high-pixel face images are used as the training set.
4. The small target face recognition method based on DeiT auxiliary network knowledge distillation of claim 1, wherein step (2) is specifically: an SE+ResNet network pre-trained on the data set is selected as the teacher network, the SE+ResNet parameters are fixed so that it serves as a feature extractor, and the preprocessed training set is input into the teacher network to obtain the first teacher feature.
5. The small target face recognition method based on DeiT auxiliary network knowledge distillation of claim 1, wherein inputting a second face image into the trained student network in step (5) is specifically: the image is down-sampled to 16 × 16, then enlarged to 224 × 224 by interpolation, and input into the trained student network.
CN202111015756.XA 2021-08-31 2021-08-31 Small target face recognition method based on DeiT auxiliary network knowledge distillation Active CN113807214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111015756.XA CN113807214B (en) Small target face recognition method based on DeiT auxiliary network knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111015756.XA CN113807214B (en) Small target face recognition method based on DeiT auxiliary network knowledge distillation

Publications (2)

Publication Number Publication Date
CN113807214A (en) 2021-12-17
CN113807214B CN113807214B (en) 2024-01-05

Family

ID=78894457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111015756.XA Active CN113807214B (en) Small target face recognition method based on DeiT auxiliary network knowledge distillation

Country Status (1)

Country Link
CN (1) CN113807214B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947801A (en) * 2021-12-21 2022-01-18 中科视语(北京)科技有限公司 Face recognition method and device and electronic equipment
CN114332007A (en) * 2021-12-28 2022-04-12 福州大学 Transformer-based industrial defect detection and identification method
CN114743243A (en) * 2022-04-06 2022-07-12 平安科技(深圳)有限公司 Human face recognition method, device, equipment and storage medium based on artificial intelligence
CN114332007B (en) * 2021-12-28 2024-06-28 福州大学 Industrial defect detection and identification method based on Transformer

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830813A (zh) * 2018-06-12 2018-11-16 福建帝视信息科技有限公司 Image super-resolution enhancement method based on knowledge distillation
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
CN110458765A (zh) * 2019-01-25 2019-11-15 西安电子科技大学 Image quality enhancement method based on perception-preserving convolutional networks
CN110674688A (en) * 2019-08-19 2020-01-10 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning
CN111160533A (en) * 2019-12-31 2020-05-15 中山大学 Neural network acceleration method based on cross-resolution knowledge distillation
CN111291637A (zh) * 2020-01-19 2020-06-16 中国科学院上海微系统与信息技术研究所 Face detection method, device and equipment based on convolutional neural network
CN111444760A (en) * 2020-02-19 2020-07-24 天津大学 Traffic sign detection and identification method based on pruning and knowledge distillation
CN112116030A (en) * 2020-10-13 2020-12-22 浙江大学 Image classification method based on vector standardization and knowledge distillation
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment
WO2021047286A1 (en) * 2019-09-12 2021-03-18 华为技术有限公司 Text processing model training method, and text processing method and apparatus
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN112784964A (en) * 2021-01-27 2021-05-11 西安电子科技大学 Image classification method based on bridging knowledge distillation convolution neural network
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
WO2021118737A1 (en) * 2019-12-11 2021-06-17 Microsoft Technology Licensing, Llc Sentence similarity scoring using neural network distillation
CN112988975A (en) * 2021-04-09 2021-06-18 北京语言大学 Viewpoint mining method based on ALBERT and knowledge distillation
CN113205002A (en) * 2021-04-08 2021-08-03 南京邮电大学 Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
CN113240580A (en) * 2021-04-09 2021-08-10 暨南大学 Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN113257361A (en) * 2021-05-31 2021-08-13 中国科学院深圳先进技术研究院 Method, device and equipment for realizing self-adaptive protein prediction framework

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
CN108830813A (zh) * 2018-06-12 2018-11-16 福建帝视信息科技有限公司 Image super-resolution enhancement method based on knowledge distillation
CN110458765A (zh) * 2019-01-25 2019-11-15 西安电子科技大学 Image quality enhancement method based on perception-preserving convolutional networks
CN110674688A (en) * 2019-08-19 2020-01-10 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
WO2021047286A1 (en) * 2019-09-12 2021-03-18 华为技术有限公司 Text processing model training method, and text processing method and apparatus
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
WO2021118737A1 (en) * 2019-12-11 2021-06-17 Microsoft Technology Licensing, Llc Sentence similarity scoring using neural network distillation
CN111160533A (en) * 2019-12-31 2020-05-15 中山大学 Neural network acceleration method based on cross-resolution knowledge distillation
CN111291637A (zh) * 2020-01-19 2020-06-16 中国科学院上海微系统与信息技术研究所 Face detection method, device and equipment based on convolutional neural network
CN111444760A (en) * 2020-02-19 2020-07-24 天津大学 Traffic sign detection and identification method based on pruning and knowledge distillation
CN112116030A (en) * 2020-10-13 2020-12-22 浙江大学 Image classification method based on vector standardization and knowledge distillation
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN112784964A (en) * 2021-01-27 2021-05-11 西安电子科技大学 Image classification method based on bridging knowledge distillation convolution neural network
CN113205002A (en) * 2021-04-08 2021-08-03 南京邮电大学 Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
CN112988975A (en) * 2021-04-09 2021-06-18 北京语言大学 Viewpoint mining method based on ALBERT and knowledge distillation
CN113240580A (en) * 2021-04-09 2021-08-10 暨南大学 Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN113257361A (en) * 2021-05-31 2021-08-13 中国科学院深圳先进技术研究院 Method, device and equipment for realizing self-adaptive protein prediction framework

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
JI WON YOON et al.: "TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29
MENGMENG WU et al.: "Contact Angle of an Evaporating Droplet of Binary Solution on a Super Wetting Surface", arXiv
NANDAN KUMAR JHA et al.: "On the Demystification of Knowledge Distillation: A Residual Network Perspective", arXiv
QIANWEI ZHOU et al.: "A Seismic-Based Feature Extraction Algorithm for Robust Ground Target Classification", IEEE Signal Processing Letters, vol. 19
QIU XIPENG; SUN TIANXIANG; XU YIGE; SHAO YUNFAN; DAI NING; HUANG XUANJING: "Pre-trained models for natural language processing: A survey", Science China (Technological Sciences), no. 10
姜慧明: "Face Restoration and Expression Recognition Based on Generative Adversarial Networks and Knowledge Distillation" (in Chinese), China Master's Theses Full-text Database
李唐薇 et al.: "A Survey of Object Detection and Recognition Algorithms for Large Field of View" (in Chinese), Laser & Optoelectronics Progress, vol. 57, no. 12
赵琼 et al.: "Object Detection Algorithm Based on Improved YOLOv3" (in Chinese), Laser & Optoelectronics Progress, vol. 57, no. 12
高新波; 路文; 查林; 惠政; 亓统帅; 姜建德: "Ultra-High-Definition Video Quality Enhancement Technology and Its Chip-Level Solution" (in Chinese), Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 05

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947801A (en) * 2021-12-21 2022-01-18 中科视语(北京)科技有限公司 Face recognition method and device and electronic equipment
CN113947801B (en) * 2021-12-21 2022-07-26 中科视语(北京)科技有限公司 Face recognition method and device and electronic equipment
CN114332007A (en) * 2021-12-28 2022-04-12 福州大学 Transformer-based industrial defect detection and identification method
CN114332007B (en) * 2021-12-28 2024-06-28 福州大学 Industrial defect detection and identification method based on Transformer
CN114743243A (en) * 2022-04-06 2022-07-12 平安科技(深圳)有限公司 Human face recognition method, device, equipment and storage medium based on artificial intelligence
CN114743243B (en) * 2022-04-06 2024-05-31 平安科技(深圳)有限公司 Human face recognition method, device, equipment and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN113807214B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN112926396B (en) Action identification method based on double-current convolution attention
CN111583263B (en) Point cloud segmentation method based on joint dynamic graph convolution
CN111612051B (en) Weak supervision target detection method based on graph convolution neural network
CN111652202B (en) Method and system for solving video question-answer problem by improving video-language representation learning through self-adaptive space-time diagram model
CN104866810A (en) Face recognition method of deep convolutional neural network
JP2020123330A (en) Method for acquiring sample image for label acceptance inspection from among auto-labeled images utilized for neural network learning, and sample image acquisition device utilizing the same
JP2015095215A (en) Learning device, learning program, and learning method
CN113807214B (en) Small target face recognition method based on deit affiliated network knowledge distillation
CN111598167B (en) Small sample image identification method and system based on graph learning
CN112036447A (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN112307982A (en) Human behavior recognition method based on staggered attention-enhancing network
CN113408343B (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN107247952B (en) Deep supervision-based visual saliency detection method for cyclic convolution neural network
CN112085055A (en) Black box attack method based on migration model Jacobian array feature vector disturbance
CN112036276A (en) Artificial intelligent video question-answering method
CN111898566B (en) Attitude estimation method, attitude estimation device, electronic equipment and storage medium
CN114064627A (en) Knowledge graph link completion method and system for multiple relations
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN117201122A (en) Unsupervised attribute network anomaly detection method and system based on view level graph comparison learning
CN114241191A (en) Cross-modal self-attention-based non-candidate-box expression understanding method
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN112862860A (en) Object perception image fusion method for multi-modal target tracking
CN116681128A (en) Neural network model training method and device with noisy multi-label data
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN112381056B (en) Cross-domain pedestrian re-identification method and system fusing multiple source domains

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant