CN113947801B - Face recognition method and device and electronic equipment - Google Patents


Info

Publication number
CN113947801B
Authority
CN
China
Prior art keywords
scene
sample
model
face recognition
image
Prior art date
Legal status
Active
Application number
CN202111567512.2A
Other languages
Chinese (zh)
Other versions
CN113947801A (en)
Inventor
王金桥
赵朝阳
郭凯文
Current Assignee
Objecteye Beijing Technology Co Ltd
Original Assignee
Objecteye Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Objecteye Beijing Technology Co Ltd filed Critical Objecteye Beijing Technology Co Ltd
Priority to CN202111567512.2A
Publication of CN113947801A
Application granted
Publication of CN113947801B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/214: Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a face recognition method, a face recognition device and electronic equipment. The method comprises the following steps: determining an image to be recognized; and inputting the image to be recognized into a face recognition model fusing a plurality of scenes, to obtain a face recognition result output by the face recognition model. The face recognition model is obtained by distillation training based on sample images and on the face recognition results of each scene output by the teacher model corresponding to that scene. According to the invention, a face recognition model fusing a plurality of scenes is obtained through distillation training, and face recognition is performed based on that model; while the scale of the model is compressed and the computation load is reduced, the multi-scene face recognition effect achieved with a single model is improved, thereby realizing an accurate and reliable face recognition scheme applicable to different scenes.

Description

Face recognition method and device and electronic equipment
Technical Field
The invention relates to the technical field of computer vision, in particular to a face recognition method and device and electronic equipment.
Background
In the face data sets published so far, the images are basically of celebrities, and the number of images per person is large. In an actual business scenario, however, some persons have only a few images (for example, about 2-3). If the images of these persons are pooled with public face data to train a face recognition model, the model overfits; moreover, the domain difference between the two kinds of data is large, which seriously harms the generalization of the model, so the trained face recognition model has low recognition accuracy.
Disclosure of Invention
The invention provides a face recognition method, a face recognition device and electronic equipment, which are used to overcome the defects of low recognition accuracy and poor generalization of face recognition models in the prior art.
The invention provides a face recognition method, which comprises the following steps:
determining an image to be recognized;
inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model;
the face recognition model is obtained by performing distillation training based on sample images and the face recognition results of each scene output by the teacher model corresponding to that scene.
According to the face recognition method provided by the invention, the teacher model corresponding to each scene is obtained by training based on the following steps:
determining a sample scene image set corresponding to each scene, wherein the set comprises a plurality of sample scene images;
determining a normal sample and a virtual sample corresponding to each scene from the sample scene image set corresponding to each scene, based on the number of times the person contained in each sample scene image of the scene appears in the set;
training an original model of a teacher model based on a normal sample and a virtual sample corresponding to each scene, and identity information of faces contained in the normal sample and the virtual sample to obtain the teacher model of each scene.
According to the face recognition method provided by the present invention, the determining of the normal sample and the virtual sample corresponding to each scene from the sample scene image set corresponding to each scene, based on the number of times the person contained in each sample scene image of the scene appears in the set, comprises:
when the number of times the person contained in a sample scene image appears in the set is greater than or equal to a threshold value, taking that sample scene image as a normal sample of the corresponding scene; and when the number of times the person appears in the set is less than the threshold value, taking that sample scene image as a virtual sample of the corresponding scene.
According to the face recognition method provided by the invention, the training of the original model of the teacher model based on the normal sample and the virtual sample corresponding to each scene and the identity information of the face contained in the normal sample and the virtual sample to obtain the teacher model of each scene comprises the following steps:
training an original model of the teacher model based on normal samples corresponding to the scenes and identity information including faces in the normal samples to obtain initial models corresponding to the teacher model of the scenes;
inputting the virtual sample corresponding to each scene into the initial model corresponding to the teacher model of each scene to obtain a face recognition result corresponding to each scene virtual sample output by the initial model corresponding to the teacher model of each scene;
inputting the normal samples of each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to the normal samples of each scene output by the initial model corresponding to the teacher model of each scene;
training the initial model corresponding to the teacher model of each scene based on the face recognition result corresponding to each scene virtual sample, the face recognition result corresponding to each scene normal sample, the identity information of each scene virtual sample containing the face and the identity information of each scene normal sample containing the face to obtain the teacher model of each scene.
According to the face recognition method provided by the invention, the virtual samples corresponding to the scenes are input into the initial models corresponding to the teacher model of the scenes to obtain the face recognition results corresponding to the virtual samples of the scenes output by the initial models corresponding to the teacher model of the scenes, and the face recognition method comprises the following steps:
inputting the virtual samples corresponding to each scene into a feature extraction layer in the initial model corresponding to the teacher model of each scene, to obtain feature vectors of the virtual samples of each scene output by the feature extraction layer and feature templates corresponding to the virtual samples of each scene; the feature template corresponding to the virtual samples of each scene is the average vector of the feature vectors of the virtual samples of that scene;
inputting the feature vectors of the virtual samples of each scene and the feature templates corresponding to the virtual samples of each scene into a face recognition layer in an initial model corresponding to a teacher model of each scene to obtain a face recognition result corresponding to the virtual samples of each scene output by the face recognition layer;
the inputting the normal samples of each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to the normal samples of each scene output by the initial model corresponding to the teacher model of each scene includes:
inputting the normal sample corresponding to each scene into a feature extraction layer in the initial model corresponding to the teacher model of each scene to obtain the feature vector of the normal sample of each scene output by the feature extraction layer;
and inputting the characteristic vectors of the normal samples of each scene into a face recognition layer in the initial model corresponding to the teacher model of each scene to obtain a face recognition result corresponding to the normal samples of each scene output by the face recognition layer.
According to the face recognition method provided by the invention, the sample scene image corresponding to each scene is determined based on the following steps:
acquiring initial sample scene images corresponding to all scenes, performing face key point detection on the initial sample scene images, and determining face detection frame images of all the initial sample scene images;
and carrying out face alignment based on the face key point information in each face detection frame image, and cutting each image after face alignment into a preset size to obtain a sample scene image corresponding to each scene.
According to the face recognition method provided by the invention, the face recognition model is obtained by training based on the following steps:
determining an initial model fusing a plurality of scenes;
and taking the initial model fused with the plurality of scenes as a student model, and carrying out distillation training on the student model based on the sample image and the face recognition result of each scene output by the teacher model corresponding to each scene to obtain the face recognition model.
According to the face recognition method provided by the invention, the distilling training is performed on the student model based on the sample image and the face recognition result of each scene output by the teacher model corresponding to each scene to obtain the face recognition model, and the method comprises the following steps:
inputting the sample scene images of all scenes in the sample images to the teacher model of the corresponding scene to obtain face recognition results of all scenes output by the teacher model of all scenes;
inputting the sample scene images of each scene in the sample images into the student model to obtain student recognition results of each scene output by the student model;
determining a loss function of the student model based on the student recognition results of each scene and the face recognition results of each scene;
and training the student model based on the loss function of the student model to obtain the face recognition model.
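The steps above can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: each scene's sample batch is fed to the student and to that scene's teacher, and a soft-target distillation loss is averaged over scenes. The temperature T and the callable model interfaces are assumptions.

```python
import numpy as np

def multi_scene_distill_step(batches_by_scene, student, teachers, T=4.0):
    """One distillation step over several scenes: each scene's batch is fed
    to the student and to that scene's teacher, and the per-scene soft-target
    losses are averaged. `student` and `teachers[scene]` are assumed to be
    callables mapping a batch of images to class logits."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
    losses = []
    for scene, batch in batches_by_scene.items():
        p_t = softmax(teachers[scene](batch) / T)
        p_s = softmax(student(batch) / T)
        # KL divergence from the teacher distribution to the student distribution
        losses.append(np.mean(np.sum(
            p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1)))
    return float(np.mean(losses))
```

In an actual training loop this scalar would be backpropagated through the student only; the teachers stay frozen.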
The present invention also provides a face recognition apparatus, comprising:
a determination unit for determining an image to be recognized;
the recognition unit is used for inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model;
the face recognition model is obtained by performing distillation training on the basis of the sample image, the face recognition result of each scene output by the teacher model corresponding to each scene and the teacher model corresponding to each scene.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any of the above-mentioned face recognition methods when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the face recognition method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the face recognition method as described in any one of the above.
According to the face recognition method, device and electronic equipment provided by the invention, a face recognition model fusing a plurality of scenes is obtained through distillation training, and face recognition is performed based on that model. While the scale of the model is compressed and the computation load is reduced, the multi-scene face recognition effect achieved with a single model is improved, thereby realizing an accurate and reliable face recognition scheme applicable to different scenes.
Drawings
In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart of a face recognition method provided by the present invention;
FIG. 2 is a schematic flow chart of a teacher model training method provided by the present invention;
FIG. 3 is a schematic flow chart of a sample scene image acquisition method provided by the present invention;
FIG. 4 is a flow chart of a student model training method provided by the invention;
FIG. 5 is a schematic structural diagram of a face recognition apparatus provided in the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the face data sets published so far, the images are basically of celebrities, and the number of images per person is large. In an actual business scenario, however, some persons have only a few images (for example, about 2-3). If the images of these persons are pooled with public face data to train a face recognition model, the model overfits; moreover, the domains of the two kinds of data differ greatly, which seriously harms the generalization of the model, so the trained face recognition model has low recognition accuracy.
In view of the above, the present invention provides a face recognition method. Fig. 1 is a schematic flow chart of a face recognition method provided by the present invention, and as shown in fig. 1, the method includes the following steps:
step 110, determining an image to be identified.
Specifically, the image to be recognized is an image on which face recognition needs to be performed; it may be an image captured by a camera device or an image directly input by a user.
It can be understood that, after the image to be recognized is determined, face key point detection may be performed on it to obtain a face detection frame, and face alignment may then be performed so that the aligned image shows a frontal face. The aligned image can then be fed to the subsequent face recognition model, which improves face recognition efficiency and accuracy.
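As an illustrative sketch of the alignment step (the patent does not specify an algorithm), the following computes a least-squares similarity transform mapping five detected facial landmarks onto canonical template positions. The template coordinates and the 112x112 crop size are assumed values commonly used in face recognition pipelines, not values from the patent; an image library (e.g. cv2.warpAffine) would apply the returned warp to the pixels.

```python
import numpy as np

# Canonical positions of 5 landmarks (eye centres, nose tip, mouth corners)
# for a 112x112 crop. These coordinates are illustrative assumptions.
TEMPLATE = np.array([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2]
], dtype=np.float64)

def similarity_transform(src, dst):
    """Least-squares similarity transform (Umeyama method) mapping src -> dst."""
    src_mean, dst_mean = src.mean(0), dst.mean(0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])  # 2x3 affine matrix

def align_face(landmarks):
    """Return the 2x3 warp that maps detected landmarks onto the template."""
    return similarity_transform(np.asarray(landmarks, dtype=np.float64), TEMPLATE)
```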
Step 120, inputting an image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model;
the face recognition model is obtained by performing distillation training based on sample images and the face recognition results of each scene output by the teacher model corresponding to that scene.
Specifically, the teacher models corresponding to the scenes are face recognition models obtained by training on images from different scenes, and each teacher model can perform high-precision face recognition in its corresponding scene. When face recognition needs to be performed across different scenes, a face recognition model fusing multiple scenes can be used on the image to be recognized. This fused model is a student model obtained by distillation training from the teacher models of the scenes, so each teacher model migrates the face information it learned in its scene into the student model, and the trained face recognition model can recognize faces in different scenes.
In addition, each teacher model is trained on images from its corresponding scene, in which the number of images per person is relatively balanced. This avoids the overfitting that occurs in traditional methods when the few images of particular persons in an actual scene are jointly trained with the much larger number of celebrity images in public data. In other words, in the embodiment of the invention, each teacher model is trained in a specific scene and therefore recognizes well in that scene, while the student model obtained by distillation learning can learn face recognition information across different scenes; that is, the trained face recognition model overcomes the poor generalization of face recognition across various scenes.
Moreover, the teacher model of each scene is larger in scale, more complex, and better at the task than the student model. Based on the teacher-student network idea, a single teacher model can transfer its knowledge to the student model and thereby improve the student network's performance; this knowledge transfer process is knowledge distillation. Transferring the knowledge of the teacher models of multiple scenes into the same face recognition model brings the performance of that model closer to the performance of each teacher model.
Before step 120 is executed, the face recognition model may be obtained by pre-training. The specific training mode is distillation training, and the training steps may include: first, acquiring a large number of sample images, together with the teacher models corresponding to the scenes; then, performing distillation training on the initial model of the face recognition model based on the sample images and the face recognition results of each scene output by the corresponding teacher models, thereby obtaining the face recognition model.
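A minimal sketch of one distillation objective consistent with the description, assuming the teacher and student produce class logits. The soft-target temperature T and the weight alpha are assumed hyperparameters, not values from the patent.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term (teacher -> student) and a
    hard-label cross-entropy term. `alpha` balances the two terms and is an
    assumed hyperparameter; the T**2 factor keeps the soft-term gradient
    scale comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1))
    ce = -np.mean(np.log(
        softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12))
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

In the multi-scene setting of the patent, the teacher logits for each batch would come from the teacher model of the scene the batch was sampled from.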
The face recognition method provided by the embodiment of the invention obtains a face recognition model fusing a plurality of scenes through distillation training and performs face recognition based on that model. While the scale of the model is compressed and the computation load is reduced, the multi-scene face recognition effect achieved with a single model is improved, thereby realizing an accurate and reliable face recognition scheme applicable to different scenes.
Based on the above embodiment, the teacher model corresponding to each scene is obtained by training based on the following steps:
determining a sample scene image set corresponding to each scene, wherein the set comprises a plurality of sample scene images;
determining a normal sample and a virtual sample corresponding to each scene from the sample scene image set corresponding to each scene, based on the number of times the person contained in each sample scene image of the scene appears in the set;
and training the original model of the teacher model based on the normal sample and the virtual sample corresponding to each scene and the identity information of the face contained in the normal sample and the virtual sample to obtain the teacher model of each scene.
Specifically, the teacher model of each scene can be trained on a sample scene image set acquired from the corresponding scene. However, different persons appear different numbers of times across the images in such a set. When a person appears in the set many times, the corresponding sample scene images are strongly correlated with the scene; those images can be used as normal samples, and each normal sample is given a normal sample label representing the identity of the face it contains. When a person appears in the set only a few times, the corresponding sample scene images are weakly correlated with the scene, i.e. they can be regarded as a weak-correlation class, and can therefore be used as virtual samples. For example, in the sample scene image set of scene A, if person 1 appears 6 times (more than a threshold of 5), the corresponding sample scene images can be used as normal samples; if person 2 appears 3 times (fewer than 5), the corresponding sample scene images can be used as virtual samples.
After the normal samples and virtual samples are obtained, the original model of the teacher model can be trained based on the normal samples, the virtual samples, and the identity information of the faces they contain, to obtain a teacher model that can accurately recognize faces in the corresponding scene.
The original model of the teacher model can be obtained by training on a public data set. For example, celebrity images, more than 5 per celebrity, can be collected from public websites by a crawler; a convolutional neural network is then trained as the original face recognition model, using a CosFace or ArcFace classification loss function.
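A minimal numpy sketch of the additive-margin idea behind the CosFace loss mentioned above (ArcFace instead adds the margin to the angle itself); the scale s and margin m are assumed hyperparameter values, not values from the patent.

```python
import numpy as np

def cosface_logits(embeddings, weights, labels, s=30.0, m=0.35):
    """CosFace-style logits: cosine similarity between L2-normalised
    embeddings and class weight vectors, with an additive margin m
    subtracted from the target class before scaling by s."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (batch, num_classes)
    cos[np.arange(len(labels)), labels] -= m        # penalise the target class
    return s * cos

def cross_entropy(logits, labels):
    """Numerically stable softmax cross-entropy."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])
```

The margin forces the target-class cosine to beat all others by at least m, which tightens intra-class clusters in the embedding space.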
Based on any of the embodiments, determining the normal sample and the virtual sample corresponding to each scene from the sample scene image set corresponding to each scene based on the number of times that the characters included in each sample scene image corresponding to each scene appear in the set includes:
when the number of times the person contained in a sample scene image appears in the set is greater than or equal to a threshold value, taking that sample scene image as a normal sample of the corresponding scene; and when the number of times is less than the threshold value, taking that sample scene image as a virtual sample of the corresponding scene.
Specifically, when a person appears in the set many times, the corresponding sample scene image is strongly correlated with the scene and can be used as a normal sample, to which a normal sample label is added. When a person appears in the set only a few times, the corresponding sample scene image is weakly correlated with the scene, i.e. it can be regarded as a weak-correlation class, and can therefore be used as a virtual sample.
For example, when a person in a scene has 5 or more corresponding sample scene images, those images can be classified as normal samples; when a person has fewer than 5 corresponding sample scene images, those images can be classified as virtual samples.
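The split described above can be sketched as follows, assuming each sample is represented as an (image, person_id) pair (an assumed representation) and using the threshold of 5 from the example:

```python
from collections import Counter

def split_samples(scene_images, threshold=5):
    """Partition one scene's sample images into normal and virtual samples
    according to how often each person appears in the set."""
    counts = Counter(pid for _, pid in scene_images)
    normal = [(img, pid) for img, pid in scene_images if counts[pid] >= threshold]
    virtual = [(img, pid) for img, pid in scene_images if counts[pid] < threshold]
    return normal, virtual
```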
It should be noted that, when the person contained in a sample scene image appears in the set fewer times than the threshold, if such images were used for training as normal samples together with the images whose persons appear at least the threshold number of times, the large difference in image counts between the two kinds of images (which can be understood as a large domain gap) would cause the teacher model of the corresponding scene to overfit.
Therefore, determining the normal samples and virtual samples of each scene from the scene's sample scene image set, based on the number of times each contained person appears in the set, avoids overfitting of the teacher model in the corresponding scene and thus improves face recognition accuracy.
Based on any of the above embodiments, training the original model of the teacher model based on the normal sample and virtual sample corresponding to each scene, and the identity information of the faces contained in the normal sample and the virtual sample, to obtain the teacher model of each scene, includes:
training an original model of the teacher model based on the normal sample corresponding to each scene and the identity information of the face contained in the normal sample to obtain an initial model corresponding to the teacher model of each scene;
inputting the virtual samples corresponding to the scenes into the initial models corresponding to the teacher models of the scenes to obtain face recognition results corresponding to the virtual samples of the scenes output by the initial models corresponding to the teacher models of the scenes;
inputting the normal samples of each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to the normal samples of each scene output by the initial model corresponding to the teacher model of each scene;
training an initial model corresponding to the teacher model of each scene based on the face recognition result corresponding to each scene virtual sample, the face recognition result corresponding to each scene normal sample, the identity information of each scene virtual sample containing the face and the identity information of each scene normal sample containing the face to obtain the teacher model of each scene.
Specifically, the original model of the teacher model may be trained based on public data sets, for example, images of celebrities, each of which is greater than 5, may be collected on a public website by a crawler method, and then the original model for face recognition may be trained using a convolutional neural network.
After the original model of the teacher model is obtained, it is trained on the normal samples of each scene and the corresponding normal sample labels (the identity information of the faces contained in the normal samples). The original model is thereby fine-tuned, yielding the initial model of the teacher model in the corresponding scene.
Then, inputting the virtual samples of each scene into the initial model of the teacher model in the corresponding scene to obtain the face recognition results corresponding to the virtual samples of each scene output by the initial model of the teacher model in the corresponding scene; meanwhile, the normal samples of the scenes are input into the initial models corresponding to the teacher model of the scenes, and face recognition results corresponding to the normal samples of the scenes output by the initial models corresponding to the teacher model of the scenes are obtained.
Then, training the initial model of the teacher model of each scene based on the face recognition result corresponding to each scene virtual sample, the face recognition result corresponding to each scene normal sample, the identity information that each scene virtual sample contains the face, and the identity information that each scene normal sample contains the face, so as to obtain the teacher model in the corresponding scene.
Based on any of the embodiments, inputting the virtual sample corresponding to each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to the virtual sample of each scene output by the initial model corresponding to the teacher model of each scene, including:
inputting the virtual samples corresponding to the scenes into a feature extraction layer in the initial model corresponding to the teacher model of each scene to obtain feature vectors of the virtual samples of each scene output by the feature extraction layer and feature templates corresponding to the virtual samples of each scene; the characteristic template corresponding to each scene virtual sample is the average vector of the characteristic vectors of the virtual samples of each scene;
inputting the feature vectors of the virtual samples of each scene and the feature templates corresponding to the virtual samples of each scene into a face recognition layer in the initial model corresponding to the teacher model of each scene to obtain face recognition results corresponding to the virtual samples of each scene output by the face recognition layer;
inputting the normal samples of each scene into the initial model corresponding to the teacher model of each scene, and obtaining the face recognition result corresponding to the normal samples of each scene output by the initial model corresponding to the teacher model of each scene, including:
inputting the normal sample corresponding to each scene into the feature extraction layer in the initial model corresponding to the teacher model of each scene to obtain the feature vector of the normal sample of each scene output by the feature extraction layer;
and inputting the feature vectors of the normal samples of each scene into a face recognition layer in the initial model corresponding to the teacher model of each scene to obtain face recognition results corresponding to the normal samples of each scene output by the face recognition layer.
Specifically, the virtual samples corresponding to each scene are input into the feature extraction layer in the initial model corresponding to the teacher model of each scene, feature extraction is performed by the feature extraction layer to obtain feature vectors of the virtual samples of each scene, an average vector of the feature vectors of the virtual samples of each scene is obtained to serve as a feature template corresponding to the virtual samples of the corresponding scene, and virtual sample labels are added to the virtual samples of each scene (the virtual samples include identity information of human faces).
And then inputting the feature vectors of the virtual samples of each scene and the feature templates corresponding to the virtual samples of each scene into a face recognition layer in the initial model corresponding to the teacher model of each scene, and determining the face recognition result corresponding to the virtual samples of each scene by the face recognition layer based on the feature vectors of the virtual samples of each scene and the feature templates corresponding to the virtual samples of each scene.
Similarly, the normal samples corresponding to the scenes are input into the feature extraction layer in the initial model corresponding to the teacher model of the scenes, and feature extraction is performed by the feature extraction layer to obtain the feature vectors of the normal samples of the scenes.
And then inputting the feature vectors of the normal samples of each scene and the feature templates corresponding to the virtual samples of each scene into a face recognition layer in the initial model corresponding to the teacher model of each scene, and determining the face recognition result corresponding to the normal samples of each scene by the face recognition layer based on the feature vectors of the normal samples of each scene and the feature templates corresponding to the virtual samples of each scene.
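The feature-template mechanism described above can be sketched as follows. This is an illustrative numpy sketch only; the function names and the cosine-similarity scoring are assumptions, not details taken from the patent:

```python
import numpy as np

def feature_template(virtual_feats: np.ndarray) -> np.ndarray:
    # Feature template of a scene's virtual samples: the average of their
    # L2-normalised feature vectors, re-normalised to unit length.
    feats = virtual_feats / np.linalg.norm(virtual_feats, axis=1, keepdims=True)
    template = feats.mean(axis=0)
    return template / np.linalg.norm(template)

def recognize(feat: np.ndarray, class_weights: np.ndarray) -> int:
    # Score one feature vector against every class row (normal identities
    # followed by virtual classes) by cosine similarity; return the best class.
    feat = feat / np.linalg.norm(feat)
    rows = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return int(np.argmax(rows @ feat))
```

With this arrangement both virtual-sample and normal-sample feature vectors are scored against the same classification rows, which matches the text's use of the virtual-sample feature templates during recognition of both sample types.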
As shown in fig. 2, public data images (such as celebrity images) are collected from public websites, and the original model for face recognition is obtained by training a convolutional neural network. The sample scene images of each scene are then divided into normal sample images and virtual sample images, and the original model is trained with the normal sample images and the normal sample labels, thereby fine-tuning the original model and obtaining the initial model of the teacher model corresponding to each scene. Then, feature extraction is performed on the virtual sample images by the initial model to obtain virtual sample feature vectors, the average vector of the virtual sample feature vectors is taken as the feature template of the corresponding scene's virtual samples, and the virtual sample face recognition results are determined based on the feature vectors of each scene's virtual samples and the feature templates corresponding to them. Meanwhile, the initial model performs feature extraction on the normal sample images to obtain normal sample feature vectors, and the normal sample face recognition results are determined based on these feature vectors. Finally, the initial model of the teacher model of each scene is trained based on the face recognition results corresponding to each scene's virtual samples, the face recognition results corresponding to each scene's normal samples, and the identity information of the faces contained in the virtual samples and the normal samples, so as to obtain the teacher model of the corresponding scene.
When the initial model of the teacher model of each scene is trained, the virtual class indexes are appended in sequence after the normal sample classes, the weight parameters of the classification layer (such as a fully connected layer) corresponding to the normal sample images are randomly initialized and updated by stochastic gradient descent, the classification-layer weights corresponding to the virtual sample images are not updated by gradients, and after each iteration the feature template corresponding to the virtual sample of the corresponding scene is used to fill the weight parameters of the virtual classes in the classification layer. The ratio of normal sample images to virtual sample images may be 3:1. In addition, the initial model of the teacher model may be trained with a small learning rate, and the loss function may adopt CosFace or ArcFace, so that a teacher model with obviously improved performance in the corresponding scene can be obtained.
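A minimal sketch of this classification-layer arrangement follows; the dimensions and the bare SGD step are illustrative assumptions, not the patent's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_normal, n_virtual = 8, 5, 2

# Normal-class rows are randomly initialised and trained by stochastic
# gradient descent; virtual-class rows are appended after them and are
# never updated by gradients.
W_normal = rng.normal(0.0, 0.01, (n_normal, feat_dim))
templates = rng.normal(0.0, 1.0, (n_virtual, feat_dim))
templates /= np.linalg.norm(templates, axis=1, keepdims=True)

def sgd_step(grad_normal: np.ndarray, lr: float = 1e-3) -> None:
    # Stochastic-gradient update applied to the normal-class rows only.
    global W_normal
    W_normal = W_normal - lr * grad_normal

def classification_weights() -> np.ndarray:
    # After each iteration the virtual rows are refilled from the feature
    # templates of the corresponding scene's virtual samples.
    return np.vstack([W_normal, templates])
```

Refilling the virtual rows from the templates after every iteration keeps them fixed regardless of what the optimizer does elsewhere, which is the behaviour the paragraph above describes.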
Based on any of the above embodiments, the sample scene image corresponding to each scene is determined based on the following steps:
acquiring initial sample scene images corresponding to all scenes, performing face key point detection on the initial sample scene images, and determining face detection frame images of all the initial sample scene images;
and carrying out face alignment based on the face key point information in each face detection frame image, and cutting each image after face alignment into a preset size to obtain a sample scene image corresponding to each scene.
Specifically, the initial sample scene image may be an image randomly acquired in a corresponding scene, and a face may or may not be present in the image. When a human face exists, the corresponding human face image may not be a front face image. In order to facilitate the training of the model, face keypoint detection may be performed on the initial sample scene image, and when no face exists or the face size is smaller than the target size (e.g., 60 × 60), the corresponding initial sample scene image may be filtered out.
When a face exists, the corresponding face detection frame image can be obtained, and affine transformation is performed according to the key point information (such as the eyes, nose, and mouth corners) in the face detection frame image to align the face and obtain a front face image. After the front face image is obtained, it is cut into a preset size (such as 112 × 112) to obtain the sample scene image. During face alignment, the face can be aligned into a front face through operations such as translation, rotation and scaling according to the key points.
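The size filtering and alignment steps above can be sketched as follows. This sketch estimates a similarity transform from the two eye key points only; the canonical eye coordinates inside the 112 × 112 crop are illustrative assumptions, and a real pipeline would typically fit the transform to all five key points:

```python
import numpy as np

MIN_FACE, CROP = 60, 112  # minimum accepted face size and output crop size

def keep_face(box_w: int, box_h: int) -> bool:
    # Filter rule from the text: discard faces smaller than the target size.
    return box_w >= MIN_FACE and box_h >= MIN_FACE

def eye_alignment(left_eye, right_eye,
                  dst_left=(38.0, 52.0), dst_right=(74.0, 52.0)) -> np.ndarray:
    # 2x3 similarity transform (rotation + scale + translation) mapping the
    # detected eye key points onto canonical positions in the 112x112 crop.
    src_vec = np.asarray(right_eye, float) - np.asarray(left_eye, float)
    dst_vec = np.asarray(dst_right, float) - np.asarray(dst_left, float)
    scale = np.linalg.norm(dst_vec) / np.linalg.norm(src_vec)
    angle = np.arctan2(dst_vec[1], dst_vec[0]) - np.arctan2(src_vec[1], src_vec[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    t = np.asarray(dst_left, float) - rot @ np.asarray(left_eye, float)
    return np.hstack([rot, t[:, None]])
```

The returned 2 × 3 matrix is the standard input format for an affine image warp (e.g., OpenCV's `warpAffine`), after which the image is cropped to 112 × 112.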
As shown in fig. 3, face key point detection is performed on the initial sample scene image to determine whether a face meets the requirements; if a face exists and its size meets the requirements, affine transformation is performed according to the key point information to achieve face alignment, and the aligned face image is cut into a fixed size.
Based on any of the above embodiments, the face recognition model is obtained by training based on the following steps:
determining an initial model fusing a plurality of scenes;
and taking the initial model fused with the plurality of scenes as a student model, and carrying out distillation training on the student model based on the sample image and the face recognition result of each scene output by the teacher model corresponding to each scene to obtain the face recognition model.
Specifically, when distillation training is performed with the teacher models corresponding to the scenes, two feasible schemes are available. One is one-to-one distillation training between teacher and student models: distillation training is performed on a student model separately for the teacher model of each scene to obtain a student model for each scene, and the student models of the scenes are then fused and compressed to obtain the face recognition model. The other is to directly perform joint distillation training on the teacher models of a plurality of scenes and compress them to obtain the face recognition model.
Considering that the first scheme may cause knowledge loss during single-task distillation, and that this loss propagates to the fusion and compression stage of the plurality of student models as cascaded loss, the embodiment of the invention prefers the second scheme, namely directly compressing and fusing the teacher models of the plurality of scenes into one face recognition model.
Based on any of the above embodiments, based on the sample image and the face recognition result of each scene output by the teacher model corresponding to each scene, the distillation training is performed on the student model to obtain a face recognition model, including:
inputting the sample scene images of all scenes in the sample images into the teacher model of the corresponding scene to obtain face recognition results of all scenes output by the teacher model of all scenes;
inputting sample scene images of all scenes in the sample images into a student model to obtain student identification results of all scenes output by the student model;
determining a loss function of the student model based on the recognition result of the student in each scene and the face recognition result of each scene;
and training the student model based on the loss function of the student model to obtain a face recognition model.
As shown in fig. 4, teacher models of a plurality of scenes (e.g., n teacher models) may be set according to actual needs. The corresponding sample scene images are respectively input to the teacher models of the corresponding scenes, teacher feature vectors of the corresponding scenes are extracted, and the scene face recognition results output by the teacher models of the respective scenes are determined based on the teacher feature vectors; these scene face recognition results are used to guide the training of the student model. Meanwhile, the sample scene image of the corresponding scene is input into the student model, student feature vectors are extracted, and a student recognition result is determined based on the student feature vectors. A loss function is then determined based on the student recognition result and the face recognition result of the corresponding scene, the model is optimized by training on the loss function, and the face recognition model is obtained. The loss function may be determined based on the Mean Squared Error (MSE) between the student recognition result and each scene face recognition result output by each scene teacher model.
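As a sketch, the joint multi-teacher MSE objective described above could look like the following (pure numpy; the per-scene batching and function name are assumptions for illustration):

```python
import numpy as np

def joint_distillation_loss(student_feats, teacher_feats):
    # Sum, over the n scene teachers, of the mean squared error between the
    # student's feature output and the corresponding teacher's output on
    # that scene's batch of sample images.
    return sum(float(np.mean((s - t) ** 2))
               for s, t in zip(student_feats, teacher_feats))
```

The student is thus pulled toward every scene teacher simultaneously, which is what lets a single fused model absorb knowledge from all scenes without the cascaded loss of a per-scene distillation stage.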
The following describes the face recognition apparatus provided by the present invention, and the face recognition apparatus described below and the face recognition method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, the present invention further provides a face recognition apparatus, as shown in fig. 5, the apparatus includes:
a determination unit 510 for determining an image to be recognized;
the recognition unit 520 is configured to input the image to be recognized into a face recognition model fusing multiple scenes, and obtain a face recognition result output by the face recognition model;
the face recognition model is obtained by performing distillation training on the basis of the sample image and face recognition results of all scenes output by the teacher model corresponding to all scenes.
Based on any of the above embodiments, the apparatus further comprises:
a sample determining unit, configured to determine a sample scene image set corresponding to each scene, wherein the set comprises a plurality of sample scene images;
the classification unit is used for determining a normal sample and a virtual sample corresponding to each scene from a sample scene image set corresponding to each scene based on the number of times of appearance of characters contained in each sample scene image corresponding to each scene in the set;
and the original model training unit is used for training the original model of the teacher model based on the normal sample and the virtual sample corresponding to each scene and the identity information of the face contained in the normal sample and the virtual sample to obtain the teacher model of each scene.
Based on any embodiment, the classification unit is configured to:
when the number of times of appearance of the characters in the set, which are contained in each sample scene image corresponding to each scene, is more than or equal to a threshold value, taking the corresponding sample scene image as a normal sample of the corresponding scene; and when the number of times of appearance of the characters in the set, which are contained in each sample scene image corresponding to each scene, is less than the threshold value, taking the corresponding sample scene image as a virtual sample of the corresponding scene.
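The threshold split performed by the classification unit can be sketched as follows; representing each sample as an `(image_id, person_id)` pair is an assumed encoding for illustration:

```python
from collections import Counter

def split_samples(samples, threshold):
    # Count how often each person appears in the scene's sample set; images of
    # people appearing at least `threshold` times become normal samples, the
    # rest become virtual samples.
    counts = Counter(person for _, person in samples)
    normal = [s for s in samples if counts[s[1]] >= threshold]
    virtual = [s for s in samples if counts[s[1]] < threshold]
    return normal, virtual
```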
Based on any of the above embodiments, the original model training unit includes:
the initial model training unit is used for training an original model of the teacher model based on a normal sample corresponding to each scene and identity information of faces contained in the normal sample to obtain an initial model corresponding to the teacher model of each scene;
the virtual sample identification unit is used for inputting the virtual sample corresponding to each scene into the initial model corresponding to the teacher model of each scene to obtain the face identification result corresponding to the virtual sample of each scene output by the initial model corresponding to the teacher model of each scene;
the normal sample identification unit is used for inputting the normal samples of all the scenes into the initial models corresponding to the teacher model of all the scenes to obtain face identification results corresponding to the normal samples of all the scenes, and the face identification results are output by the initial models corresponding to the teacher model of all the scenes;
and the teacher model training unit is used for training the initial model corresponding to the teacher model of each scene based on the face recognition result corresponding to each scene virtual sample, the face recognition result corresponding to each scene normal sample, the identity information of each scene virtual sample containing the face and the identity information of each scene normal sample containing the face to obtain the teacher model of each scene.
Based on any of the above embodiments, the virtual sample identification unit includes:
the first feature extraction unit is used for inputting the virtual samples corresponding to the scenes into a feature extraction layer in the initial model corresponding to the teacher model of the scenes to obtain feature vectors of the virtual samples of the scenes output by the feature extraction layer and feature templates corresponding to the virtual samples of the scenes; the characteristic template corresponding to each scene virtual sample is an average vector of the characteristic vectors of the virtual samples of each scene;
the first identification unit is used for inputting the feature vectors of the virtual samples of each scene and the feature templates corresponding to the virtual samples of each scene into a face identification layer in an initial model corresponding to a teacher model of each scene to obtain a face identification result corresponding to the virtual samples of each scene output by the face identification layer;
the normal sample recognition unit includes:
the second feature extraction unit is used for inputting the normal samples corresponding to the scenes into the feature extraction layer in the initial model corresponding to the teacher model of the scenes to obtain feature vectors of the normal samples of the scenes output by the feature extraction layer;
and the second identification unit is used for inputting the feature vectors of the normal samples of each scene into the face identification layer in the initial model corresponding to the teacher model of each scene to obtain the face identification result corresponding to the normal samples of each scene output by the face identification layer.
Based on any embodiment above, the apparatus further comprises:
the face detection unit is used for acquiring initial sample scene images corresponding to all scenes, detecting face key points of the initial sample scene images and determining face detection frame images of all the initial sample scene images;
and the face alignment unit is used for carrying out face alignment on the basis of the face key point information in each face detection frame image, cutting each image after face alignment into a preset size, and obtaining a sample scene image corresponding to each scene.
Based on any of the above embodiments, the apparatus further comprises:
an initial model determining unit for determining an initial model fusing a plurality of scenes;
and the face recognition model training unit is used for taking the initial model fused with the plurality of scenes as a student model, and carrying out distillation training on the student model based on the sample image and the face recognition result of each scene output by the teacher model corresponding to each scene to obtain the face recognition model.
Based on any of the above embodiments, the face recognition model training unit includes:
the first input unit is used for inputting the sample scene images of all scenes in the sample images into the teacher model of the corresponding scene to obtain face recognition results of all scenes output by the teacher model of all scenes;
the second input unit is used for inputting the sample scene images of all scenes in the sample images into the student model to obtain student identification results of all scenes output by the student model;
the loss function determining unit is used for determining a loss function of the student model based on the recognition result of each scene student and the face recognition result of each scene;
and the training subunit is used for training the student model based on the loss function of the student model to obtain the face recognition model.
Fig. 6 is a schematic structural diagram of an electronic device provided in the present invention, and as shown in fig. 6, the electronic device may include: a processor (processor)610, a memory (memory)620, a communication Interface (Communications Interface)630 and a communication bus 640, wherein the processor 610, the memory 620 and the communication Interface 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 620 to perform a face recognition method comprising: determining an image to be recognized; inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model; the face recognition model is obtained by performing distillation training on the basis of the sample image and face recognition results of all scenes output by the teacher model corresponding to all scenes.
In addition, the logic instructions in the memory 620 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the face recognition method provided by the above methods, the method including: determining an image to be recognized; inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model; the face recognition model is obtained by performing distillation training based on the sample image and the face recognition results of each scene output by the teacher model corresponding to each scene.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the face recognition method provided above, the method including: determining an image to be recognized; inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model; the face recognition model is obtained by performing distillation training based on the sample image and the face recognition results of each scene output by the teacher model corresponding to each scene.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solutions, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A face recognition method, comprising:
determining an image to be recognized;
inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model;
the face recognition model is obtained by performing distillation training on the basis of the sample image and face recognition results of all scenes output by the teacher model corresponding to all scenes;
the teacher model corresponding to each scene is obtained by training based on the following steps:
determining a sample scene image set corresponding to each scene, wherein the set comprises a plurality of sample scene images;
determining a normal sample and a virtual sample corresponding to each scene from a sample scene image set corresponding to each scene based on the number of times of appearance of characters contained in each sample scene image corresponding to each scene in the set;
training an original model of a teacher model based on a normal sample and a virtual sample corresponding to each scene and identity information of faces contained in the normal sample and the virtual sample to obtain the teacher model of each scene.
2. The method according to claim 1, wherein the determining a normal sample and a virtual sample corresponding to each scene from a sample scene image set corresponding to each scene based on the number of times that a person included in each sample scene image corresponding to each scene appears in the set comprises:
when the number of times of appearance of the characters in the set, which are contained in each sample scene image corresponding to each scene, is more than or equal to a threshold value, taking the corresponding sample scene image as a normal sample of the corresponding scene; and when the number of times of appearance of the characters in the set, which are contained in each sample scene image corresponding to each scene, is less than the threshold value, taking the corresponding sample scene image as a virtual sample of the corresponding scene.
3. The method of claim 1, wherein the training an original model of a teacher model based on a normal sample and a virtual sample corresponding to each scene, and the identity information that the normal sample and the virtual sample contain a human face, to obtain the teacher model of each scene comprises:
training an original model of the teacher model based on a normal sample corresponding to each scene and identity information containing a face in the normal sample to obtain an initial model corresponding to the teacher model of each scene;
inputting the virtual sample corresponding to each scene into the initial model corresponding to the teacher model of each scene to obtain a face recognition result corresponding to each scene virtual sample output by the initial model corresponding to the teacher model of each scene;
inputting the normal samples of each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to the normal samples of each scene output by the initial model corresponding to the teacher model of each scene;
training an initial model corresponding to the teacher model of each scene based on the face recognition result corresponding to each scene virtual sample, the face recognition result corresponding to each scene normal sample, the identity information of each scene virtual sample containing the face and the identity information of each scene normal sample containing the face to obtain the teacher model of each scene.
4. The method according to claim 3, wherein the step of inputting the virtual sample corresponding to each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to each scene virtual sample output by the initial model corresponding to the teacher model of each scene comprises:
inputting the virtual samples corresponding to the scenes into a feature extraction layer in an initial model corresponding to a teacher model of each scene to obtain feature vectors of the virtual samples of each scene output by the feature extraction layer and feature templates corresponding to the virtual samples of each scene; the characteristic template corresponding to each scene virtual sample is the average vector of the characteristic vectors of the virtual samples of each scene;
inputting the feature vectors of the virtual samples of each scene and the feature templates corresponding to the virtual samples of each scene into a face recognition layer in an initial model corresponding to a teacher model of each scene to obtain a face recognition result corresponding to the virtual samples of each scene output by the face recognition layer;
the inputting the normal samples of each scene into the initial model corresponding to the teacher model of each scene to obtain the face recognition result corresponding to the normal samples of each scene output by the initial model corresponding to the teacher model of each scene includes:
inputting the normal sample corresponding to each scene into a feature extraction layer in the initial model corresponding to the teacher model of each scene to obtain the feature vector of the normal sample of each scene output by the feature extraction layer;
and inputting the characteristic vectors of the normal samples of each scene into a face recognition layer in the initial model corresponding to the teacher model of each scene to obtain a face recognition result corresponding to the normal samples of each scene output by the face recognition layer.
5. The face recognition method of claim 1, wherein the sample scene image corresponding to each scene is determined based on the steps of:
acquiring initial sample scene images corresponding to all scenes, performing face key point detection on the initial sample scene images, and determining face detection frame images of all the initial sample scene images;
and carrying out face alignment based on the face key point information in each face detection frame image, and cutting each image after face alignment into a preset size to obtain a sample scene image corresponding to each scene.
6. The face recognition method according to any one of claims 1 to 5, wherein the face recognition model is trained based on the following steps:
determining an initial model fusing a plurality of scenes;
and taking the initial model fusing the plurality of scenes as a student model, and performing distillation training on the student model based on the sample images and the face recognition results of each scene output by the teacher model corresponding to each scene, to obtain the face recognition model.
7. The method according to claim 6, wherein performing distillation training on the student model based on the sample images and the face recognition results of each scene output by the teacher model corresponding to each scene to obtain the face recognition model comprises:
inputting the sample scene images of each scene in the sample images into the teacher model of the corresponding scene to obtain the face recognition results of each scene output by the teacher models;
inputting the sample scene images of each scene in the sample images into the student model to obtain the student recognition results of each scene output by the student model;
determining a loss function of the student model based on the student recognition results of each scene and the face recognition results of each scene;
and training the student model based on the loss function of the student model to obtain the face recognition model.
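The claims leave the distillation loss itself unspecified. One common choice, sketched here purely as an assumption, is a temperature-softened cross-entropy between the teacher's and student's per-scene outputs, summed over scenes:

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T: float = 4.0) -> float:
    """Soft-label cross-entropy between teacher and student for one scene."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean() * T * T)

def multi_scene_loss(per_scene_pairs) -> float:
    """Sum per-scene losses; one (student, teacher) logit pair per scene."""
    return sum(distillation_loss(s, t) for s, t in per_scene_pairs)
```

In the patent's setting each scene's teacher contributes one term, so the student is pulled toward all scene-specific teachers at once; the exact weighting of scenes is not specified in the claims.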
8. A face recognition apparatus, comprising:
a determination unit for determining an image to be recognized;
the recognition unit is used for inputting the image to be recognized into a face recognition model fusing a plurality of scenes to obtain a face recognition result output by the face recognition model;
the face recognition model is obtained by performing distillation training on the basis of the sample image and face recognition results of all scenes output by the teacher model corresponding to all scenes;
further comprising:
a sample determination unit for determining a sample scene image set corresponding to each scene, the set comprising a plurality of sample scene images;
a classification unit for determining the normal samples and virtual samples corresponding to each scene from the sample scene image set of that scene, based on the number of times each person contained in the sample scene images appears in the set;
and an original model training unit for training the original model of the teacher model based on the normal samples and virtual samples corresponding to each scene and the identity information of the faces they contain, to obtain the teacher model of each scene.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the face recognition method according to any one of claims 1 to 7 are implemented when the processor executes the program.
Publications (2)

Publication Number Publication Date
CN113947801A CN113947801A (en) 2022-01-18
CN113947801B true CN113947801B (en) 2022-07-26

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311464B (en) * 2023-03-24 2023-12-12 北京的卢铭视科技有限公司 Model training method, face recognition method, electronic device and storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
CN101464950A (en) * 2009-01-16 2009-06-24 北京航空航天大学 Video human face identification and retrieval method based on on-line learning and Bayesian inference
CN110674688A (en) * 2019-08-19 2020-01-10 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
CN111259738A (en) * 2020-01-08 2020-06-09 科大讯飞股份有限公司 Face recognition model construction method, face recognition method and related device
CN112115783A (en) * 2020-08-12 2020-12-22 中国科学院大学 Human face characteristic point detection method, device and equipment based on deep knowledge migration
CN112183492A (en) * 2020-11-05 2021-01-05 厦门市美亚柏科信息股份有限公司 Face model precision correction method, device and storage medium
CN113361384A (en) * 2021-06-03 2021-09-07 深圳前海微众银行股份有限公司 Face recognition model compression method, device, medium, and computer program product
CN113361710A (en) * 2021-06-29 2021-09-07 北京百度网讯科技有限公司 Student model training method, picture processing device and electronic equipment
CN113807214A (en) * 2021-08-31 2021-12-17 中国科学院上海微***与信息技术研究所 Small target face recognition method based on deit attached network knowledge distillation


Non-Patent Citations (3)

Title
Enhanced Knowledge Distillation for Face Recognition; Hao Ni, Jie Shen, Chong Yuan; IEEE; 2019-11-18; full text *
Structured Attention Knowledge Distillation for Lightweight Networks; Gu Xiaowei et al.; 2021 33rd Chinese Control and Decision Conference (CCDC); 2021-03-24; full text *
Efficient face recognition algorithm for mobile terminals; Wei Biao et al.; Modern Computer; 2020-05-06 (No. 7); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant