CN113850243A - Model training method, face recognition method, electronic device and storage medium

Info

Publication number
CN113850243A
CN113850243A
Authority
CN
China
Prior art keywords
face recognition
crowd
model
image
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111438776.8A
Other languages
Chinese (zh)
Inventor
胡长胜
浦煜
付贤强
何武
朱海涛
户磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dilusense Technology Co Ltd
Hefei Dilusense Technology Co Ltd
Original Assignee
Beijing Dilusense Technology Co Ltd
Hefei Dilusense Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dilusense Technology Co Ltd, Hefei Dilusense Technology Co Ltd filed Critical Beijing Dilusense Technology Co Ltd
Priority to CN202111438776.8A priority Critical patent/CN113850243A/en
Publication of CN113850243A publication Critical patent/CN113850243A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to the field of face recognition, and discloses a model training method, a face recognition method, an electronic device and a storage medium. The model training comprises the following steps: acquiring image samples containing faces under different crowd categories, and labeling the category labels of the crowd categories to which the image samples belong; performing original training on a pre-established face recognition model based on triplet samples constructed from the image samples to obtain a trained face recognition model; taking the feature map output by any layer of the feature extraction network in the face recognition model as input, adding a crowd category branch network to form an intermediate model; and training the crowd category branch network based on the image samples and their category labels to obtain the trained intermediate model. The scheme effectively addresses the high rejection rate and high false recognition rate caused by crowd category differences, which existing face recognition products using a single face recognition algorithm cannot handle well.

Description

Model training method, face recognition method, electronic device and storage medium
Technical Field
The present invention relates to the field of face recognition, and in particular, to a model training method, a face recognition method, an electronic device, and a storage medium.
Background
With the popularization of face recognition technology in daily life, the application scenes it faces are increasingly complex. Traditional face recognition methods based on a single scene and a single group have exposed their limitations. When the population in the application scenario contains multiple races (yellow, white, black, etc.) or multiple age groups (children, adults, the elderly), there are two common solutions:
the first scheme is as follows: different face recognition algorithms are trained aiming at different crowd categories, and then a specific face recognition model and a recognition threshold are selected according to an estimation model with an evaluation classification function (such as ethnicity classification or age group classification) to complete face recognition.
Scheme II: the differences among the crowd categories are not distinguished, and the whole recognition model is adopted for face recognition.
However, both of the above solutions have some drawbacks:
although the first scheme can well solve the problem of face recognition of different crowd categories under complex scenes on the basis of methods and results, the whole system is complex, the algorithm development workload is large, and the requirements on computing power and storage of a hardware platform are high.
In the second scheme, for the weight type recognition model, due to the capability of the model (capacity), the weight type recognition model is trained together for various crowd categories during training, and the overall recognition effect of the weight type recognition model can possibly meet the scene requirement. However, in essence, due to the specificity of the various types of population, the corresponding feature spaces of the various types of population, whether they are weight models or light models, cannot be completely overlapped and overlap each other to some extent. Therefore, if the identification is directly performed without distinguishing, the experience on a specific population is poor, such as higher rejection rate or higher false recognition rate. Meanwhile, due to the inherent defect of overlapping of feature spaces, after a traditional machine learning method is used for clustering features or an additional evaluation classification model is used for distinguishing groups, although a part of conditions of refusal or false recognition can be improved, all scene requirements still cannot be met, and particularly the face recognition system with extremely high requirements on safety is deployed in a financial scene or a face recognition door lock scene.
Disclosure of Invention
The embodiment of the invention aims to provide a model training method, a face recognition method, an electronic device and a storage medium, which effectively address the high rejection rate or high false recognition rate caused by crowd category differences, a problem that current face recognition products using a single face recognition algorithm cannot handle well.
In order to solve the above technical problem, an embodiment of the present invention provides a model training method, including:
acquiring image samples containing faces under different crowd categories, and labeling category labels of the crowd categories to which the image samples belong;
performing original training on a pre-established face recognition model based on the triple sample constructed by the image sample to obtain a trained face recognition model;
taking a feature map output by any layer of the feature extraction network in the face recognition model as input, and additionally arranging a crowd category branch network to form an intermediate model; the output of the intermediate model comprises the human face characteristics output by the human face recognition model and the crowd category output by the crowd category branch network;
and training the crowd category branch network based on the image sample and the category label of the image sample to obtain the intermediate model after training.
The embodiment of the invention provides a face recognition method, which comprises the following steps:
performing face recognition on a face image to be recognized by using an intermediate model trained by the model training method to obtain face features output by the face recognition model in the intermediate model and a crowd category output by the crowd category branch network;
and performing similarity comparison between the face features output by the face recognition model and the face features in the registered feature library one by one, using the similarity threshold, among the preset similarity thresholds, that corresponds to the crowd category recognized by the intermediate model, and determining the identity information of the face in the face image to be recognized.
An embodiment of the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a model training method as described above, or a face recognition method as described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements a model training method as described above, or a face recognition method as described above.
Compared with the prior art, the method and the device have the advantages that image samples containing faces under different crowd categories are collected, and category labels of the crowd categories to which the image samples belong are labeled; performing original training on a pre-built face recognition model based on a triple sample constructed by an image sample to obtain a trained face recognition model; taking a feature map output by any layer of a feature extraction network in a face recognition model as input, and additionally arranging a crowd category branch network to form an intermediate model; the output of the intermediate model comprises the human face characteristics output by the human face recognition model and the crowd category output by the crowd category branch network; training a crowd category branch network based on the image sample and the category label of the image sample to obtain a trained intermediate model. The intermediate model trained in the scheme can simultaneously obtain the face characteristics of the face image to be recognized and the crowd category to which the face to be recognized belongs. During subsequent feature comparison, a similarity threshold corresponding to the identified crowd category in preset similarity thresholds can be adopted to perform similarity comparison on the face features output by the face identification model and the face features in the registered feature library one by one to determine the identity information of the face in the face image to be identified, so that the problems of high rejection rate or high false identification rate caused by crowd category difference are well solved, and the accuracy of face identification is improved.
Drawings
FIG. 1 is a first flowchart illustrating a first embodiment of a model training method according to the present invention;
FIG. 2 is a schematic structural diagram of an intermediate model according to an embodiment of the invention;
FIG. 3 is a detailed flowchart II of a model training method according to an embodiment of the present invention;
FIG. 4 is a detailed flow chart of a face recognition method according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can be implemented without these technical details, and various changes and modifications may be made based on the following embodiments.
An embodiment of the present invention relates to a model training method, and as shown in fig. 1, the model training method provided in this embodiment includes the following steps.
Step 101: the method comprises the steps of collecting image samples containing faces under different crowd categories, and labeling category labels of the crowd categories to which the image samples belong.
The different crowd categories are a plurality of crowd categories divided along a certain dimension, such as a plurality of age categories divided by age group or a plurality of race categories divided by race.
Specifically, image samples of faces under the different crowd categories divided along that dimension are collected, and the category labels of the crowd categories to which the image samples belong are labeled. For example, three age categories may be divided by the age group to which the face belongs: children, adults and the elderly; the corresponding category labels are 0, 1 and 2, where 0 represents children, 1 represents adults and 2 represents the elderly. For another example, race categories may be divided by the race to which the face belongs: the yellow race, the white race, the black race and so on; the corresponding category labels are 0, 1, 2, …, where 0 represents the yellow race, 1 represents the white race, 2 represents the black race, and so on.
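As a minimal illustration of this labeling step (the category names, label values and helper function below are assumptions for illustration, not prescribed by this embodiment), the mapping could look like:

```python
# Hypothetical label mappings for the two example dimensions above.
AGE_LABELS = {"child": 0, "adult": 1, "elderly": 2}
RACE_LABELS = {"yellow": 0, "white": 1, "black": 2}  # extendable with more categories

def label_sample(image_path, crowd_category, labels=AGE_LABELS):
    """Pair an image sample with the category label of its crowd category."""
    return image_path, labels[crowd_category]

print(label_sample("face_0001.png", "adult"))  # -> ('face_0001.png', 1)
```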
In order to avoid bias in the trained face recognition model (for example, better robustness for the middle-aged population and poorer robustness for the elderly and child populations), an existing face recognition model is used as a pre-training model, the triplet-loss sampling strategy is modified, and the overall data balance is maintained. In one example, the process of acquiring image samples containing faces under different crowd categories may satisfy the following acquisition strategy.
In each round of sampling: the image samples cover all crowd categories, and the numbers of image samples in the different crowd categories are balanced; the numbers of image samples belonging to different persons are balanced; and for the same person, the numbers of image samples under the preset multiple application scenes are balanced.
Balance across crowd categories means that the number of image samples drawn in each round is balanced over the whole dimension (such as the age dimension or the race dimension). For example, for the age dimension, the number of samples of each age category is required to be balanced or to fit a normal-like distribution (e.g., child:adult:elderly = 2:6:2).
Balance across different persons means that the numbers of image samples drawn for different people are balanced overall, so as to avoid a long-tail effect in the data.
Balance across application scenes means ensuring that image data of each person in each scene (such as indoor and outdoor scenes) is acquired uniformly within one round of sampling, so that the application scenes are as balanced as possible.
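The following sketch shows one way such an acquisition strategy might be implemented; the sample layout (`category`, `person_id`, `scene` fields) and the batch-quota logic are assumptions for illustration, not the patent's prescribed procedure:

```python
import random
from collections import defaultdict

def balanced_batch(samples, ratios, batch_size=64):
    """Draw one batch whose crowd-category mix follows `ratios`
    (e.g. child:adult:elderly = 2:6:2), spreading picks evenly over
    identities within each category to avoid a long-tail effect.

    `samples` is a list of dicts: {"path", "category", "person_id", "scene"};
    every category in `ratios` is assumed to be present in the data.
    """
    by_cat = defaultdict(lambda: defaultdict(list))  # category -> person -> samples
    for s in samples:
        by_cat[s["category"]][s["person_id"]].append(s)

    total = sum(ratios.values())
    batch = []
    for cat, weight in ratios.items():
        quota = round(batch_size * weight / total)   # per-category quota
        persons = list(by_cat[cat].values())
        for i in range(quota):
            person = persons[i % len(persons)]       # round-robin over identities
            batch.append(random.choice(person))      # scene balance would be
                                                     # enforced analogously
    random.shuffle(batch)
    return batch
```

A usage example for the 2:6:2 age mix mentioned above would be `balanced_batch(all_samples, {"child": 2, "adult": 6, "elderly": 2})`.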
Step 102: and performing original training on the pre-established face recognition model based on the triple samples constructed by the image samples to obtain the trained face recognition model.
In particular, triplet samples are constructed based on the acquired image samples. In general, a triplet sample comprises an anchor (a), a positive (p) and a negative (n), where a and p are samples of the same class (corresponding to the same person), and a and n are samples of different classes (corresponding to different people). When training the face recognition model on the triplet samples, the triplet loss L is calculated with the following triplet loss function:

$$ L = \sum_{i=1}^{N} \left[ D\big(f(x_i^a), f(x_i^p)\big) - D\big(f(x_i^a), f(x_i^n)\big) + margin \right]_+ \tag{1} $$

where $x_i^a$ is the anchor sample of the $i$-th triplet image sample, $x_i^p$ is the positive sample of the $i$-th triplet image sample, $x_i^n$ is the negative sample of the $i$-th triplet image sample, $f(\cdot)$ is the feature vector calculated by the face recognition model, $N$ is the total number of triplet image samples, $D\big(f(x_i^a), f(x_i^p)\big)$ is the Euclidean distance measure between the positive and anchor samples, and $D\big(f(x_i^a), f(x_i^n)\big)$ is the Euclidean distance measure between the negative and anchor samples. $margin$ is a margin parameter, and the subscript "+" means that when the value inside the brackets is greater than zero it is taken as the loss, and when it is not greater than zero the loss is 0.
Here, the cosine distance may be used instead of the Euclidean distance; the overall difference is small.
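A minimal PyTorch sketch of equation (1) follows, assuming (N, D) feature batches; the squared Euclidean distance and the margin value 0.2 are common choices for this kind of loss, not values fixed by this embodiment:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Equation (1): sum over N triplets of [d(a,p) - d(a,n) + margin]_+.

    f_a, f_p, f_n: (N, D) feature vectors produced by the face
    recognition model for anchor, positive and negative samples.
    """
    d_ap = (f_a - f_p).pow(2).sum(dim=1)        # distance anchor <-> positive
    d_an = (f_a - f_n).pow(2).sum(dim=1)        # distance anchor <-> negative
    return F.relu(d_ap - d_an + margin).sum()   # [.]_+ clamps negatives to zero

# Cosine-distance variant mentioned in the text:
#   d_ap = 1 - F.cosine_similarity(f_a, f_p)
#   d_an = 1 - F.cosine_similarity(f_a, f_n)
```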
In this step, the pre-built face recognition model is trained with the basic triplet loss function, and training stops after convergence, yielding the trained face recognition model. To distinguish it from the subsequent training of the face recognition model, the training process in this step is called the original training.
Step 103: taking a feature map output by any layer of a feature extraction network in a face recognition model as input, and additionally arranging a crowd category branch network to form an intermediate model; the output of the intermediate model comprises the face characteristics output by the face recognition model and the crowd category output by the crowd category branch network.
Specifically, in this embodiment, the feature map output by any layer of the feature extraction network in the face recognition model is taken as input, a crowd category branch network is added, and the face recognition model with the added branch is recorded as the intermediate model. The feature extraction network is mainly responsible for extracting feature maps of the face image at different depth levels during recognition. For example, the feature extraction network may be a convolutional neural network comprising a plurality of convolutional layers, and the feature map output by any one of them (preferably a convolutional layer in a non-head, non-tail position) is used as the input of the crowd category branch network. The crowd category branch network performs crowd category division for the recognized face, outputting the crowd category to which it belongs. The output of the intermediate model therefore comprises two parts: the face features output by the face recognition model and the crowd category output by the crowd category branch network.
In an example, the face recognition model may adopt a residual error network ResNet50 structure, and accordingly, a crowd category branch network is added by taking a feature map output by any layer in a feature extraction network in the face recognition model as an input, and a process of forming an intermediate model may be implemented by the following steps.
Step 1: selecting the feature map output by the 2nd residual block of the conv5_x layer in the residual network ResNet50 structure as the input of the crowd category branch network.
Specifically, fig. 2 shows an exemplary structure of the intermediate model in this embodiment; in practical application, the structure of the intermediate model may also be set flexibly according to actual requirements. In fig. 2, the network structure in the left backbone from x (the face image) to Feature (the face feature vector) is the face recognition model, which adopts a ResNet50 structure comprising: a conv5_x layer, a first 1x1 convolutional layer (Conv_1x1), a second global pooling layer (Global Pooling) and a second fully connected layer (FC) sequentially connected in series. The feature map output by the 2nd residual block (Residual Block_1) of the conv5_x layer is taken as the input of the crowd category branch network. The feature map size at this point is 7x7 with Channel = 512.
Step 2: constructing a crowd category branch network by adopting a residual block, a first global pooling layer and a first full-connection layer which are connected in series from front to back; the input of the residual block is used as the input of the crowd category branch network, and the output of the first full connection layer is used as the output of the crowd category branch network.
As shown in fig. 2, the crowd category branch network comprises a residual block (Residual Block_Age), a first global pooling layer (Global Pooling) and a first fully connected layer (FC) connected in series in sequence. In practical application, the structure of the crowd category branch network can also be set flexibly according to actual requirements.
Specifically, the feature map output by the 2nd residual block (Residual Block_1) of the conv5_x layer is taken as the input of the crowd category branch network; its size at this point is 7x7 with Channel = 512. Inside the branch, a new residual block structure (Residual Block_Age) is connected first, whose output channel count can be set to 128 (adjustable according to actual conditions); then a global pooling layer changes the feature map from 7x7 to 1x1; finally, a fully connected layer (FC) whose output dimension equals the number of crowd categories serves as the output of the branch network. Fig. 2 shows a crowd category branch network built for the age-group dimension, so the output dimension is 3 (children, adults, the elderly).
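A PyTorch-style sketch of the branch in fig. 2 follows; the internals of Residual Block_Age are simplified (no skip connection or downsampling), so this illustrates the shape flow 7x7x512 -> 7x7x128 -> 1x1x128 -> num_categories rather than the exact block of the embodiment:

```python
import torch
import torch.nn as nn

class CrowdBranch(nn.Module):
    """Crowd category branch: residual-style block -> global pooling -> FC."""

    def __init__(self, in_ch=512, mid_ch=128, num_categories=3):
        super().__init__()
        self.block = nn.Sequential(              # simplified Residual Block_Age
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # first global pooling: 7x7 -> 1x1
        self.fc = nn.Linear(mid_ch, num_categories)  # first fully connected layer

    def forward(self, feat_map):                 # feat_map: (N, 512, 7, 7)
        x = self.pool(self.block(feat_map)).flatten(1)
        return self.fc(x)                        # logits: (N, num_categories)

# Smoke test with a dummy conv5_x feature map:
logits = CrowdBranch()(torch.randn(2, 512, 7, 7))  # -> shape (2, 3)
```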
Step 104: training a crowd category branch network based on the image sample and the category label of the image sample to obtain a trained intermediate model.
Specifically, training the newly added crowd category branch network is simple: the face recognition model trained in step 102, i.e. the backbone network, is frozen, and only the newly added branch is trained, using softmax loss. Because the number of crowd categories is relatively small, convergence is fast; training of the crowd category branch network ends once it has fully converged, finally yielding the trained intermediate model.
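A sketch of this freeze-and-train step, assuming the hypothetical names `backbone` (the trained face recognition model, exposing an `intermediate_features` helper that returns the conv5_x feature map — an assumption, not an API of any particular library) and `branch` (the crowd category branch above):

```python
import torch
import torch.nn as nn

def train_branch(backbone, branch, loader, epochs=5, lr=1e-3):
    """Freeze the backbone and train only the crowd category branch
    with softmax (cross-entropy) loss, as described above."""
    backbone.eval()
    for p in backbone.parameters():              # freeze the trunk
        p.requires_grad = False

    opt = torch.optim.SGD(branch.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()            # softmax loss
    for _ in range(epochs):
        for images, labels in loader:            # labels: crowd category labels
            with torch.no_grad():                # feature map from the frozen trunk
                feat_map = backbone.intermediate_features(images)  # hypothetical hook
            loss = criterion(branch(feat_map), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```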
Compared with the related art, the embodiment collects the image samples containing the faces under different crowd categories, and labels the category labels of the crowd categories to which the image samples belong; performing original training on a pre-built face recognition model based on a triple sample constructed by an image sample to obtain a trained face recognition model; taking a feature map output by any layer of a feature extraction network in a face recognition model as input, and additionally arranging a crowd category branch network to form an intermediate model; the output of the intermediate model comprises the human face characteristics output by the human face recognition model and the crowd category output by the crowd category branch network; training a crowd category branch network based on the image sample and the category label of the image sample to obtain a trained intermediate model. The intermediate model trained in the scheme can simultaneously obtain the face characteristics of the face image to be recognized and the crowd category to which the face to be recognized belongs. During subsequent feature comparison, a similarity threshold corresponding to the identified crowd category in preset similarity thresholds can be adopted to perform similarity comparison on the face features output by the face identification model and the face features in the registered feature library one by one to determine the identity information of the face in the face image to be identified, so that the problems of high rejection rate or high false identification rate caused by crowd category difference are well solved, and the accuracy of face identification is improved.
Another embodiment of the invention relates to a model training method, shown in fig. 3, which improves on the method steps of fig. 1 in that the originally trained face recognition model is further optimized to obtain an optimized intermediate model. As shown in fig. 3, the following steps are included after step 102.
Step 105: and calculating the triple loss of the triple image samples according to the face recognition model obtained in the original training, and extracting the difficult samples from the triple image samples according to the calculation result.
In the actual model training process, if a heavyweight recognition model is used, the face recognition model obtained through the original training can already distinguish the different crowd categories well in the feature space, thanks to the large capacity of a heavyweight model. However, if the application scenario mainly uses a lightweight model, or has high face recognition security requirements, the recognition capability of the model needs to be improved further. In this embodiment, the face recognition model obtained by the original training is therefore optimized through further training.
Specifically, the triplet image samples are input into the face recognition model obtained in the original training, which outputs the feature vectors corresponding to the image samples in each triplet. The triplet loss L is then calculated with the triplet loss function shown in equation (1). When L is greater than 0, a loss occurs, and the corresponding triplet sample can be regarded as a difficult sample; when L is not greater than 0, the loss is 0, and the corresponding triplet sample can be regarded as a simple sample.
However, in actual model training, the more desirable goals are that the feature vectors corresponding to different crowd categories are distinguishable, and that the feature vectors corresponding to different persons within the same crowd category are also distinguishable.
For this reason, the present embodiment improves the conventional triplet loss function to achieve both kinds of discrimination as far as possible. Accordingly, step 105 can be realized by the following steps.
Step 1: calculate the triplet loss L for each triplet image sample by the following equation (2):

$$ L = \sum_{i=1}^{N} \left[ D\big(f(x_i^a), f(x_i^p)\big) - D\big(f(x_i^a), f(x_i^n)\big) + M \right]_+ \tag{2} $$

where $M = margin1$ when $x_i^a$ and $x_i^n$ belong to the same crowd category, and $M = margin2$ when $x_i^a$ and $x_i^n$ belong to different crowd categories. As in equation (1), $x_i^a$, $x_i^p$ and $x_i^n$ are the anchor, positive and negative samples of the $i$-th triplet image sample, $f(\cdot)$ is the feature vector calculated by the face recognition model, $N$ is the total number of triplet image samples, $D\big(f(x_i^a), f(x_i^p)\big)$ is the Euclidean distance measure between the positive and anchor samples, and $D\big(f(x_i^a), f(x_i^n)\big)$ is the Euclidean distance measure between the negative and anchor samples. $M$, $margin1$ and $margin2$ are all margin parameters, and $margin2$ is greater than $margin1$.

Specifically, when calculating the triplet loss of the $i$-th triplet image sample, the margin parameter $M$ is chosen according to the crowd categories to which the anchor sample $x_i^a$ and the negative sample $x_i^n$ belong: when the anchor and negative samples belong to the same crowd category, $M$ takes the value $margin1$; when they belong to different crowd categories, $M$ takes the value $margin2$, with $margin2$ greater than $margin1$, so that the margin between positive and negative samples across different crowd categories is larger than the margin within the same crowd category, which meets the actual requirement.
Step 2: extract the triplet image samples whose triplet loss L is greater than 0 as difficult samples.
Specifically, the triplet loss L of each triplet image sample is calculated according to equation (2), and the triplet image samples whose loss L is greater than 0 are extracted as difficult samples.
In one example, the triplet image samples with triplet loss L greater than 0 may be determined first; then, from the determined samples, first difficult samples corresponding to M = margin1 and second difficult samples corresponding to M = margin2 are extracted as the finally selected difficult samples, with the ratio of the number of first difficult samples to second difficult samples being 1:2.
Specifically, since every image sample is trained on indiscriminately in step 102, the difficult samples mined under the condition of step 2 need further subsampling. Experiments show that using more difficult samples mined with margin2 than with margin1 yields a better final training effect. Setting the ratio to 2:1 means that, in this round of triplet-loss sampling, the number of difficult samples whose anchor and negative samples belong to different crowd categories is twice the number of those whose anchor and negative samples belong to the same crowd category.
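One possible selection routine for this 1:2 mining rule (a sketch: it keeps the first qualifying indices rather than sampling at random, and assumes both pools are non-empty):

```python
import torch

def mine_hard_triplets(losses, same_category):
    """Keep triplets with loss > 0 so that those whose anchor and negative
    share a crowd category and those whose categories differ are retained
    in a 1:2 ratio, as described above.

    losses: (N,) per-triplet losses from equation (2);
    same_category: (N,) bool, True when anchor and negative share a category.
    """
    hard = losses > 0
    same_idx = torch.nonzero(hard & same_category).flatten()
    diff_idx = torch.nonzero(hard & ~same_category).flatten()
    n_same = min(len(same_idx), len(diff_idx) // 2)        # enforce 1:2
    return torch.cat([same_idx[:n_same], diff_idx[:2 * n_same]])
```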
Step 106: retrain the face recognition model obtained after the original training on the difficult samples to obtain an optimized face recognition model; the margin parameter used to calculate the triplet loss during retraining is smaller than the margin parameter used during the original training.
Specifically, the retraining in this step proceeds in the same way as the original training of the face recognition model, with two differences: the image samples used in retraining are the difficult samples obtained in step 105, and the margin parameter used to calculate the triplet loss in retraining is smaller than the one used in the original training, because training on difficult samples is stricter than training on ordinary samples.
On this basis, step 103 may accordingly comprise the following sub-steps.
Substep 1031: and (3) taking a feature map output by any layer in the feature extraction network in the optimized face recognition model as input, and additionally arranging a crowd category branch network to form an intermediate model.
Compared with the related art, this embodiment calculates the triplet loss of the triplet image samples with the face recognition model obtained in the original training and extracts difficult samples from the triplet image samples according to the calculation results; the face recognition model obtained after the original training is then retrained on the difficult samples to obtain an optimized face recognition model. Since the margin parameter used to calculate the triplet loss during retraining is smaller than that used during the original training, the recognition capability of the face recognition model is improved.
Another embodiment of the present invention relates to a face recognition method, as shown in fig. 4, which includes the following steps.
Step 201: and performing face recognition on the face image to be recognized by using the intermediate model trained by the model training method to obtain the face features output by the face recognition model in the intermediate model and the crowd categories output by the crowd category branch network.
Specifically, by using the intermediate model obtained in the above method embodiment, the face image to be recognized is subjected to face recognition, and two outputs, namely, a new face feature output by the face recognition model in the intermediate model and a crowd category output by the crowd category branch network, can be obtained.
Step 202: performing similarity comparison between the face features output by the face recognition model and the face features in the registered feature library one by one, using the similarity threshold, among the preset similarity thresholds, that corresponds to the crowd category recognized by the intermediate model, and determining the identity information of the face in the face image to be recognized.
Specifically, according to the crowd category output by the crowd category branch network in the intermediate model, the corresponding similarity threshold is selected from the preset similarity thresholds. When the face features output by the face recognition model are compared with the face features in the registered feature library one by one, whether two face features belong to the same person can be judged against the selected similarity threshold: for example, when the computed similarity is greater than the threshold, the two face features are determined to correspond to the same person; otherwise they are determined not to be the same person.
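A minimal sketch of this thresholded 1:N comparison; the threshold values and the normalized-feature gallery layout are assumptions for illustration, not values from the embodiment:

```python
import numpy as np

# Hypothetical per-category similarity thresholds (e.g. child/adult/elderly).
THRESHOLDS = {0: 0.62, 1: 0.55, 2: 0.60}

def identify(query_feat, crowd_category, gallery):
    """Compare the query feature against the registered feature library one
    by one, using the similarity threshold of the recognized crowd category.

    gallery: dict mapping identity -> L2-normalized feature vector (np.ndarray).
    """
    thr = THRESHOLDS[crowd_category]
    best_id, best_sim = None, -1.0
    for identity, feat in gallery.items():
        sim = float(np.dot(query_feat, feat))   # cosine similarity for unit vectors
        if sim > thr and sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id, best_sim                    # best_id is None if nothing passes
```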
Compared with the related art, the intermediate model obtained by the training can simultaneously obtain the face features of the face image to be recognized and the crowd category to which the face to be recognized belongs. During subsequent feature comparison, a similarity threshold corresponding to the identified crowd category in preset similarity thresholds can be adopted to perform similarity comparison on the face features output by the face identification model and the face features in the registered feature library one by one to determine the identity information of the face in the face image to be identified, so that the problems of high rejection rate or high false identification rate caused by crowd category difference are well solved, and the accuracy of face identification is improved.
Another embodiment of the invention relates to an electronic device, as shown in FIG. 5, comprising at least one processor 302; and a memory 301 communicatively coupled to the at least one processor 302; the memory 301 stores instructions executable by the at least one processor 302, and the instructions are executed by the at least one processor 302 to enable the at least one processor 302 to perform any of the method embodiments described above.
The memory 301 and the processor 302 are connected by a bus, which may comprise any number of interconnected buses and bridges linking the various circuits of the one or more processors 302 and the memory 301. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be a single element or a plurality of elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 302 is transmitted over a wireless medium through an antenna, which also receives incoming data and forwards it to the processor 302.
The processor 302 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 301 may be used to store data used by processor 302 in performing operations.
Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes any of the above-described method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware; the program is stored in a storage medium and includes several instructions to enable a device (which may be a microcontroller, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method of model training, comprising:
acquiring image samples containing faces under different crowd categories, and labeling category labels of the crowd categories to which the image samples belong;
performing original training on a pre-established face recognition model based on the triple sample constructed by the image sample to obtain a trained face recognition model;
taking a feature map output by any layer of the feature extraction network in the face recognition model as input, and additionally arranging a crowd category branch network to form an intermediate model; the output of the intermediate model comprises the human face characteristics output by the human face recognition model and the crowd category output by the crowd category branch network;
and training the crowd category branch network based on the image sample and the category label of the image sample to obtain the intermediate model after training.
2. The method of claim 1, wherein the crowd categories are age categories divided by age or ethnicity categories divided by ethnicity; the process of acquiring image samples containing faces under different crowd categories meets the following acquisition strategies:
in each round of sampling: the image samples cover all the crowd categories, and the numbers of image samples in the different crowd categories are balanced; the numbers of image samples belonging to different persons are balanced; and for the same person, the numbers of image samples under the preset multiple application scenes are balanced.
3. The method according to claim 1, wherein the original training of the pre-constructed face recognition model based on the triple sample constructed by the image sample to obtain the trained face recognition model comprises:
calculating the triple loss of the triple image samples according to the face recognition model obtained in the original training, and extracting difficult samples from the triple image samples according to the calculation result;
retraining the face recognition model obtained after the original training based on the difficult samples to obtain an optimized face recognition model; the margin parameter used for calculating the triplet loss during retraining is smaller than the margin parameter used for calculating the triplet loss during the original training;
the method for forming the intermediate model by adding the crowd classification branch network by taking the feature graph output by any layer in the feature extraction network in the face recognition model as input comprises the following steps:
and adding the crowd category branch network by taking the feature map output by any layer of the feature extraction network in the optimized face recognition model as input to form the intermediate model.
4. The method of claim 3, wherein the calculating the triplet loss of the triplet image samples for the face recognition model obtained during the original training and extracting the difficult samples from the triplet image samples according to the calculation result comprises:
calculating a triplet loss L for each of the triplet image samples by the following formula:

$$ L = \sum_{i=1}^{N} \left[ D\big(f(x_i^a), f(x_i^p)\big) - D\big(f(x_i^a), f(x_i^n)\big) + M \right]_+ $$

wherein $M = margin1$ when $x_i^a$ and $x_i^n$ belong to the same crowd category, and $M = margin2$ when $x_i^a$ and $x_i^n$ belong to different crowd categories; $x_i^a$ is the anchor sample of the $i$-th triplet image sample, $x_i^p$ is the positive sample of the $i$-th triplet image sample, $x_i^n$ is the negative sample of the $i$-th triplet image sample, $f(\cdot)$ is the feature vector calculated by the face recognition model, $N$ is the total number of triplet image samples, $D\big(f(x_i^a), f(x_i^p)\big)$ is the Euclidean distance measure between the positive and anchor samples, $D\big(f(x_i^a), f(x_i^n)\big)$ is the Euclidean distance measure between the negative and anchor samples, $M$, $margin1$ and $margin2$ are all margin parameters, and $margin2$ is greater than $margin1$;
and extracting the triplet image samples whose triplet loss L is greater than 0 as the difficult samples.
5. The method of claim 4, wherein extracting the triplet image samples whose triplet loss L is greater than 0 as the difficult samples comprises:
determining the triplet image samples whose triplet loss L is greater than 0;
extracting, from the determined triplet image samples, the first difficult samples corresponding to M = margin1 and the second difficult samples corresponding to M = margin2 as the finally extracted difficult samples, wherein the ratio of the number of the first difficult samples to the number of the second difficult samples is 1:2.
6. The method of claim 1, wherein the face recognition model adopts a residual error network ResNet50 structure, and the adding a crowd category branch network with a feature map output from any layer of a feature extraction network in the face recognition model as an input to form an intermediate model comprises:
selecting an output characteristic diagram of a 2 nd residual block of a conv5_ x layer in the residual network ResNet50 structure as the input of the crowd category branch network;
constructing the crowd category branch network by adopting a residual block, a first global pooling layer and a first full-connection layer which are connected in series from front to back; the input of the residual block is used as the input of the crowd category branching network, and the output of the first full connection layer is used as the output of the crowd category branching network.
7. The method of claim 6, wherein the ResNet50 structure comprises: a conv5_x layer, a first 1x1 convolutional layer, a second global pooling layer and a second fully connected layer which are sequentially connected in series.
8. A face recognition method, comprising:
performing face recognition on a face image to be recognized by using an intermediate model trained by the model training method according to any one of claims 1 to 7 to obtain face features output by the face recognition model in the intermediate model and a crowd category output by the crowd category branch network;
and performing similarity comparison between the face features output by the face recognition model and the face features in the registered feature library one by one, using the similarity threshold, among the preset similarity thresholds, that corresponds to the crowd category recognized by the intermediate model, and determining the identity information of the face in the face image to be recognized.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a model training method as claimed in any one of claims 1 to 7, or a face recognition method as claimed in claim 8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the model training method of any one of claims 1 to 7 or the face recognition method of claim 8.
CN202111438776.8A 2021-11-29 2021-11-29 Model training method, face recognition method, electronic device and storage medium Pending CN113850243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111438776.8A CN113850243A (en) 2021-11-29 2021-11-29 Model training method, face recognition method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111438776.8A CN113850243A (en) 2021-11-29 2021-11-29 Model training method, face recognition method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113850243A true CN113850243A (en) 2021-12-28

Family

ID=78982496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111438776.8A Pending CN113850243A (en) 2021-11-29 2021-11-29 Model training method, face recognition method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113850243A (en)

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845330A (en) * 2016-11-17 2017-06-13 北京品恩科技股份有限公司 A kind of training method of the two-dimension human face identification model based on depth convolutional neural networks
CN107704890A (en) * 2017-10-27 2018-02-16 北京旷视科技有限公司 A kind of generation method and device of four-tuple image
CN107844784A (en) * 2017-12-08 2018-03-27 广东美的智能机器人有限公司 Face identification method, device, computer equipment and readable storage medium storing program for executing
CN110197099A (en) * 2018-02-26 2019-09-03 腾讯科技(深圳)有限公司 The method and apparatus of across age recognition of face and its model training
CN108875548A (en) * 2018-04-18 2018-11-23 科大讯飞股份有限公司 Personage's orbit generation method and device, storage medium, electronic equipment
CN108805077A (en) * 2018-06-11 2018-11-13 深圳市唯特视科技有限公司 A kind of face identification system of the deep learning network based on triple loss function
CN109815801A (en) * 2018-12-18 2019-05-28 北京英索科技发展有限公司 Face identification method and device based on deep learning
CN110084216A (en) * 2019-05-06 2019-08-02 苏州科达科技股份有限公司 Human face recognition model training and face identification method, system, equipment and medium
CN110503053A (en) * 2019-08-27 2019-11-26 电子科技大学 Human motion recognition method based on cyclic convolution neural network
CN110909785A (en) * 2019-11-18 2020-03-24 西北工业大学 Multitask Triplet loss function learning method based on semantic hierarchy
CN111292801A (en) * 2020-01-21 2020-06-16 西湖大学 Method for evaluating thyroid nodule by combining protein mass spectrum with deep learning
CN111439267A (en) * 2020-03-30 2020-07-24 上海商汤临港智能科技有限公司 Method and device for adjusting cabin environment
CN111695415A (en) * 2020-04-28 2020-09-22 平安科技(深圳)有限公司 Construction method and identification method of image identification model and related equipment
CN111783698A (en) * 2020-07-06 2020-10-16 周书田 Method for improving training stability of face recognition model
CN111967315A (en) * 2020-07-10 2020-11-20 华南理工大学 Human body comprehensive information acquisition method based on face recognition and infrared detection
CN111814706A (en) * 2020-07-14 2020-10-23 电子科技大学 Face recognition and attribute classification method based on multitask convolutional neural network
CN112288074A (en) * 2020-08-07 2021-01-29 京东安联财产保险有限公司 Image recognition network generation method and device, storage medium and electronic equipment
CN112232117A (en) * 2020-09-08 2021-01-15 深圳微步信息股份有限公司 Face recognition method, face recognition device and storage medium
CN112001366A (en) * 2020-09-25 2020-11-27 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and medium
CN112686955A (en) * 2020-12-25 2021-04-20 陈艳 Air hole positioning method and system in air tightness detection process based on artificial intelligence
CN112819098A (en) * 2021-02-26 2021-05-18 南京邮电大学 Domain self-adaption method based on triple and difference measurement
CN113326832A (en) * 2021-08-04 2021-08-31 北京的卢深视科技有限公司 Model training method, image processing method, electronic device, and storage medium
CN113657289A (en) * 2021-08-19 2021-11-16 北京百度网讯科技有限公司 Training method and device of threshold estimation model and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495123A (en) * 2022-01-14 2022-05-13 北京百度网讯科技有限公司 Optimization method, device, equipment and medium of optical character recognition model
CN115471893A (en) * 2022-09-16 2022-12-13 北京百度网讯科技有限公司 Method and device for training face recognition model and face recognition
CN115471893B (en) * 2022-09-16 2023-11-21 北京百度网讯科技有限公司 Face recognition model training, face recognition method and device
CN115410265A (en) * 2022-11-01 2022-11-29 合肥的卢深视科技有限公司 Model training method, face recognition method, electronic device and storage medium
CN115410265B (en) * 2022-11-01 2023-01-31 合肥的卢深视科技有限公司 Model training method, face recognition method, electronic device and storage medium
CN116386108A (en) * 2023-03-27 2023-07-04 南京理工大学 Fairness face recognition method based on instance consistency
CN116386108B (en) * 2023-03-27 2023-09-19 南京理工大学 Fairness face recognition method based on instance consistency

Similar Documents

Publication Publication Date Title
CN113850243A (en) Model training method, face recognition method, electronic device and storage medium
CN112949780B (en) Feature model training method, device, equipment and storage medium
CN110163236B (en) Model training method and device, storage medium and electronic device
CN107657249A (en) Method, apparatus, storage medium and the processor that Analysis On Multi-scale Features pedestrian identifies again
CN106951825A (en) A kind of quality of human face image assessment system and implementation method
CN110188829B (en) Neural network training method, target recognition method and related products
EP3136292A1 (en) Method and device for classifying an object of an image and corresponding computer program product and computer-readable medium
CN110147699B (en) Image recognition method and device and related equipment
CN104504362A (en) Face detection method based on convolutional neural network
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN110516537B (en) Face age estimation method based on self-learning
CN110807402B (en) Facial feature positioning method, system and terminal equipment based on skin color detection
CN112163637B (en) Image classification model training method and device based on unbalanced data
WO2019167784A1 (en) Position specifying device, position specifying method, and computer program
CN116363712B (en) Palmprint palm vein recognition method based on modal informativity evaluation strategy
Yang et al. A Face Detection Method Based on Skin Color Model and Improved AdaBoost Algorithm.
CN114463829B (en) Model training method, relationship identification method, electronic device, and storage medium
CN114861875A (en) Internet of things intrusion detection method based on self-supervision learning and self-knowledge distillation
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN114333011B (en) Network training method, face recognition method, electronic device and storage medium
CN112560823B (en) Adaptive variance and weight face age estimation method based on distribution learning
CN111737688B (en) Attack defense system based on user portrait
CN115410265B (en) Model training method, face recognition method, electronic device and storage medium
CN115082762A (en) Target detection unsupervised domain adaptation system based on regional recommendation network center alignment
CN115035562A (en) Facemask shielded face recognition method based on FaceNet improvement

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211228)