CN116563926B - Face recognition method, system, equipment and computer readable storage medium - Google Patents

Face recognition method, system, equipment and computer readable storage medium

Info

Publication number
CN116563926B
CN116563926B (application CN202310558627.8A)
Authority
CN
China
Prior art keywords
layer
network
occlusion
face recognition
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310558627.8A
Other languages
Chinese (zh)
Other versions
CN116563926A (en)
Inventor
刘伟华
左勇
林超超
罗艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Athena Eyes Co Ltd
Original Assignee
Athena Eyes Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Athena Eyes Co Ltd filed Critical Athena Eyes Co Ltd
Priority to CN202310558627.8A priority Critical patent/CN116563926B/en
Publication of CN116563926A publication Critical patent/CN116563926A/en
Application granted granted Critical
Publication of CN116563926B publication Critical patent/CN116563926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a face recognition method, a face recognition system, face recognition equipment and a computer-readable storage medium. A target face image to be recognized is acquired; face alignment is performed on the target face image to obtain a target processing image; and face recognition is performed on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result. The occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result. The method and device can automatically locate occlusions in a face image and eliminate their influence, so that face recognition can be carried out accurately.

Description

Face recognition method, system, equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer vision, and more particularly, to a face recognition method, system, device, and computer readable storage medium.
Background
With the development of deep learning, unoccluded face recognition has achieved remarkable success in the past few years. Thanks to carefully designed loss functions, convolutional neural network (CNN) architectures and large-scale face recognition training datasets, existing schemes achieve sufficient accuracy in unoccluded application scenarios such as access control, attendance checking and mobile payment.
In recent years, face recognition technology has developed rapidly and been widely applied, solving a number of practical problems. Occlusion is considered one of the major challenges in face recognition. To handle occlusion, there are two main directions: a) restoring the occluded features, and b) removing the occluded features. The former recovers the complete face from the occluded face, so that conventional face recognition algorithms can be applied directly. The latter reduces the impact of occlusion on the recognition result by explicitly excluding occluded regions.
However, schemes that restore the occluded features may lose identity information or introduce errors when synthesizing the occluded regions, thereby degrading the results. Schemes that remove the occluded features need to delete the features corrupted by occlusion and identify with the remaining uncorrupted features; they still have difficulty ignoring the occlusion, because occlusions are often hard to locate directly, and expensive time must be spent training an additional occlusion detector. Furthermore, networks designed for occlusion, while performing slightly better than other models on occluded faces, are often hard to trust and not robust, because they trade off occluded against unoccluded performance; that is, it is difficult to recognize occluded faces accurately without affecting the recognition of unoccluded faces. The performance of unoccluded face recognition models drops significantly on occluded faces due to the inconsistent information caused by corrupted features, whereas forcing attention onto corrupted features degrades the performance of occlusion-oriented recognition networks on unoccluded faces.
In summary, how to recognize face images accurately is a problem to be solved by those skilled in the art.
Disclosure of Invention
The purpose of the application is to provide a face recognition method which can, to a certain extent, solve the technical problem of how to recognize face images accurately. The application also provides a corresponding face recognition system, device and computer-readable storage medium.
In order to achieve the above object, the present application provides the following technical solutions:
a face recognition method, comprising:
acquiring a target face image to be identified;
performing face alignment on the target face image to obtain a target processing image;
performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
Preferably, the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer;
the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence.
Preferably, the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network;
the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, wherein the first splicing layer is used for splicing input features through jump connection and outputting the spliced features; and the first splicing layer of the ith split extraction layer connected with the output of the feature extraction backbone network is connected with the input of the (n+1-i) th feature extraction layer, wherein n represents the preset number of values.
Preferably, the mask-guided fusion attention network comprises a preset number of fusion attention (FA) layers sequentially connected with the output of the feature extraction backbone network and a mask-guided attention (MGA) layer, and the MGA layer is connected with the output of the occlusion segmentation branch network;
the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network;
the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
Preferably, the recognition network comprises an MLP network.
Preferably, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
Preferably, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
A face recognition system, comprising:
the first acquisition module is used for acquiring a target face image to be identified;
the first alignment module is used for carrying out face alignment on the target face image to obtain a target processing image;
the first recognition module is used for performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
A face recognition device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the face recognition method as described in any one of the above when executing the computer program.
A computer readable storage medium having stored therein a computer program which when executed by a processor performs the steps of the face recognition method as claimed in any one of the preceding claims.
According to the face recognition method provided by the application, a target face image to be recognized is acquired; face alignment is performed on the target face image to obtain a target processing image; and face recognition is performed on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result, wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result. The occlusion-aware guidance network can automatically locate various occlusions in the face image and eliminate the influence of the corrupted features without affecting the non-occluded features, so the face image can be recognized accurately. The face recognition system, device and computer-readable storage medium provided by the application solve the corresponding technical problems as well.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a flowchart of a face recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of an occlusion aware bootstrap network;
FIG. 3 is a schematic diagram of a non-occlusion region of a split branch prediction;
FIG. 4 is a schematic diagram of the calculation of φ_fusion;
FIG. 5 is a schematic diagram of the structure of a swin transformer layer;
fig. 6 is a flowchart of a face recognition system according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a face recognition device according to an embodiment of the present application;
fig. 8 is another schematic structural diagram of a face recognition device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, fig. 1 is a flowchart of a face recognition method according to an embodiment of the present application.
The face recognition method provided by the embodiment of the application can comprise the following steps:
step S101: and acquiring a target face image to be identified.
In practical application, a target face image to be identified can be acquired first, and information of the target face image can be determined according to practical needs, for example, the target face image can be a face image with a mask or a face image with glasses.
Step S102: and carrying out face alignment on the target face image to obtain a target processing image.
In practical application, after the target face image to be recognized is obtained, face alignment can be performed on it to obtain the target processing image. Specifically, five landmarks of the target face image (the two eyes, the nose tip and the two mouth corners of each face image) can be detected based on the standard multi-task cascaded convolutional network (MTCNN) to correctly align and crop the face image, and the face image can be scaled to a size of 112×112. In addition, to facilitate subsequent processing, the pixel values in the target processing image may be normalized to [-1.0, 1.0] or the like.
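As an illustration of this preprocessing step, the following sketch aligns a face to the 112×112 template and normalizes it to [-1.0, 1.0]. It is only a minimal example under stated assumptions: the five landmarks are assumed to come from an external detector such as MTCNN (represented here by the landmarks_5pts argument), and the reference landmark coordinates are the commonly used ArcFace-style template values, not values taken from this application.

import cv2
import numpy as np

# Canonical positions of the five landmarks in a 112x112 aligned face
# (commonly used ArcFace-style template; illustrative, not from this application).
REFERENCE_5PTS = np.array([
    [38.2946, 51.6963],   # left eye
    [73.5318, 51.5014],   # right eye
    [56.0252, 71.7366],   # nose tip
    [41.5493, 92.3655],   # left mouth corner
    [70.7299, 92.2041],   # right mouth corner
], dtype=np.float32)

def align_face(image_bgr: np.ndarray, landmarks_5pts: np.ndarray) -> np.ndarray:
    """Warp the face so its detected five landmarks match the reference template,
    then normalize pixel values to [-1.0, 1.0]."""
    matrix, _ = cv2.estimateAffinePartial2D(landmarks_5pts.astype(np.float32),
                                            REFERENCE_5PTS, method=cv2.LMEDS)
    aligned = cv2.warpAffine(image_bgr, matrix, (112, 112))
    return aligned.astype(np.float32) / 127.5 - 1.0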
Step S103: performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result; the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
In practical application, after the target processing image is obtained, face recognition can be performed on it based on the occlusion-aware guidance network provided by the application to obtain the face recognition result. The core idea of the application is to automatically focus on the uncorrupted features outside the occlusion to guide visible-face feature learning; to realize this idea, an occlusion segmentation branch is added and an occlusion mask supervision signal is introduced to guide the architecture to focus on non-occluded features. To this end, the occlusion-aware guidance network MGFace proposed in the application for face recognition has a novel occlusion segmentation branch and mask-guided fusion attention. MGFace consists of three key networks, namely the feature extraction backbone network, the occlusion segmentation branch network and the mask-guided fusion attention network, and its structure can be as shown in fig. 2.
In a specific application scenario, global information is essential due to the diversity of occlusions. However, the inherent locality of convolution kernels makes it difficult to capture the long-range dependencies that a self-attention mechanism captures effectively. In order to explore the difference between occlusion pixels and the face background, the feature extraction backbone network of the application is implemented with a swin transformer, a recent variant of the widely used transformer. Specifically, MGFace takes a mini-batch of randomly occluded or unoccluded face images as input, and the feature extraction backbone network obtains self-attention features through swin transformer blocks, which makes the joint learning of occlusion segmentation and face recognition more stable and yields a more accurate mask and thus better accuracy. The feature extraction backbone network is a standard swin transformer: it first cuts the picture into several patches and embeds each patch into a fixed-length vector representation using a linear embedding layer (Linear Embedding). To generate the hierarchical representation, the patch merging layer (Patch Merging) reduces the spatial size and increases the feature dimension to 2× the original dimension (equivalent to 2× downsampling). The feature extraction backbone network is used to extract a shared self-attention feature map for the subsequent targets of occlusion segmentation and face recognition. Unlike the conventional transformer, which uses multi-head self-attention (MSA) modules, the swin transformer takes the scale diversity of visual elements into account and is built on sliding windows (SW), including window-based multi-head self-attention (W-MSA) and sliding-window-based multi-head self-attention (SW-MSA) modules. In other words, the feature extraction backbone network in the application comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer; each feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence.
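For illustration, a minimal sketch of the block (patch) merging step described above is given below, following the standard swin transformer formulation; the (B, H, W, C) tensor layout and the class name are assumptions for clarity, not code from this application.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Halve the spatial resolution and double the channel dimension (2x downsampling)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W even
        x0 = x[:, 0::2, 0::2, :]   # top-left pixel of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)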
In a specific application scenario, the features of occluded areas are destroyed by the occlusion, so these areas cannot provide effective face information for face recognition. In order to extract deep features from non-occluded, informative areas, the occlusion segmentation branch performs a non-occlusion region segmentation task to guide the architecture to focus on non-occluded features. A feature map is extracted by the shared swin transformer feature extraction backbone network, and the occlusion segmentation branch network decodes it to reflect the probability that each pixel belongs to a non-occluded region. Similar to the U-Net encoder-decoder architecture, the occlusion segmentation branch consists essentially of patch expanding layers (Patch Expanding) and swin transformer blocks. The patch expanding layer is used in the occlusion branch to upsample the extracted deep features, which then flow into a swin transformer block. As in U-Net, skip connections are used to fuse multi-scale features from the feature extraction backbone network: shallow features and deep features are concatenated through skip connections, which reduces the loss of spatial information caused by downsampling. Finally, the output is mapped to a non-occlusion feature attention mask by an interpolation operation (Interpolation) and a linear projection layer (Linear Projection). In other words, the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network; each segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, where the first splicing layer splices the input features through a skip connection and outputs the spliced features; and the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, where n represents the value of the preset number. The occlusion segmentation map predicted by the occlusion segmentation branch network for an input face image can be as shown in fig. 3.
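A minimal sketch of the block expansion (patch expanding) step used by the occlusion segmentation branch is given below, in the spirit of the Swin-Unet decoder; the (B, H, W, C) tensor layout and the class name are assumptions for illustration.

import torch
import torch.nn as nn

class PatchExpanding(nn.Module):
    """Double the spatial resolution and halve the channel dimension (2x upsampling)."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        b, h, w, c = x.shape
        x = self.expand(x)                               # (B, H, W, 2C)
        x = x.view(b, h, w, 2, 2, c // 2)                # split channels into a 2x2 pixel block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, 2 * h, 2 * w, c // 2)
        return self.norm(x)                              # (B, 2H, 2W, C/2)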
In a specific application scenario, in order to fuse the local mask-guided attention information of non-occluded features at different scales, the application provides a mask-guided fusion attention network, which integrates global interaction among different scales and enhances the translation invariance and similar properties of MGFace. The mask-guided fusion attention network is composed of two parts, fusion attention (Fusion Attention, FA) and mask-guided attention (Mask-Guided Attention, MGA). For the self-attention feature map F_k of the kth stage, the patch expanding layer in FA first upsamples it, and the result is then spliced with the output of the same-size swin transformer block from the occlusion segmentation branch; the spliced result Y_k is reduced in dimension by a 1×1 convolution to serve as the self-attention feature map F_{k+1} of the (k+1)th stage. After multiple FA fusions, the self-attention feature map is upsampled by interpolation in MGA so as to have the same size as the occlusion segmentation mask. Then, the occlusion segmentation mask, in which occluded regions are suppressed, is processed by an exponential operation; the result is non-occlusion feature attention information and is multiplied element-wise with the upsampled self-attention feature map. With the exponential operation, the occlusion segmentation output keeps the focus on non-occluded features while reducing the negative effect of erroneous predictions, especially incorrect zero outputs. Finally, the fused feature F_att is passed through a swin transformer and average pooling to obtain the non-occlusion attention feature, which reduces information redundancy and prevents overfitting. In other words, the mask-guided fusion attention network in the application comprises a preset number of FA layers sequentially connected with the output of the feature extraction backbone network, and an MGA layer connected with the output of the occlusion segmentation branch network; the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network; the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
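To make the exponential gating in the MGA layer concrete, the following sketch shows one way the described operations could be composed; the tensor shapes, the bilinear interpolation mode and the function name mga_gate are assumptions for illustration, not the application's implementation.

import torch
import torch.nn.functional as F

def mga_gate(fused_feature: torch.Tensor, occlusion_mask: torch.Tensor) -> torch.Tensor:
    """
    fused_feature : (B, C, H, W) self-attention feature map after the FA fusions
    occlusion_mask: (B, 1, h, w) occlusion segmentation output, low on occluded pixels
    """
    # Second interpolation layer: upsample the mask to the feature-map resolution.
    mask = F.interpolate(occlusion_mask, size=fused_feature.shape[-2:],
                         mode='bilinear', align_corners=False)
    # Exponential operation layer: an erroneous zero prediction yields a weight of 1
    # instead of wiping out the feature, while non-occluded regions are emphasized.
    attention = torch.exp(mask)
    # Element-wise multiplication layer; the result then goes through block merging,
    # a swin transformer layer and average pooling to give the final attention feature.
    return fused_feature * attention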
In a specific application scenario, the recognition network may include an MLP network, and the face recognition result may include the identity of the face user and the like.
In a specific application scenario, the loss function of the occlusion-aware guidance network during training may include:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
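The formulas for L_seg and L_cls themselves are given in the drawings of the application and are not reproduced here. The parameters listed above (class number N, sample number n, angles θ_j and θ_{y_i}, margin m and scale s) match the standard additive-angular-margin (ArcFace-style) classification loss, and the segmentation parameters match a per-pixel cross-entropy between Ŷ and Y; purely as an assumed reading, not as the application's exact formulas, they could take the form:

L_seg = -(1/|Ω|)·Σ_{p∈Ω} [ Y_p·log(Ŷ_p) + (1 - Y_p)·log(1 - Ŷ_p) ]

L_cls = -(1/n)·Σ_{i=1..n} log( e^{s·cos(θ_{y_i}+m)} / ( e^{s·cos(θ_{y_i}+m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )

where Ω denotes the set of pixels in the segmentation map.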
In a specific application scenario, many high-performance face recognition models cannot recognize occluded faces, and it is therefore difficult for them to generalize to both unoccluded and occluded faces, because an occluded face is in essence treated as an outlier due to the corruption caused by occlusion. However, when samples of both occluded and unoccluded faces are used to train the network to generate a discriminative feature embedding space, it is difficult to optimize the model to accommodate all sample types. For this purpose, the application trains with synthetic occluded faces. On top of the occlusion segmentation branch and the mask-guided fusion attention module, a fusion constraint φ_fusion is further introduced to reduce the interference of occlusion with the consistency between the features of occluded and unoccluded faces; φ_fusion is calculated by also feeding in the original unoccluded face before occlusion synthesis, as shown in fig. 4. In other words, the loss function of the occlusion-aware guidance network during training may include:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
F'_att shown in fig. 4 is the attention feature map calculated in the mask-guided fusion attention module; Y_k and the segmentation mask come from the masks of the uncorrupted regions in the synthetic occluded face segmented by the occlusion segmentation branch, and these masks are shared with the original unoccluded face. If a sample is not augmented with synthetic occlusion, F_att - F'_att will be zero, which does not affect the optimisation of the network. By minimizing the L_2 distance between the feature attention information of the occluded face and the unoccluded face, the fusion constraint φ_fusion encourages the model to make unchanged predictions for occluded and unoccluded faces, which helps generalization. These objectives should be balanced in order to obtain a satisfactory face recognition result. In the initial learning phase, the application focuses more on occlusion segmentation and face recognition, as this is a prerequisite. When this progresses smoothly, the application turns the learning focus to the fusion constraint φ_fusion, which makes the architecture concentrate on reducing the similarity distance between unoccluded and occluded faces. That is, λ is almost 0 at the beginning; then, once the oscillation amplitude of L_seg and L_cls has remained small enough over a sufficient number of iterations, λ is increased to a relatively high value, and the training of the model is finally completed.
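As a minimal sketch of this training objective, the fusion constraint can be computed as the Euclidean distance between the two attention feature maps and added to the overall loss with the adaptive weight λ; the function names and the simple weighting arguments below are assumptions for illustration.

import torch

def fusion_constraint(f_att: torch.Tensor, f_att_clean: torch.Tensor) -> torch.Tensor:
    """phi_fusion: Euclidean (L2) distance between the attention feature map of the
    synthetic occluded face and that of the original unoccluded face."""
    return torch.norm(f_att - f_att_clean, p=2)

def overall_loss(l_seg, l_cls, phi_fusion, alpha=1.0, beta=1.0, lam=0.0):
    """L_overall = alpha*L_seg + beta*L_cls + lambda*phi_fusion.
    lam starts near 0 and is raised once L_seg and L_cls have stabilised."""
    return alpha * l_seg + beta * l_cls + lam * phi_fusion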
In a specific application scenario, the structures of the first swin transformer layer, the second swin transformer layer, the third swin transformer layer and the fourth swin transformer layer can be as shown in fig. 5, and a pair of consecutive swin transformer blocks can be expressed as:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1}
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}
wherein ẑ^l and z^l represent the outputs of the W-MSA module and the MLP of the lth swin transformer block, respectively, and ẑ^{l+1} and z^{l+1} represent the outputs of the SW-MSA module and the MLP of the (l+1)th swin transformer block; LN denotes layer normalization. Consistent with the transformer, the swin transformer self-attention can be calculated as follows:
Attention(Q, K, V) = SoftMax(Q·K^T/√d + B)·V
wherein Q represents the query matrix; K represents the key matrix; V represents the value matrix; d represents the dimension of the key matrix; B represents the bias matrix; and T represents matrix transposition.
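For illustration, the self-attention computation above can be sketched as follows; the tensor shapes are assumptions, and the bias B is taken as a precomputed tensor.

import math
import torch

def window_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     rel_pos_bias: torch.Tensor) -> torch.Tensor:
    """
    q, k, v      : (num_windows, num_heads, tokens, head_dim)
    rel_pos_bias : (num_heads, tokens, tokens) bias matrix B
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # Q.K^T / sqrt(d)
    scores = scores + rel_pos_bias                    # add B (broadcast over windows)
    return torch.softmax(scores, dim=-1) @ v          # SoftMax(...).V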
According to the face recognition method provided by the application, a target face image to be recognized is acquired; face alignment is performed on the target face image to obtain a target processing image; and face recognition is performed on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result, wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result. The occlusion-aware guidance network can automatically locate various occlusions in the face image and eliminate the influence of the corrupted features without affecting the non-occluded features, so the face image can be recognized accurately.
Referring to fig. 6, fig. 6 is a flowchart of a face recognition system according to an embodiment of the present application.
The face recognition system provided in the embodiment of the application may include:
an acquisition module 101, configured to acquire a target face image to be identified;
the alignment module 102 is configured to perform face alignment on a target face image to obtain a target processing image;
the recognition module 103 is used for performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
In the face recognition system provided by the embodiment of the application, the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer;
the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence.
In the face recognition system provided by the embodiment of the application, the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network;
the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, where the first splicing layer splices the input features through a skip connection and outputs the spliced features; and the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, where n represents the value of the preset number.
In the face recognition system provided by the embodiment of the application, the mask-guided fusion attention network comprises a preset number of FA layers sequentially connected with the output of the feature extraction backbone network and an MGA layer, and the MGA layer is connected with the output of the occlusion segmentation branch network;
the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network;
the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
In the face recognition system provided by the embodiment of the application, the recognition network comprises an MLP network.
In the face recognition system provided by the embodiment of the application, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
In the face recognition system provided by the embodiment of the application, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
The application also provides face recognition equipment and a computer readable storage medium, which have the corresponding effects of the face recognition method. Referring to fig. 7, fig. 7 is a schematic structural diagram of a face recognition device according to an embodiment of the present application.
The face recognition device provided in the embodiment of the present application includes a memory 201 and a processor 202, where the memory 201 stores a computer program, and the processor 202 implements the steps of the face recognition method described in any of the embodiments above when executing the computer program.
Referring to fig. 8, another face recognition device provided in an embodiment of the present application may further include: an input port 203 connected to the processor 202, for transmitting externally input commands to the processor 202; a display unit 204 connected to the processor 202, for displaying the processing results of the processor 202 to the outside; and a communication module 205 connected to the processor 202, for realizing communication between the face recognition device and the outside. The display unit 204 may be a display panel, a laser scanning display, or the like; communication modes employed by the communication module 205 include, but are not limited to, Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), and wireless connections such as wireless fidelity (WiFi), Bluetooth communication, Bluetooth low energy communication and IEEE 802.11s-based communication.
The embodiment of the application provides a computer readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the steps of the face recognition method described in any embodiment above are implemented.
The computer readable storage medium referred to in this application includes Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The description of the relevant parts in the face recognition system, the device and the computer readable storage medium provided in the embodiments of the present application refers to the detailed description of the corresponding parts in the face recognition method provided in the embodiments of the present application, and will not be repeated here. In addition, the parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of the corresponding technical solutions in the prior art, are not described in detail, so that redundant descriptions are avoided.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A face recognition method, comprising:
acquiring a target face image to be identified;
performing face alignment on the target face image to obtain a target processing image;
performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result;
the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer; the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence;
the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network; the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, and the first splicing layer is used for splicing the input features through a skip connection and outputting the spliced features; the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, and n represents the value of the preset number;
the mask-guided fusion attention network comprises a preset number of fusion attention (FA) layers sequentially connected with the output of the feature extraction backbone network and a mask-guided attention (MGA) layer, and the MGA layer is connected with the output of the occlusion segmentation branch network; the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network; the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
2. The method of claim 1, wherein the recognition network comprises an MLP network.
3. The method according to any of claims 1 to 2, wherein the loss function of the occlusion-aware guidance network comprises:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
4. The method according to any of claims 1 to 2, wherein the loss function of the occlusion-aware guidance network comprises:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
5. A face recognition system, comprising:
the acquisition module is used for acquiring a target face image to be identified;
the alignment module is used for carrying out face alignment on the target face image to obtain a target processing image;
the recognition module is used for performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result;
the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer; the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence;
the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network; the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, and the first splicing layer is used for splicing the input features through a skip connection and outputting the spliced features; the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, and n represents the value of the preset number;
the mask-guided fusion attention network comprises a preset number of fusion attention (FA) layers sequentially connected with the output of the feature extraction backbone network and a mask-guided attention (MGA) layer, and the MGA layer is connected with the output of the occlusion segmentation branch network; the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network; the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
6. A face recognition device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the face recognition method according to any one of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the steps of the face recognition method according to any one of claims 1 to 4.
CN202310558627.8A 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium Active CN116563926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310558627.8A CN116563926B (en) 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310558627.8A CN116563926B (en) 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116563926A CN116563926A (en) 2023-08-08
CN116563926B (en) 2024-03-01

Family

ID=87491301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310558627.8A Active CN116563926B (en) 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116563926B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728330A (en) * 2019-10-23 2020-01-24 腾讯科技(深圳)有限公司 Object identification method, device, equipment and storage medium based on artificial intelligence
CN111639596A (en) * 2020-05-29 2020-09-08 上海锘科智能科技有限公司 Anti-glasses-shielding face recognition method based on attention mechanism and residual error network
CN111814603A (en) * 2020-06-23 2020-10-23 汇纳科技股份有限公司 Face recognition method, medium and electronic device
CN111898413A (en) * 2020-06-16 2020-11-06 深圳市雄帝科技股份有限公司 Face recognition method, face recognition device, electronic equipment and medium
CN111914748A (en) * 2020-07-31 2020-11-10 平安科技(深圳)有限公司 Face recognition method and device, electronic equipment and computer readable storage medium
CN112766158A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Multi-task cascading type face shielding expression recognition method
CN112949565A (en) * 2021-03-25 2021-06-11 重庆邮电大学 Single-sample partially-shielded face recognition method and system based on attention mechanism
CN113807332A (en) * 2021-11-19 2021-12-17 珠海亿智电子科技有限公司 Mask robust face recognition network, method, electronic device and storage medium
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium
WO2022213349A1 (en) * 2021-04-09 2022-10-13 鸿富锦精密工业(武汉)有限公司 Method and apparatus for recognizing face with mask, and computer storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728330A (en) * 2019-10-23 2020-01-24 腾讯科技(深圳)有限公司 Object identification method, device, equipment and storage medium based on artificial intelligence
CN111639596A (en) * 2020-05-29 2020-09-08 上海锘科智能科技有限公司 Anti-glasses-shielding face recognition method based on attention mechanism and residual error network
CN111898413A (en) * 2020-06-16 2020-11-06 深圳市雄帝科技股份有限公司 Face recognition method, face recognition device, electronic equipment and medium
CN111814603A (en) * 2020-06-23 2020-10-23 汇纳科技股份有限公司 Face recognition method, medium and electronic device
CN111914748A (en) * 2020-07-31 2020-11-10 平安科技(深圳)有限公司 Face recognition method and device, electronic equipment and computer readable storage medium
CN112766158A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Multi-task cascading type face shielding expression recognition method
CN112949565A (en) * 2021-03-25 2021-06-11 重庆邮电大学 Single-sample partially-shielded face recognition method and system based on attention mechanism
WO2022213349A1 (en) * 2021-04-09 2022-10-13 鸿富锦精密工业(武汉)有限公司 Method and apparatus for recognizing face with mask, and computer storage medium
CN113807332A (en) * 2021-11-19 2021-12-17 珠海亿智电子科技有限公司 Mask robust face recognition network, method, electronic device and storage medium
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning 3D Face Representation with Vision Transformer for Masked Face Recognition; Yuan Wang; 2022 Asia Conference on Algorithms, Computing and Machine Learning (CACML); pp. 505-511 *
Research on Face Recognition under Unconstrained Conditions; Tu Xiaoguang; China Doctoral Dissertations Full-text Database (Information Science & Technology), No. 03; I138-38 *

Also Published As

Publication number Publication date
CN116563926A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN112668573B (en) Target detection position reliability determination method and device, electronic equipment and storage medium
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
Shao et al. Uncertainty guided multi-scale attention network for raindrop removal from a single image
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN112446322B (en) Eyeball characteristic detection method, device, equipment and computer readable storage medium
US11823437B2 (en) Target detection and model training method and apparatus, device and storage medium
CN111985374A (en) Face positioning method and device, electronic equipment and storage medium
CN114758255A (en) Unmanned aerial vehicle detection method based on YOLOV5 algorithm
CN114898111B (en) Pre-training model generation method and device, and target detection method and device
CN115577768A (en) Semi-supervised model training method and device
CN115984714A (en) Cloud detection method based on double-branch network model
CN116563840B (en) Scene text detection and recognition method based on weak supervision cross-mode contrast learning
CN116563926B (en) Face recognition method, system, equipment and computer readable storage medium
CN114565953A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN112862840B (en) Image segmentation method, device, equipment and medium
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN114581353A (en) Infrared image processing method and device, medium and electronic equipment
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN114022458A (en) Skeleton detection method and device, electronic equipment and computer readable storage medium
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium
CN116403269B (en) Method, system, equipment and computer storage medium for analyzing occlusion human face
Lin et al. P‐2.23: Deep Learning‐based System for Processing Complex Floorplans
CN117975449A (en) Semi-supervised cell instance segmentation method based on multitask learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 205, Building B1, Huigu Science and Technology Industrial Park, No. 336 Bachelor Road, Bachelor Street, Yuelu District, Changsha City, Hunan Province, 410000

Patentee after: Wisdom Eye Technology Co.,Ltd.

Country or region after: China

Address before: 410205, Changsha high tech Zone, Hunan Province, China

Patentee before: Wisdom Eye Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address