CN116563926B - Face recognition method, system, equipment and computer readable storage medium - Google Patents

Face recognition method, system, equipment and computer readable storage medium

Info

Publication number
CN116563926B
CN116563926B (application CN202310558627.8A)
Authority
CN
China
Prior art keywords
layer
network
occlusion
face recognition
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310558627.8A
Other languages
Chinese (zh)
Other versions
CN116563926A (en)
Inventor
刘伟华
左勇
林超超
罗艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Athena Eyes Co Ltd
Original Assignee
Athena Eyes Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Athena Eyes Co Ltd filed Critical Athena Eyes Co Ltd
Priority to CN202310558627.8A priority Critical patent/CN116563926B/en
Publication of CN116563926A publication Critical patent/CN116563926A/en
Application granted granted Critical
Publication of CN116563926B publication Critical patent/CN116563926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a face recognition method, a face recognition system, face recognition equipment and a computer-readable storage medium. A target face image to be recognized is acquired; face alignment is performed on the target face image to obtain a target processing image; and face recognition is performed on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result. The occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result. The method and device can automatically locate occlusions in a face image and eliminate their influence, so that face recognition can be carried out accurately.

Description

Face recognition method, system, equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer vision, and more particularly, to a face recognition method, system, device, and computer readable storage medium.
Background
With the development of deep learning, unoccluded face recognition has achieved remarkable success in the past few years. Thanks to carefully designed loss functions, convolutional neural network (CNN) architectures and large-scale face recognition training datasets, existing schemes achieve sufficient accuracy in unoccluded application scenarios such as access control, attendance checking and mobile payment.
In recent years, face recognition technology has developed rapidly and been widely applied, solving a number of practical problems. Occlusion is considered one of the major challenges in face recognition. To handle occlusion, there are two main directions: a) restoring the occluded features, and b) removing the occluded features. The former recovers the complete face from the occluded face, so that conventional face recognition algorithms can be applied directly. The latter reduces the impact of occlusion on the recognition result by explicitly excluding occluded regions.
However, schemes that restore the occluded features may lose identity information or introduce errors when synthesizing the occluded regions, thereby degrading the results. Schemes that remove the occluded features need to delete the features corrupted by occlusion and identify with the remaining uncorrupted features; they still have difficulty ignoring the occlusion, because occlusions are often hard to locate directly, and expensive time must be spent training an additional occlusion detector. Furthermore, networks designed for occlusion, while performing slightly better than other models on occluded faces, are often hard to trust and not robust, because they trade off occluded against unoccluded performance; that is, it is difficult to recognize occluded faces accurately without affecting the recognition of unoccluded faces. The performance of unoccluded face recognition models drops significantly on occluded faces due to the inconsistent information caused by corrupted features, whereas forcing attention onto corrupted features degrades the performance of occlusion-oriented recognition networks on unoccluded faces.
In summary, how to recognize face images accurately is a problem to be solved by those skilled in the art.
Disclosure of Invention
The purpose of the application is to provide a face recognition method which can, to a certain extent, solve the technical problem of how to recognize face images accurately. The application also provides a corresponding face recognition system, device and computer-readable storage medium.
In order to achieve the above object, the present application provides the following technical solutions:
a face recognition method, comprising:
acquiring a target face image to be identified;
performing face alignment on the target face image to obtain a target processing image;
performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
Preferably, the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer;
the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence.
Preferably, the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network;
the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, wherein the first splicing layer is used for splicing input features through jump connection and outputting the spliced features; and the first splicing layer of the ith split extraction layer connected with the output of the feature extraction backbone network is connected with the input of the (n+1-i) th feature extraction layer, wherein n represents the preset number of values.
Preferably, the mask-guided fusion attention network comprises a preset number of fusion attention (FA) layers sequentially connected with the output of the feature extraction backbone network and a mask-guided attention (MGA) layer, and the MGA layer is connected with the output of the occlusion segmentation branch network;
the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network;
the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
Preferably, the recognition network comprises an MLP network.
Preferably, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
Preferably, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
A face recognition system, comprising:
the first acquisition module is used for acquiring a target face image to be identified;
the first alignment module is used for carrying out face alignment on the target face image to obtain a target processing image;
the first recognition module is used for performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
A face recognition device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the face recognition method as described in any one of the above when executing the computer program.
A computer readable storage medium having stored therein a computer program which when executed by a processor performs the steps of the face recognition method as claimed in any one of the preceding claims.
According to the face recognition method provided by the application, a target face image to be recognized is acquired; face alignment is performed on the target face image to obtain a target processing image; and face recognition is performed on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result, wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result. The occlusion-aware guidance network can automatically locate various occlusions in the face image and eliminate the influence of the corrupted features without affecting the non-occluded features, so the face image can be recognized accurately. The face recognition system, device and computer-readable storage medium provided by the application solve the corresponding technical problems as well.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a flowchart of a face recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of an occlusion aware bootstrap network;
FIG. 3 is a schematic diagram of a non-occlusion region of a split branch prediction;
FIG. 4 is a schematic diagram of the calculation of φ_fusion;
FIG. 5 is a schematic diagram of the structure of a swin transformer layer;
fig. 6 is a flowchart of a face recognition system according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a face recognition device according to an embodiment of the present application;
fig. 8 is another schematic structural diagram of a face recognition device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, fig. 1 is a flowchart of a face recognition method according to an embodiment of the present application.
The face recognition method provided by the embodiment of the application can comprise the following steps:
step S101: and acquiring a target face image to be identified.
In practical application, a target face image to be identified can be acquired first, and information of the target face image can be determined according to practical needs, for example, the target face image can be a face image with a mask or a face image with glasses.
Step S102: and carrying out face alignment on the target face image to obtain a target processing image.
In practical application, after the target face image to be recognized is obtained, face alignment can be performed on it to obtain the target processing image. Specifically, five landmarks of the target face image (the two eyes, the nose tip and the two mouth corners of each face image) can be detected based on the standard multi-task cascaded convolutional network (MTCNN) to correctly align and crop the face image, and the face image can be scaled to a size of 112×112. In addition, to facilitate subsequent processing, the pixel values in the target processing image may be normalized to [-1.0, 1.0] or the like.
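As an illustration of this preprocessing step, the following sketch aligns a face to the 112×112 template and normalizes it to [-1.0, 1.0]. It is only a minimal example under stated assumptions: the five landmarks are assumed to come from an external detector such as MTCNN (represented here by the landmarks_5pts argument), and the reference landmark coordinates are the commonly used ArcFace-style template values, not values taken from this application.

import cv2
import numpy as np

# Canonical positions of the five landmarks in a 112x112 aligned face
# (commonly used ArcFace-style template; illustrative, not from this application).
REFERENCE_5PTS = np.array([
    [38.2946, 51.6963],   # left eye
    [73.5318, 51.5014],   # right eye
    [56.0252, 71.7366],   # nose tip
    [41.5493, 92.3655],   # left mouth corner
    [70.7299, 92.2041],   # right mouth corner
], dtype=np.float32)

def align_face(image_bgr: np.ndarray, landmarks_5pts: np.ndarray) -> np.ndarray:
    """Warp the face so its detected five landmarks match the reference template,
    then normalize pixel values to [-1.0, 1.0]."""
    matrix, _ = cv2.estimateAffinePartial2D(landmarks_5pts.astype(np.float32),
                                            REFERENCE_5PTS, method=cv2.LMEDS)
    aligned = cv2.warpAffine(image_bgr, matrix, (112, 112))
    return aligned.astype(np.float32) / 127.5 - 1.0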
Step S103: performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result; the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
In practical application, after the target processing image is obtained, face recognition can be performed on it based on the occlusion-aware guidance network provided by the application to obtain the face recognition result. The core idea of the application is to automatically focus on the uncorrupted features outside the occlusion to guide visible-face feature learning; to realize this idea, an occlusion segmentation branch is added and an occlusion mask supervision signal is introduced to guide the architecture to focus on non-occluded features. To this end, the occlusion-aware guidance network MGFace proposed in the application for face recognition has a novel occlusion segmentation branch and mask-guided fusion attention. MGFace consists of three key networks, namely the feature extraction backbone network, the occlusion segmentation branch network and the mask-guided fusion attention network, and its structure can be as shown in fig. 2.
In a specific application scenario, global information is essential due to the diversity of occlusions. However, the inherent locality of convolution kernels makes it difficult to capture the long-range dependencies that a self-attention mechanism captures effectively. In order to explore the difference between occlusion pixels and the face background, the feature extraction backbone network of the application is implemented with a swin transformer, a recent variant of the widely used transformer. Specifically, MGFace takes a mini-batch of randomly occluded or unoccluded face images as input, and the feature extraction backbone network obtains self-attention features through swin transformer blocks, which makes the joint learning of occlusion segmentation and face recognition more stable and yields a more accurate mask and thus better accuracy. The feature extraction backbone network is a standard swin transformer: it first cuts the picture into several patches and embeds each patch into a fixed-length vector representation using a linear embedding layer (Linear Embedding). To generate the hierarchical representation, the patch merging layer (Patch Merging) reduces the spatial size and increases the feature dimension to 2× the original dimension (equivalent to 2× downsampling). The feature extraction backbone network is used to extract a shared self-attention feature map for the subsequent targets of occlusion segmentation and face recognition. Unlike the conventional transformer, which uses multi-head self-attention (MSA) modules, the swin transformer takes the scale diversity of visual elements into account and is built on sliding windows (SW), including window-based multi-head self-attention (W-MSA) and sliding-window-based multi-head self-attention (SW-MSA) modules. In other words, the feature extraction backbone network in the application comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer; each feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence.
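For illustration, a minimal sketch of the block (patch) merging step described above is given below, following the standard swin transformer formulation; the (B, H, W, C) tensor layout and the class name are assumptions for clarity, not code from this application.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Halve the spatial resolution and double the channel dimension (2x downsampling)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W even
        x0 = x[:, 0::2, 0::2, :]   # top-left pixel of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)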
In a specific application scenario, the features of occluded areas are destroyed by the occlusion, so these areas cannot provide effective face information for face recognition. In order to extract deep features from non-occluded, informative areas, the occlusion segmentation branch performs a non-occlusion region segmentation task to guide the architecture to focus on non-occluded features. A feature map is extracted by the shared swin transformer feature extraction backbone network, and the occlusion segmentation branch network decodes it to reflect the probability that each pixel belongs to a non-occluded region. Similar to the U-Net encoder-decoder architecture, the occlusion segmentation branch consists essentially of patch expanding layers (Patch Expanding) and swin transformer blocks. The patch expanding layer is used in the occlusion branch to upsample the extracted deep features, which then flow into a swin transformer block. As in U-Net, skip connections are used to fuse multi-scale features from the feature extraction backbone network: shallow features and deep features are concatenated through skip connections, which reduces the loss of spatial information caused by downsampling. Finally, the output is mapped to a non-occlusion feature attention mask by an interpolation operation (Interpolation) and a linear projection layer (Linear Projection). In other words, the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network; each segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, where the first splicing layer splices the input features through a skip connection and outputs the spliced features; and the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, where n represents the value of the preset number. The occlusion segmentation map predicted by the occlusion segmentation branch network for an input face image can be as shown in fig. 3.
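A minimal sketch of the block expansion (patch expanding) step used by the occlusion segmentation branch is given below, in the spirit of the Swin-Unet decoder; the (B, H, W, C) tensor layout and the class name are assumptions for illustration.

import torch
import torch.nn as nn

class PatchExpanding(nn.Module):
    """Double the spatial resolution and halve the channel dimension (2x upsampling)."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        b, h, w, c = x.shape
        x = self.expand(x)                               # (B, H, W, 2C)
        x = x.view(b, h, w, 2, 2, c // 2)                # split channels into a 2x2 pixel block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, 2 * h, 2 * w, c // 2)
        return self.norm(x)                              # (B, 2H, 2W, C/2)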
In a specific application scenario, in order to fuse the local mask-guided attention information of non-occluded features at different scales, the application provides a mask-guided fusion attention network, which integrates global interaction among different scales and enhances the translation invariance and similar properties of MGFace. The mask-guided fusion attention network is composed of two parts, fusion attention (Fusion Attention, FA) and mask-guided attention (Mask-Guided Attention, MGA). For the self-attention feature map F_k of the kth stage, the patch expanding layer in FA first upsamples it, and the result is then spliced with the output of the same-size swin transformer block from the occlusion segmentation branch; the spliced result Y_k is reduced in dimension by a 1×1 convolution to serve as the self-attention feature map F_{k+1} of the (k+1)th stage. After multiple FA fusions, the self-attention feature map is upsampled by interpolation in MGA so as to have the same size as the occlusion segmentation mask. Then, the occlusion segmentation mask, in which occluded regions are suppressed, is processed by an exponential operation; the result is non-occlusion feature attention information and is multiplied element-wise with the upsampled self-attention feature map. With the exponential operation, the occlusion segmentation output keeps the focus on non-occluded features while reducing the negative effect of erroneous predictions, especially incorrect zero outputs. Finally, the fused feature F_att is passed through a swin transformer and average pooling to obtain the non-occlusion attention feature, which reduces information redundancy and prevents overfitting. In other words, the mask-guided fusion attention network in the application comprises a preset number of FA layers sequentially connected with the output of the feature extraction backbone network, and an MGA layer connected with the output of the occlusion segmentation branch network; the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network; the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
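To make the exponential gating in the MGA layer concrete, the following sketch shows one way the described operations could be composed; the tensor shapes, the bilinear interpolation mode and the function name mga_gate are assumptions for illustration, not the application's implementation.

import torch
import torch.nn.functional as F

def mga_gate(fused_feature: torch.Tensor, occlusion_mask: torch.Tensor) -> torch.Tensor:
    """
    fused_feature : (B, C, H, W) self-attention feature map after the FA fusions
    occlusion_mask: (B, 1, h, w) occlusion segmentation output, low on occluded pixels
    """
    # Second interpolation layer: upsample the mask to the feature-map resolution.
    mask = F.interpolate(occlusion_mask, size=fused_feature.shape[-2:],
                         mode='bilinear', align_corners=False)
    # Exponential operation layer: an erroneous zero prediction yields a weight of 1
    # instead of wiping out the feature, while non-occluded regions are emphasized.
    attention = torch.exp(mask)
    # Element-wise multiplication layer; the result then goes through block merging,
    # a swin transformer layer and average pooling to give the final attention feature.
    return fused_feature * attention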
In a specific application scenario, the recognition network may include an MLP network, and the face recognition result may include the identity of the face user and the like.
In a specific application scenario, the loss function of the occlusion-aware guidance network during training may include:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
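The formulas for L_seg and L_cls themselves are given in the drawings of the application and are not reproduced here. The parameters listed above (class number N, sample number n, angles θ_j and θ_{y_i}, margin m and scale s) match the standard additive-angular-margin (ArcFace-style) classification loss, and the segmentation parameters match a per-pixel cross-entropy between Ŷ and Y; purely as an assumed reading, not as the application's exact formulas, they could take the form:

L_seg = -(1/|Ω|)·Σ_{p∈Ω} [ Y_p·log(Ŷ_p) + (1 - Y_p)·log(1 - Ŷ_p) ]

L_cls = -(1/n)·Σ_{i=1..n} log( e^{s·cos(θ_{y_i}+m)} / ( e^{s·cos(θ_{y_i}+m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )

where Ω denotes the set of pixels in the segmentation map.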
In a specific application scenario, many high-performance face recognition models cannot recognize occluded faces, and it is therefore difficult for them to generalize to both unoccluded and occluded faces, because an occluded face is in essence treated as an outlier due to the corruption caused by occlusion. However, when samples of both occluded and unoccluded faces are used to train the network to generate a discriminative feature embedding space, it is difficult to optimize the model to accommodate all sample types. For this purpose, the application trains with synthetic occluded faces. On top of the occlusion segmentation branch and the mask-guided fusion attention module, a fusion constraint φ_fusion is further introduced to reduce the interference of occlusion with the consistency between the features of occluded and unoccluded faces; φ_fusion is calculated by also feeding in the original unoccluded face before occlusion synthesis, as shown in fig. 4. In other words, the loss function of the occlusion-aware guidance network during training may include:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
F'_att shown in fig. 4 is the attention feature map calculated in the mask-guided fusion attention module; Y_k and the segmentation mask come from the masks of the uncorrupted regions in the synthetic occluded face segmented by the occlusion segmentation branch, and these masks are shared with the original unoccluded face. If a sample is not augmented with synthetic occlusion, F_att - F'_att will be zero, which does not affect the optimisation of the network. By minimizing the L_2 distance between the feature attention information of the occluded face and the unoccluded face, the fusion constraint φ_fusion encourages the model to make unchanged predictions for occluded and unoccluded faces, which helps generalization. These objectives should be balanced in order to obtain a satisfactory face recognition result. In the initial learning phase, the application focuses more on occlusion segmentation and face recognition, as this is a prerequisite. When this progresses smoothly, the application turns the learning focus to the fusion constraint φ_fusion, which makes the architecture concentrate on reducing the similarity distance between unoccluded and occluded faces. That is, λ is almost 0 at the beginning; then, once the oscillation amplitude of L_seg and L_cls has remained small enough over a sufficient number of iterations, λ is increased to a relatively high value, and the training of the model is finally completed.
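As a minimal sketch of this training objective, the fusion constraint can be computed as the Euclidean distance between the two attention feature maps and added to the overall loss with the adaptive weight λ; the function names and the simple weighting arguments below are assumptions for illustration.

import torch

def fusion_constraint(f_att: torch.Tensor, f_att_clean: torch.Tensor) -> torch.Tensor:
    """phi_fusion: Euclidean (L2) distance between the attention feature map of the
    synthetic occluded face and that of the original unoccluded face."""
    return torch.norm(f_att - f_att_clean, p=2)

def overall_loss(l_seg, l_cls, phi_fusion, alpha=1.0, beta=1.0, lam=0.0):
    """L_overall = alpha*L_seg + beta*L_cls + lambda*phi_fusion.
    lam starts near 0 and is raised once L_seg and L_cls have stabilised."""
    return alpha * l_seg + beta * l_cls + lam * phi_fusion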
In a specific application scenario, the structures of the first swin transformer layer, the second swin transformer layer, the third swin transformer layer and the fourth swin transformer layer can be as shown in fig. 5, and a pair of consecutive swin transformer blocks can be expressed as:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1}
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}
wherein ẑ^l and z^l represent the outputs of the W-MSA module and the MLP of the lth swin transformer block, respectively, and ẑ^{l+1} and z^{l+1} represent the outputs of the SW-MSA module and the MLP of the (l+1)th swin transformer block; LN denotes layer normalization. Consistent with the transformer, the swin transformer self-attention can be calculated as follows:
Attention(Q, K, V) = SoftMax(Q·K^T/√d + B)·V
wherein Q represents the query matrix; K represents the key matrix; V represents the value matrix; d represents the dimension of the key matrix; B represents the bias matrix; and T represents matrix transposition.
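For illustration, the self-attention computation above can be sketched as follows; the tensor shapes are assumptions, and the bias B is taken as a precomputed tensor.

import math
import torch

def window_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     rel_pos_bias: torch.Tensor) -> torch.Tensor:
    """
    q, k, v      : (num_windows, num_heads, tokens, head_dim)
    rel_pos_bias : (num_heads, tokens, tokens) bias matrix B
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # Q.K^T / sqrt(d)
    scores = scores + rel_pos_bias                    # add B (broadcast over windows)
    return torch.softmax(scores, dim=-1) @ v          # SoftMax(...).V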
According to the face recognition method provided by the application, a target face image to be recognized is acquired; face alignment is performed on the target face image to obtain a target processing image; and face recognition is performed on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result, wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result. The occlusion-aware guidance network can automatically locate various occlusions in the face image and eliminate the influence of the corrupted features without affecting the non-occluded features, so the face image can be recognized accurately.
Referring to fig. 6, fig. 6 is a flowchart of a face recognition system according to an embodiment of the present application.
The face recognition system provided in the embodiment of the application may include:
an acquisition module 101, configured to acquire a target face image to be identified;
the alignment module 102 is configured to perform face alignment on a target face image to obtain a target processing image;
the recognition module 103 is used for performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result.
In the face recognition system provided by the embodiment of the application, the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer;
the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence.
In the face recognition system provided by the embodiment of the application, the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network;
the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, where the first splicing layer splices the input features through a skip connection and outputs the spliced features; and the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, where n represents the value of the preset number.
In the face recognition system provided by the embodiment of the application, the mask-guided fusion attention network comprises a preset number of FA layers sequentially connected with the output of the feature extraction backbone network and an MGA layer, and the MGA layer is connected with the output of the occlusion segmentation branch network;
the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network;
the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
In the face recognition system provided by the embodiment of the application, the recognition network comprises an MLP network.
In the face recognition system provided by the embodiment of the application, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
In the face recognition system provided by the embodiment of the application, the loss function of the occlusion-aware guidance network includes:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
The application also provides face recognition equipment and a computer readable storage medium, which have the corresponding effects of the face recognition method. Referring to fig. 7, fig. 7 is a schematic structural diagram of a face recognition device according to an embodiment of the present application.
The face recognition device provided in the embodiment of the present application includes a memory 201 and a processor 202, where the memory 201 stores a computer program, and the processor 202 implements the steps of the face recognition method described in any of the embodiments above when executing the computer program.
Referring to fig. 8, another face recognition device provided in an embodiment of the present application may further include: an input port 203 connected to the processor 202, for transmitting externally input commands to the processor 202; a display unit 204 connected to the processor 202, for displaying the processing results of the processor 202 to the outside; and a communication module 205 connected to the processor 202, for realizing communication between the face recognition device and the outside. The display unit 204 may be a display panel, a laser scanning display, or the like; communication modes employed by the communication module 205 include, but are not limited to, Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), and wireless connections such as wireless fidelity (WiFi), Bluetooth communication, Bluetooth low energy communication and IEEE 802.11s-based communication.
The embodiment of the application provides a computer readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the steps of the face recognition method described in any embodiment above are implemented.
The computer readable storage medium referred to in this application includes Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The description of the relevant parts in the face recognition system, the device and the computer readable storage medium provided in the embodiments of the present application refers to the detailed description of the corresponding parts in the face recognition method provided in the embodiments of the present application, and will not be repeated here. In addition, the parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of the corresponding technical solutions in the prior art, are not described in detail, so that redundant descriptions are avoided.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A face recognition method, comprising:
acquiring a target face image to be identified;
performing face alignment on the target face image to obtain a target processing image;
performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result;
the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer; the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence;
the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network; the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, and the first splicing layer is used for splicing the input features through a skip connection and outputting the spliced features; the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, and n represents the value of the preset number;
the mask-guided fusion attention network comprises a preset number of fusion attention (FA) layers sequentially connected with the output of the feature extraction backbone network and a mask-guided attention (MGA) layer, and the MGA layer is connected with the output of the occlusion segmentation branch network; the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network; the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
2. The method of claim 1, wherein the recognition network comprises an MLP network.
3. The method according to any of claims 1 to 2, wherein the loss function of the occlusion-aware guidance network comprises:
L_overall = α·L_seg + β·L_cls
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; and s represents the length (scale) of the predicted feature x_i.
4. The method according to any of claims 1 to 2, wherein the loss function of the occlusion-aware guidance network comprises:
L_overall = α·L_seg + β·L_cls + λ·φ_fusion
wherein L_overall represents the overall loss value; α represents the weight of non-occlusion region segmentation; β represents the weight of face recognition; L_seg represents the loss of the occlusion segmentation branch network; L_cls represents the loss of the recognition network; φ_fusion represents the fusion constraint with adaptive weight λ; Ŷ represents the predicted occlusion segmentation map; Y represents the ground-truth occlusion segmentation map; n represents the total number of training samples; N represents the number of classes; θ_j represents the angle between the weight W_j and the predicted feature x_i; θ_{y_i} represents the angle between the weight W_{y_i} of the true class y_i and the predicted feature x_i; m represents the additive angular margin penalty; s represents the length (scale) of the predicted feature x_i; ||·||_2 represents the Euclidean distance; F_att represents the non-occlusion feature attention map of the synthetic occluded face image output by the MGA layer; and F'_att represents the attention feature map of the corresponding unoccluded face image output by the MGA layer, the occlusion masks used in generating F'_att coming from the generation process of F_att.
5. A face recognition system, comprising:
the acquisition module is used for acquiring a target face image to be identified;
the alignment module is used for carrying out face alignment on the target face image to obtain a target processing image;
the recognition module is used for performing face recognition on the target processing image based on a pre-trained occlusion-aware guidance network to obtain a face recognition result;
wherein the occlusion-aware guidance network comprises a feature extraction backbone network, an occlusion segmentation branch network connected with the feature extraction backbone network, a mask-guided fusion attention network connected with the feature extraction backbone network and the occlusion segmentation branch network, and a recognition network connected with the mask-guided fusion attention network and used for obtaining the face recognition result;
the feature extraction backbone network comprises a linear embedding layer, a first swin transformer layer connected with the linear embedding layer, and a preset number of feature extraction layers sequentially connected with the first swin transformer layer; the feature extraction layer comprises a first block merging layer and a second swin transformer layer which are connected in sequence;
the occlusion segmentation branch network comprises a preset number of segmentation extraction layers, a first interpolation layer and a linear projection layer which are sequentially connected with the output of the feature extraction backbone network; the segmentation extraction layer comprises a block expansion layer, a third swin transformer layer and a first splicing layer which are sequentially connected, and the first splicing layer is used for splicing the input features through a skip connection and outputting the spliced features; the first splicing layer of the ith segmentation extraction layer counted from the output of the feature extraction backbone network is connected with the input of the (n+1-i)th feature extraction layer, and n represents the value of the preset number;
the mask-guided fusion attention network comprises a preset number of fusion attention (FA) layers sequentially connected with the output of the feature extraction backbone network and a mask-guided attention (MGA) layer, and the MGA layer is connected with the output of the occlusion segmentation branch network; the FA layer comprises a block expansion layer, a second splicing layer connected with the block expansion layer and with the output of the previous layer, and a convolution dimension-reduction layer connected with the second splicing layer; the block expansion layer of the ith FA layer is connected with the first splicing layer of the ith segmentation extraction layer; the output of the previous layer for the first FA layer is the output of the feature extraction backbone network; the MGA layer comprises a second interpolation layer connected with the output of the occlusion segmentation branch network, an exponential operation layer connected with the output of the last FA layer, an element-wise multiplication layer connected with the second interpolation layer and the exponential operation layer, and a second block merging layer, a fourth swin transformer layer and an average pooling layer which are sequentially connected with the element-wise multiplication layer.
6. A face recognition device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the face recognition method according to any one of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the steps of the face recognition method according to any one of claims 1 to 4.
CN202310558627.8A 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium Active CN116563926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310558627.8A CN116563926B (en) 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310558627.8A CN116563926B (en) 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116563926A CN116563926A (en) 2023-08-08
CN116563926B (en) 2024-03-01

Family

ID=87491301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310558627.8A Active CN116563926B (en) 2023-05-17 2023-05-17 Face recognition method, system, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116563926B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728330A (en) * 2019-10-23 2020-01-24 腾讯科技(深圳)有限公司 Object identification method, device, equipment and storage medium based on artificial intelligence
CN111639596A (en) * 2020-05-29 2020-09-08 上海锘科智能科技有限公司 Anti-glasses-shielding face recognition method based on attention mechanism and residual error network
CN111814603A (en) * 2020-06-23 2020-10-23 汇纳科技股份有限公司 Face recognition method, medium and electronic device
CN111898413A (en) * 2020-06-16 2020-11-06 深圳市雄帝科技股份有限公司 Face recognition method, face recognition device, electronic equipment and medium
CN111914748A (en) * 2020-07-31 2020-11-10 平安科技(深圳)有限公司 Face recognition method and device, electronic equipment and computer readable storage medium
CN112766158A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Multi-task cascading type face shielding expression recognition method
CN112949565A (en) * 2021-03-25 2021-06-11 重庆邮电大学 Single-sample partially-shielded face recognition method and system based on attention mechanism
CN113807332A (en) * 2021-11-19 2021-12-17 珠海亿智电子科技有限公司 Mask robust face recognition network, method, electronic device and storage medium
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium
WO2022213349A1 (en) * 2021-04-09 2022-10-13 鸿富锦精密工业(武汉)有限公司 Method and apparatus for recognizing face with mask, and computer storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728330A (en) * 2019-10-23 2020-01-24 腾讯科技(深圳)有限公司 Object identification method, device, equipment and storage medium based on artificial intelligence
CN111639596A (en) * 2020-05-29 2020-09-08 上海锘科智能科技有限公司 Anti-glasses-shielding face recognition method based on attention mechanism and residual error network
CN111898413A (en) * 2020-06-16 2020-11-06 深圳市雄帝科技股份有限公司 Face recognition method, face recognition device, electronic equipment and medium
CN111814603A (en) * 2020-06-23 2020-10-23 汇纳科技股份有限公司 Face recognition method, medium and electronic device
CN111914748A (en) * 2020-07-31 2020-11-10 平安科技(深圳)有限公司 Face recognition method and device, electronic equipment and computer readable storage medium
CN112766158A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Multi-task cascading type face shielding expression recognition method
CN112949565A (en) * 2021-03-25 2021-06-11 重庆邮电大学 Single-sample partially-shielded face recognition method and system based on attention mechanism
WO2022213349A1 (en) * 2021-04-09 2022-10-13 鸿富锦精密工业(武汉)有限公司 Method and apparatus for recognizing face with mask, and computer storage medium
CN113807332A (en) * 2021-11-19 2021-12-17 珠海亿智电子科技有限公司 Mask robust face recognition network, method, electronic device and storage medium
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning 3D Face Representation with Vision Transformer for Masked Face Recognition; Yuan Wang; 2022 Asia Conference on Algorithms, Computing and Machine Learning (CACML); pp. 505-511 *
Research on Face Recognition under Unconstrained Conditions; Tu Xiaoguang; China Doctoral Dissertations Full-text Database (Information Science & Technology), No. 03; I138-38 *

Also Published As

Publication number Publication date
CN116563926A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN112668573B (en) Target detection position reliability determination method and device, electronic equipment and storage medium
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
Shao et al. Uncertainty guided multi-scale attention network for raindrop removal from a single image
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN112446322B (en) Eyeball characteristic detection method, device, equipment and computer readable storage medium
US11823437B2 (en) Target detection and model training method and apparatus, device and storage medium
CN111985374A (en) Face positioning method and device, electronic equipment and storage medium
CN114758255A (en) Unmanned aerial vehicle detection method based on YOLOV5 algorithm
CN114898111B (en) Pre-training model generation method and device, and target detection method and device
CN115577768A (en) Semi-supervised model training method and device
CN115984714A (en) Cloud detection method based on double-branch network model
CN116563840B (en) Scene text detection and recognition method based on weak supervision cross-mode contrast learning
CN116563926B (en) Face recognition method, system, equipment and computer readable storage medium
CN114565953A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN112862840B (en) Image segmentation method, device, equipment and medium
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN114581353A (en) Infrared image processing method and device, medium and electronic equipment
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN114022458A (en) Skeleton detection method and device, electronic equipment and computer readable storage medium
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium
CN116403269B (en) Method, system, equipment and computer storage medium for analyzing occlusion human face
Lin et al. P‐2.23: Deep Learning‐based System for Processing Complex Floorplans
CN117975449A (en) Semi-supervised cell instance segmentation method based on multitask learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 205, Building B1, Huigu Science and Technology Industrial Park, No. 336 Bachelor Road, Bachelor Street, Yuelu District, Changsha City, Hunan Province, 410000

Patentee after: Wisdom Eye Technology Co.,Ltd.

Country or region after: China

Address before: 410205, Changsha high tech Zone, Hunan Province, China

Patentee before: Wisdom Eye Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address