CN115588218A - Face recognition method and device - Google Patents


Info

Publication number
CN115588218A
CN115588218A
Authority
CN
China
Prior art keywords
feature map
attention
processing
convolution
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211055844.7A
Other languages
Chinese (zh)
Inventor
王夏洪
Current Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Original Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Longzhi Digital Technology Service Co Ltd
Priority claimed from CN202211055844.7A
PCT application PCT/CN2022/129343, published as WO2024045320A1
Publication of CN115588218A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to the field of computer technology and provides a face recognition method and apparatus. The method comprises the following steps: acquiring a first feature map of a face image to be recognized; performing depthwise convolution on the first feature map to obtain a second feature map; performing attention circulation processing on the second feature map to obtain a third feature map; and performing, on the third feature map in sequence, channel-expanding convolution, attention circulation processing, channel-reducing convolution, and attention circulation processing to obtain a target feature map corresponding to the first feature map.

Description

Face recognition method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a face recognition method and apparatus.
Background
Face recognition technology often needs to be deployed both in the cloud and at the edge. Constrained by the computing power and storage resources of edge devices such as embedded terminals, an edge-side face recognition model must combine high accuracy with a small model size, low computational complexity, and fast inference.
In the related art, common lightweight networks capable of performing face recognition include SqueezeNet, MobileNet, ShuffleNet, and the like; owing to the particularity of facial structure, these models achieve poor accuracy on face recognition tasks. MobileFaceNet, a mobile-side lightweight network designed specifically for face recognition, builds on MobileNet with a smaller expansion rate and replaces the global average pooling layer with a global depthwise convolution layer. However, its main building block is still the ordinary residual bottleneck module, with identical computation in every module, so its accuracy also suffers.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a face recognition method, a face recognition apparatus, an electronic device, and a computer-readable storage medium, so as to solve the problem in the prior art that the accuracy of face recognition models is insufficient.
In a first aspect of the embodiments of the present disclosure, a face recognition method is provided, the method comprising: acquiring a first feature map of a face image to be recognized; performing depthwise convolution on the first feature map to obtain a second feature map; performing attention circulation processing on the second feature map to obtain a third feature map; and performing, on the third feature map in sequence, channel-expanding convolution, attention circulation processing, channel-reducing convolution, and attention circulation processing to obtain a target feature map corresponding to the first feature map.
In a second aspect of the embodiments of the present disclosure, a face recognition apparatus is provided, the apparatus comprising: an acquisition module for acquiring a first feature map of a face image to be recognized; a convolution module for performing depthwise convolution on the first feature map to obtain a second feature map; an attention circulation module for performing attention circulation processing on the second feature map to obtain a third feature map; and a mixed processing module for performing, on the third feature map in sequence, channel-expanding convolution, attention circulation processing, channel-reducing convolution, and attention circulation processing to obtain a target feature map corresponding to the first feature map.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.
Compared with the prior art, embodiments of the present disclosure provide the following beneficial effects: feature-map processing for face recognition is performed by combining convolution with attention circulation processing, which promotes the circulation of attention across multiple directional dimensions, so that the final feature map is highly discriminative in every directional dimension, thereby improving the recognition accuracy of the face recognition model.
Specifically, embodiments of the present disclosure provide a lightweight attention circulation module. Its tensor dimensionality is very low, so convolution over these low-dimensional tensors is cheap and the overall module runs fast. Extracting features for the entire network in a low-dimensional space could, however, lead to incomplete information and unreliable features; the disclosed embodiments therefore expand the channel count by a configured expansion coefficient during the middle convolution stage, which improves the feature extraction capability of the module and strikes a balance between computation and feature expression capability.
In the disclosed embodiments, the attention circulation module as a whole combines different types of convolution, expansion and compression of the channel count, and attention circulation techniques so that the attention relevant to the face recognition task circulates and converts between space and channels. Feature fusion becomes more efficient, and the feature map is ultimately focused on the regions of interest for face recognition. In addition, the attention circulation module has few parameters, a small computational cost, and high speed.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.
FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of a face recognition method according to an embodiment of the present disclosure;
fig. 3 is a flow chart of attention flow processing provided by an embodiment of the present disclosure;
fig. 4 is a schematic flow chart of another face recognition method provided in the embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a face recognition apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A face recognition method and apparatus according to an embodiment of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a scene schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting communication with server 104, including but not limited to smartphones, robots, laptop computers, desktop computers, and the like (e.g., 102 may be a robot); when they are software, they may be installed in electronic devices such as those above. The terminal devices 101, 102, and 103 may be implemented as multiple pieces of software or software modules, or as a single piece of software or a single software module, which is not limited by the embodiments of the present disclosure. Furthermore, various applications may be installed on the terminal devices 101, 102, and 103, such as data processing applications, instant messaging tools, social platform software, search applications, and shopping applications.
The server 104 may be a server providing various services, for example, a backend server receiving a request sent by a terminal device establishing a communication connection with the server, and the backend server may receive and analyze the request sent by the terminal device and generate a processing result. The server 104 may be a server, may also be a server cluster composed of a plurality of servers, or may also be a cloud computing service center, which is not limited in this disclosure.
The server 104 may be hardware or software. When the server 104 is hardware, it may be various electronic devices that provide various services to the terminal devices 101, 102, and 103. When the server 104 is software, it may be multiple software or software modules providing various services for the terminal devices 101, 102, and 103, or may be a single software or software module providing various services for the terminal devices 101, 102, and 103, which is not limited by the embodiment of the present disclosure.
The network 105 may be a wired network connected by coaxial cable, twisted pair, or optical fiber, or a wireless network that interconnects communication devices without wiring, for example Bluetooth, Near Field Communication (NFC), or infrared, which is not limited by the embodiments of the present disclosure.
The target user can establish a communication connection with the server 104 via the network 105 through the terminal devices 101, 102, and 103 to receive or transmit information or the like. It should be noted that the specific types, numbers and combinations of the terminal devices 101, 102 and 103, the server 104 and the network 105 may be adjusted according to the actual requirements of the application scenario, and the embodiment of the present disclosure does not limit this.
In the related art, the computing power and storage resources of edge devices such as embedded terminals are limited and can support only small models, and general lightweight face models do not achieve high recognition accuracy.
To solve this technical problem, embodiments of the present disclosure provide a face recognition scheme with a compact and effective lightweight general model for extracting face features: a face recognition model with real-time response designed specifically for edge and embedded devices, improving the accuracy of face recognition.
Specifically, the technical solution of the disclosed embodiments provides a general attention circulation technique that effectively captures attention in space and in channels respectively, and improves feature discriminability through channel-wise learnable nonlinear mappings. The technique as a whole extracts effective feature combinations, thereby promoting the circulation of attention across multiple directional dimensions.
Fig. 2 is a schematic flow chart of a face recognition method according to an embodiment of the present disclosure. The method provided by the embodiment of the disclosure can be executed by any electronic equipment with computer processing capability, such as a terminal or a server. As shown in fig. 2, the face recognition method includes:
step S201, a first feature map of a face image to be recognized is obtained.
Specifically, the first feature map is a 4-dimensional tensor with dimensions (N, C, H, W), where N is the batch size, C the number of channels, H the height, and W the width. The first feature map is obtained by feature extraction on the face image to be recognized.
Step S202, performing depthwise convolution on the first feature map to obtain a second feature map.
Specifically, a depthwise convolution (DWConv) convolves each channel independently: in ordinary convolution every kernel spans all input channels, whereas in depthwise convolution each kernel operates on exactly one channel.
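This per-channel behavior can be sketched in NumPy (a minimal sketch with hypothetical shapes; the bias-free plain correlation here is an assumption, not the patent's exact implementation):

```python
import numpy as np

def depthwise_conv2d(x, kernels, stride=1, padding=1):
    """Depthwise convolution: one k x k filter per channel, channels kept independent."""
    N, C, H, W = x.shape
    k = kernels.shape[-1]                      # kernels: (C, k, k)
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    Ho = (H + 2 * padding - k) // stride + 1
    Wo = (W + 2 * padding - k) // stride + 1
    out = np.zeros((N, C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[:, :, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, :, i, j] = (patch * kernels[None]).sum(axis=(2, 3))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 6, 6))          # (N, C, H, W)
w = rng.standard_normal((4, 3, 3))             # one 3x3 kernel per channel
y = depthwise_conv2d(x, w)
```

Because each output channel reads only its own input channel, zeroing one input channel changes only the corresponding output channel, which is the defining property of DWConv.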
In step S203, attention circulation processing is performed on the second feature map to obtain a third feature map.
In particular, attention circulation processing lets attention flow between the spatial and channel dimensions, enabling more efficient feature fusion.
Step S204, performing, on the third feature map in sequence, channel-expanding convolution, attention circulation processing, channel-reducing convolution, and attention circulation processing to obtain the target feature map corresponding to the first feature map.
Specifically, the channel-expanding and channel-reducing convolutions are a corresponding pair of ordinary convolutions: the expanding convolution first increases the number of channels, and the reducing convolution then restores the channel count to its previous value.
According to the technical solution of the disclosed embodiments, attention circulation processing extracts effective feature combinations and promotes attention circulation across multiple directional dimensions. By designing and combining attention circulation processing with different types of convolution, both the requirements of the face recognition task and the lightweight requirements of embedded devices can be met, achieving higher recognition accuracy with fewer parameters than the prior art.
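The expand-then-restore pattern of step S204 can be sketched as follows (NumPy; the 1×1 kernels, the expansion factor n, and the placeholder middle step are illustrative assumptions, not the patent's exact configuration):

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: weight has shape (C_out, C_in), x has shape (N, C_in, H, W)."""
    return np.einsum('oc,nchw->nohw', weight, x)

rng = np.random.default_rng(0)
N, C, H, W = 2, 8, 7, 7
n = 4                                      # hypothetical channel expansion factor
x = rng.standard_normal((N, C, H, W))

W_up = rng.standard_normal((n * C, C))     # channel-expanding convolution weights
W_down = rng.standard_normal((C, n * C))   # channel-reducing convolution weights

mid = pointwise_conv(x, W_up)              # (N, n*C, H, W): richer feature space
# ... attention circulation processing would run on the expanded tensor here ...
out = pointwise_conv(mid, W_down)          # (N, C, H, W): channel count restored
```

The reducing convolution undoes the expanding one dimensionally, so whatever runs in between sees a wider feature space while the block's interface stays fixed.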
As shown in fig. 3, the attention circulation processing in steps S203 and S204 includes the following steps:
step S301, a first dimension and a second dimension of the input feature map are flattened to obtain a first intermediate feature map.
In particular, the first dimension may be the height and the second dimension the width. Assume the input feature map is f_1; flattening f_1 transforms its dimensions from (N, C, H, W) to (N, C, R), where R = H × W.
Step S302, a second intermediate feature map is obtained according to the first intermediate feature map and the first learnable parameter matrix.
In the technical solution of the disclosed embodiments, a first product of the first intermediate feature map and its softmax value may be obtained, and the second intermediate feature map obtained from the mean of that product. Specifically, the first intermediate feature map may be right-multiplied by the first learnable parameter matrix to obtain a tensor; the Hadamard product of the softmax of that tensor with the tensor itself gives a matrix; and averaging that matrix along a certain dimension yields the second intermediate feature map. The first learnable parameter matrix learns attention circulation information in the spatial dimension.
Step S303, acquiring a spatial attention feature map according to the product of the second intermediate feature map and the input feature map.
Specifically, the spatial attention feature map is a feature map with fused spatial attention.
Step S304, a channel attention feature map is obtained according to the second learnable parameter matrix, the third learnable parameter matrix and the spatial attention feature map, wherein a first dimension of the second learnable parameter matrix is equal to a second dimension of the third learnable parameter matrix, and the first dimension of the third learnable parameter matrix is equal to the second dimension of the second learnable parameter matrix.
Specifically, the spatial attention feature map may be right-multiplied by the second learnable parameter matrix to obtain a second product; the second product is then sparsified and right-multiplied by the third learnable parameter matrix to obtain the channel attention feature map. The second and third learnable parameter matrices learn attention circulation information in the channel dimension: by capturing feature relationships between different channels, the weight of each channel is learned, making the features more discriminative for each channel's information.
In step S305, an attention flow feature map is obtained according to the spatial attention feature map and the channel attention feature map.
Specifically, when acquiring the attention circulation feature map from the spatial attention feature map and the channel attention feature map, the spatial attention feature map may first be nonlinearly mapped to obtain a third intermediate feature map; a fourth intermediate feature map is then obtained from the product of the third intermediate feature map and the channel attention feature map; and the fourth intermediate feature map is nonlinearly mapped to obtain the attention circulation feature map. An attention circulation feature map obtained in this way learns attention circulation information in both the spatial and channel dimensions, enhancing the accuracy of attention circulation in both.
The following is a detailed description of steps S301 to S305:
In step S301, assume the input feature map f_1 has dimensions (N, C, H, W). The two dimensions H and W of f_1 are flattened, transforming the dimensions to (N, C, R) with R = H × W and yielding the first intermediate feature map.
To learn attention in the H × W dimension of the features, so that attention circulates in the spatial dimension, the disclosed embodiment introduces a first learnable parameter matrix Q_1 of dimensions (R, r), where r < R.
In step S302, the first intermediate feature map obtained after the dimension transformation is right-multiplied by Q_1 to obtain a tensor f'_1 of dimensions (N, C, r). A softmax over the r dimension of f'_1 gives a tensor A_s of the same dimensions (N, C, r). Multiplying the corresponding elements of f'_1 and A_s along the r dimension gives their Hadamard product, a matrix M_1 of size (N, C, r); M_1 represents a fusion of various feature combinations, and the larger r is, the higher the complexity. Averaging M_1 over the r dimension compresses that dimension to 1 and yields the second intermediate feature map f̄_1 of dimensions (N, C). The calculation is shown in formula (1):

f̄_1 = avg_r(M_1) = avg_r(f'_1 ⊙ softmax_r(f'_1)),  with f'_1 = flatten(f_1) Q_1    (1)
In the disclosed embodiment, the first learnable parameter matrix Q_1 is introduced in order to compute r spatially linear-transformed results, from which representative feature combinations in space can be extracted. In the extracted face feature map, although every spatial pixel has the same receptive field, the regions of the original image those receptive fields map to differ, and their contributions to the final recognition task differ, so different pixels should be given different weights. Using the first learnable parameter matrix Q_1, attention in the H × W dimension of the features can be learned, so that attention circulates in the spatial dimension and a fusion of various feature combinations is obtained.
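Steps S301 and S302 can be sketched as follows (NumPy; the concrete shapes and the small rank r are illustrative assumptions):

```python
import numpy as np

def spatial_descriptor(x, Q1):
    """Flatten H, W into R, project with Q1 (R, r), softmax over r,
    Hadamard-multiply, and average over r to get an (N, C) descriptor."""
    N, C, H, W = x.shape
    f = x.reshape(N, C, H * W)                 # first intermediate map: (N, C, R)
    fp = f @ Q1                                # f'_1: (N, C, r)
    a = np.exp(fp - fp.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)      # A_s: softmax over the r dimension
    return (fp * a).mean(axis=-1)              # avg over r of M_1 = f'_1 * A_s -> (N, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 5, 5))          # (N, C, H, W), so R = 25
Q1 = rng.standard_normal((25, 6)) * 0.1        # r = 6 < R, a hypothetical choice
desc = spatial_descriptor(x, Q1)               # (N, C) second intermediate map
```

Each of the r columns of Q1 is one learned linear combination of spatial positions; the softmax-weighted average then fuses those r combinations into one value per channel.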
In step S303, the second intermediate feature map f̄_1 output by step S302 is multiplied with f_1 to obtain the spatial attention feature map f_1^s, of dimensions (N, C, H, W). The calculation is shown in formula (2):

f_1^s = f̄_1 · f_1  (broadcast over H and W)    (2)

where f_1^s is the feature map with spatial attention fused in.
In step S304, the spatial attention feature map of dimensions (N, C, H, W) is processed with a second learnable parameter matrix Q_2 and a third learnable parameter matrix Q_3 to obtain the channel attention feature map f̂_1. Specifically, Q_2 has dimensions (C, C//p) and Q_3 has dimensions (C//p, C), where p is a natural number. The first dimension of Q_2 thus equals the second dimension of Q_3, and the first dimension of Q_3 equals the second dimension of Q_2. The (N, C) representation v from the spatial attention branch is right-multiplied by Q_2 to obtain a tensor of dimensions (N, C//p), sparsified with relu, and right-multiplied by Q_3 to obtain the channel attention feature map f̂_1 of dimensions (N, C). The calculation is shown in formula (3):

f̂_1 = relu(v Q_2) Q_3    (3)

Introducing Q_2 and Q_3 allows attention circulation information in the channel dimension to be learned; this part of the design focuses on the feature relationships between channels, and by capturing the feature relationships between different channels it learns the weight of each channel, making the features more discriminative for each channel's information. p is a scaling coefficient; choosing p appropriately reduces computation and controls model size.
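The bottleneck of formula (3) can be sketched as follows (NumPy; the (N, C) input descriptor and p = 2 are illustrative assumptions):

```python
import numpy as np

def channel_attention(v, Q2, Q3):
    """v: (N, C) descriptor. Q2 (C, C//p) compresses, relu sparsifies,
    Q3 (C//p, C) restores, yielding per-channel attention of shape (N, C)."""
    h = np.maximum(v @ Q2, 0.0)   # relu sparsification, shape (N, C//p)
    return h @ Q3                 # back to (N, C)

rng = np.random.default_rng(0)
C, p = 8, 2                        # p is the scaling coefficient from the text
v = rng.standard_normal((3, C))
Q2 = rng.standard_normal((C, C // p))
Q3 = rng.standard_normal((C // p, C))
att = channel_attention(v, Q2, Q3)
# Parameter count: 2 * C * (C // p), i.e. (2/p) * C*C, so p directly trades
# model size and computation against capacity, as the text notes.
```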
The spatial attention feature map f_1^s is then passed through a nonlinear mapping to obtain a third intermediate feature map f_s, as shown in formulas (4) and (5):

f_s = φ(f_1^s),  with φ applied channel by channel    (4)(5)

where i denotes the i-th channel: the nonlinear mapping is performed channel by channel on the feature map, the nonlinear mapping function of each channel may differ, and the mapping parameters ε_i and k_i of each channel must be learned.
When data is processed with this nonlinear mapping, negative inputs are not simply mapped to 0 as with relu; both positive and negative responses of a convolution kernel are accepted, that is, the face model is allowed to learn from negative inputs. Applying such a nonlinear mapping can capture more complex relationships in the data. Second, it is beneficial to learn the mapping values depth by depth, that is, to perform channel-independent weight learning, which can be regarded as a form of attention learning across channels and enhances the accuracy of attention circulation between channels. In addition, with the channel-by-channel mapping, the nonlinearity gradually strengthens as depth increases: the model tends to preserve information in shallow layers and to enhance discriminative power in deep layers. This matches the common observation that low-level feature maps have high resolution and rich spatial information but weak semantics, while high-level feature maps have low resolution but strong semantics.
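As an assumed illustration only (the exact per-channel form of formulas (4) and (5) is not reproduced here), a mapping that keeps negative responses, unlike relu, and learns ε_i and k_i per channel could look like:

```python
import numpy as np

def channelwise_nonlinear(x, eps, k):
    """Hypothetical per-channel mapping: identity for x >= 0, an affine
    branch eps_i * x + k_i for x < 0, so negative responses are retained
    rather than zeroed. eps and k are the learnable per-channel parameters
    the text describes; the patent's exact formulas may differ."""
    e = eps[None, :, None, None]
    b = k[None, :, None, None]
    return np.where(x >= 0.0, x, e * x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, 4, 4))
eps = np.array([0.25, 0.5, 1.0])   # one slope per channel
k = np.zeros(3)                    # one offset per channel
y = channelwise_nonlinear(x, eps, k)
```

With per-channel parameters, each channel learns its own response to negative inputs, which is the channel-independent weight learning the text describes.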
Further, f_s is multiplied with the channel attention feature map f̂_1 to obtain the fourth intermediate feature map f_c, of dimensions (N, C, H, W). The calculation is shown in formula (6):

f_c = f_s · f̂_1  (broadcast over H and W)    (6)

To further enhance the expressive power of the features, the fourth intermediate feature map f_c is passed through another channel-by-channel nonlinear mapping with learnable parameters to obtain the attention circulation feature map f_C, as shown in formulas (7) and (8):

f_C = φ'(f_c),  with φ' applied channel by channel    (7)(8)

f_C represents a feature map in which attention has circulated sufficiently in both the spatial and channel directions, so that the attention flow of interest spans the entire feature space.
From the above, f_C has dimensions (N, C, H, W), consistent with the input feature map f_1, so the attention circulation technique can be inserted as a plug-and-play module into any module and any position of a neural network, and its usage is flexible. The attention circulation technique mainly performs more effective feature fusion through the circulation of attention between space and channels, and enhances feature expression capability through channel-by-channel learnable nonlinear mappings that accept both positive and negative responses, so that more discriminative face features can be extracted. If this attention circulation technique is defined as a function SC with input f_1 and output f_C, the following formula (9) is obtained:

f_C = SC(f_1)    (9)
In the disclosed embodiments, an attention circulation module may be built from this attention circulation technique as a basic building block of a neural network. Through a refined convolution module design tailored to the structural particularity of the face, the module extracts strongly discriminative face features with minimal computation and effectively focuses the feature map's attention on regions that benefit the recognition task.
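Putting steps S301 to S305 together, a shape-preserving SC sketch might read as follows (NumPy; feeding the (N, C) spatial descriptor into the channel branch and the per-channel nonlinear form are assumptions, as are all concrete shapes):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sc(x, Q1, Q2, Q3, eps_s, k_s, eps_c, k_c):
    """Attention circulation SC: input and output both (N, C, H, W)."""
    N, C, H, W = x.shape
    # S301/S302: spatial branch -> (N, C) descriptor (formula (1))
    f = x.reshape(N, C, H * W)
    fp = f @ Q1                                  # (N, C, r)
    fbar = (fp * softmax(fp)).mean(axis=-1)      # (N, C)
    # S303: spatial attention map (formula (2)), broadcast over H and W
    fs_map = x * fbar[:, :, None, None]          # (N, C, H, W)
    # S304: channel attention (formula (3)); using fbar as input is an assumption
    fhat = np.maximum(fbar @ Q2, 0.0) @ Q3       # (N, C)
    # S305: assumed per-channel nonlinear mappings and fusion (formulas (4)-(8))
    def phi(t, eps, k):
        return np.where(t >= 0.0, t,
                        eps[None, :, None, None] * t + k[None, :, None, None])
    f_s = phi(fs_map, eps_s, k_s)                # third intermediate map
    f_c = f_s * fhat[:, :, None, None]           # fourth intermediate map
    return phi(f_c, eps_c, k_c)                  # f_C, same shape as x

rng = np.random.default_rng(0)
N, C, H, W, r, p = 2, 8, 6, 6, 4, 2
x = rng.standard_normal((N, C, H, W))
out = sc(
    x,
    rng.standard_normal((H * W, r)) * 0.1,
    rng.standard_normal((C, C // p)),
    rng.standard_normal((C // p, C)),
    np.full(C, 0.25), np.zeros(C),
    np.full(C, 0.25), np.zeros(C),
)
```

Because `out` has the same shape as `x`, the block can be dropped into a network at any position, which is the plug-and-play property the text claims.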
When the attention circulation module is applied in step S201 to step S204, the implementation process of step S201 to step S204 may be detailed as follows:
In step S202, depth-wise convolution processing may be performed on the first feature map, and batch normalization may be performed on the depth-wise convolution result to obtain the second feature map. Specifically, a depth-wise convolution (DWConv) with an n × n kernel (n > 1), C input channels, C output channels, a padding of 1 and a stride of s may be performed, followed by batch normalization (BatchNorm, BN for short) to obtain the result f'_1. Taking n = 3 as an example, the specific calculation process is shown in the following formula (10):

f'_1 = BN(DWConv(f_1, 3×3))  (10)
The stride varies with the network design and is a configurable hyper-parameter. In the embodiment of the disclosure, following the idea of designing a compact module, depth-wise convolution is adopted in place of ordinary convolution to reduce the parameter count; the parameter count of the depth-wise convolution can be calculated to be 1/C of that of ordinary convolution. It should be noted that the 3 × 3 convolution may be replaced with a 5 × 5 or 7 × 7 kernel, but the 3 × 3 convolution is the most cost-effective.
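The depth-wise convolution and batch normalization of formula (10) can be sketched in PyTorch as follows, assuming C = 64 and s = 1; the sketch also checks the stated 1/C parameter ratio between depth-wise and ordinary convolution.

```python
import torch
import torch.nn as nn

C = 64
# groups=C makes each kernel operate on exactly one channel (depth-wise)
dwconv = nn.Conv2d(C, C, kernel_size=3, stride=1, padding=1,
                   groups=C, bias=False)
bn = nn.BatchNorm2d(C)

f1 = torch.randn(1, C, 56, 56)
f1p = bn(dwconv(f1))            # f'_1 = BN(DWConv(f_1, 3x3))
assert f1p.shape == f1.shape

# parameter comparison: depth-wise uses 1/C the parameters of ordinary conv
ordinary = nn.Conv2d(C, C, 3, padding=1, bias=False)
dw_params = sum(p.numel() for p in dwconv.parameters())      # C * 3 * 3
full_params = sum(p.numel() for p in ordinary.parameters())  # C * C * 3 * 3
assert full_params == C * dw_params
```

The final assertion confirms the 1/C parameter saving claimed above.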
In step S203, the output f'_1 of step S202 is subjected to the attention circulation calculation described above to obtain f''_1. The specific calculation process is shown in the following formula (11):

f''_1 = SC(f'_1)  (11)
In step S204, the channel-increasing convolution processing includes: performing a convolution that increases the number of channels by a factor of N on the input feature map, and performing batch normalization on the convolution result, where N is a natural number. The channel-reducing convolution processing includes: performing a convolution that reduces the number of channels to 1/N on the input feature map, and performing batch normalization on the convolution result. Specifically, in step S204, the following steps may be performed in order:
the output f''_1 of step S203 is subjected to a convolution (Conv) with a 1 × 1 kernel, C input channels, C × expansion output channels and a stride of 1, followed by batch normalization, to obtain the result f_2. The specific calculation process is shown in the following formula (12):

f_2 = BN(Conv(f''_1, 1×1))  (12)
f_2 is then subjected to the attention circulation calculation described above to obtain f'_2. The specific calculation process is shown in the following formula (13):

f'_2 = SC(f_2)  (13)
f'_2 is subjected to a convolution with a 1 × 1 kernel, C × expansion input channels, C output channels and a stride of 1, followed by batch normalization, to obtain the result f_3. The specific calculation process is shown in the following formula (14):

f_3 = BN(Conv(f'_2, 1×1))  (14)
finally, f_3 is subjected to the attention circulation calculation described above to obtain f'_3. The specific calculation process is shown in the following formula (15):

f'_3 = SC(f_3)  (15)
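The chain of formulas (10)–(15) can be sketched as one module in PyTorch. Here the SC function is stubbed out as an identity so that only the data flow and tensor shapes are illustrated; the argument names (expansion, stride) are illustrative, not terms fixed by this disclosure.

```python
import torch
import torch.nn as nn

class SCStub(nn.Module):
    """Placeholder for the attention circulation (SC) function."""
    def forward(self, x):
        return x

class AttentionCirculationModule(nn.Module):
    def __init__(self, c: int, expansion: int = 2, stride: int = 1):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, stride=stride, padding=1,
                            groups=c, bias=False)              # depth-wise conv
        self.bn1 = nn.BatchNorm2d(c)
        self.sc1 = SCStub()
        self.expand = nn.Conv2d(c, c * expansion, 1, bias=False)  # channel-increasing
        self.bn2 = nn.BatchNorm2d(c * expansion)
        self.sc2 = SCStub()
        self.reduce = nn.Conv2d(c * expansion, c, 1, bias=False)  # channel-reducing
        self.bn3 = nn.BatchNorm2d(c)
        self.sc3 = SCStub()

    def forward(self, f1):
        f1p = self.bn1(self.dw(f1))        # (10)  f'_1 = BN(DWConv(f_1))
        f1pp = self.sc1(f1p)               # (11)  f''_1 = SC(f'_1)
        f2 = self.bn2(self.expand(f1pp))   # (12)  f_2 = BN(Conv 1x1, C -> C*exp)
        f2p = self.sc2(f2)                 # (13)  f'_2 = SC(f_2)
        f3 = self.bn3(self.reduce(f2p))    # (14)  f_3 = BN(Conv 1x1, C*exp -> C)
        return self.sc3(f3)                # (15)  f'_3 = SC(f_3), target feature map

m = AttentionCirculationModule(64, expansion=2, stride=2)
y = m(torch.randn(1, 64, 56, 56))
assert y.shape == (1, 64, 28, 28)  # stride 2 halves the spatial resolution
```

Note how the module restores the channel count to C at the output, so it can be stacked freely, as in the network of fig. 4 below.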
The embodiment of the disclosure provides a lightweight attention circulation module designed specifically for face recognition. Its convolution design and its linear and nonlinear mappings follow two principles: first, reduce network parameters, save computation and increase running speed; second, perform more effective feature fusion in the spatial and channel dimensions, enhance feature expression capability, and extract more discriminative face features.
The base channel number of the attention circulation module in the embodiment of the present disclosure can be set to 64; the tensor dimensions of the module are thus very low, the convolution cost on low-dimensional tensors is very small, and a relatively high overall running speed can be achieved. However, if the whole network extracted features only in a low-dimensional space, the information could be incomplete and the features unreliable. In the embodiment of the disclosure, the number of channels is therefore expanded by a set expansion coefficient in the intermediate convolution, which improves the feature extraction capability of the whole module and strikes a delicate balance between computation and feature expression capability.
In the embodiment of the disclosure, through the combination of different types of convolution, expansion and compression of the channel number, and the attention circulation technique, the attention circulation module lets the attention flow relevant to the face recognition task circulate and transform between space and channels, making feature fusion more efficient and finally focusing the feature map effectively on the regions of interest for face recognition. In addition, the module has few parameters, a small computation cost and high speed.
As shown in fig. 4, a face recognition method provided in the embodiments of the present disclosure includes the following steps:
Step S401: the face image to be recognized is input into a convolution layer and normalization layer with a 3 × 3 kernel, 64 channels and a stride of 1. In one embodiment, the resolution of the face image to be recognized is (1, 3, 112, 112). The resolution of the feature map output in step S401 is (1, 64, 112, 112).
Step S402: the feature map obtained in the previous step is input into 1 attention circulation module with a base channel number of 64, an expansion coefficient of 1 and a stride of 2. The resolution of the feature map output in step S402 is (1, 64, 56, 56).
Step S403: the feature map obtained in the previous step is input into 1 attention circulation module with a base channel number of 64, an expansion coefficient of 1 and a stride of 1. The resolution of the feature map output in step S403 is (1, 64, 56, 56).
Step S404: the feature map obtained in the previous step is input into 1 attention circulation module with a base channel number of 64, an expansion coefficient of 2 and a stride of 2. The resolution of the feature map output in step S404 is (1, 64, 28, 28).
Step S405: the feature map obtained in the previous step is input into 4 attention circulation modules with a base channel number of 64, an expansion coefficient of 2 and a stride of 1. The resolution of the feature map output in step S405 is (1, 64, 28, 28).
Step S406: the feature map obtained in the previous step is input into 1 attention circulation module with a base channel number of 128, an expansion coefficient of 2 and a stride of 2. The resolution of the feature map output in step S406 is (1, 128, 14, 14).
Step S407: the feature map obtained in the previous step is input into 6 attention circulation modules with a base channel number of 128, an expansion coefficient of 2 and a stride of 1. The resolution of the feature map output in step S407 is (1, 128, 14, 14).
Step S408: the feature map obtained in the previous step is input into 1 attention circulation module with a base channel number of 128, an expansion coefficient of 2 and a stride of 2. The resolution of the feature map output in step S408 is (1, 128, 7, 7).
Step S409: the feature map obtained in the previous step is input into 2 attention circulation modules with a base channel number of 128, an expansion coefficient of 2 and a stride of 1. The resolution of the feature map output in step S409 is (1, 128, 7, 7).
Step S410: the feature map obtained in the previous step is input into a convolution layer and normalization layer with a 1 × 1 kernel and 512 channels. The resolution of the feature map output in step S410 is (1, 512, 7, 7).
Step S411: the feature map obtained in the previous step is input into a convolution layer and normalization layer with a 7 × 7 kernel and 512 channels. The resolution of the feature map output in step S411 is (1, 512, 1, 1).
Step S412: after the feature map obtained in the previous step is flattened, a (512, 512) fully connected matrix calculation is performed to obtain a 512-dimensional vector as the target feature map.
In the face recognition method shown in fig. 4, steps S402 and S403 may be regarded as one stage, steps S404 and S405 as one stage, steps S406 and S407 as one stage, and steps S408 and S409 as one stage; the numbers of attention circulation modules contained in the four stages are (2, 5, 7, 3) respectively. This combination of attention circulation modules is merely exemplary, and the technical effect of the technical solution of the embodiments of the present disclosure can also be achieved with other combinations of attention circulation modules.
The technical solution of the embodiment of the disclosure provides a general attention circulation technique that effectively captures attention in space and in channels respectively, and improves feature discrimination through a channel-by-channel learnable nonlinear mapping; the technique as a whole can extract effective feature combinations, promoting the circulation of attention across multiple dimensions.
According to the face recognition method of the embodiment of the disclosure, feature map processing for face recognition is performed through the combination of convolution processing and attention circulation processing, promoting the circulation of attention across multiple dimensions, so that the final feature map is more discriminative in every dimension and the recognition accuracy of the face recognition model is improved.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. The face recognition apparatus described below and the face recognition method described above may be referred to in correspondence with each other. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 5 is a schematic diagram of a face recognition apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the face recognition apparatus includes:
the obtaining module 501 may be configured to obtain a first feature map of a face image to be recognized.
Specifically, the first feature map is a 4-dimensional tensor having dimensions (N, C, H, W), where N represents the number of batch images, C represents the number of channels, H represents the height, and W represents the width. The first feature map is obtained by extracting features of the face image to be recognized.
The convolution module 502 may be configured to perform depth-by-depth convolution processing on the first feature map to obtain a second feature map.
Specifically, the depth-wise convolution performs the convolution operation within each individual channel: in conventional convolution each kernel computes over every channel, whereas in depth-wise convolution each kernel computes over only one channel.
The attention circulation module 503 may be configured to perform attention circulation processing on the second feature map to obtain a third feature map.
In particular, the attention-flow process may cause attention to flow between space and channels for more efficient feature fusion.
The hybrid processing module 504 may be configured to perform convolution processing of increasing channels, attention circulation processing, convolution processing of decreasing channels, and attention circulation processing on the third feature map in sequence to obtain a target feature map corresponding to the first feature map.
Specifically, the convolution processing for increasing the channels and the convolution processing for decreasing the channels are two corresponding conventional convolution calculation processes, the convolution processing for increasing the channels is performed first to increase the number of the channels, and then the convolution processing for decreasing the channels is performed to restore the number of the channels to the previous number.
According to the technical solution of the embodiment of the disclosure, the attention circulation processing can extract effective feature combinations and promote the circulation of attention across multiple dimensions. Through the design and combination of the attention circulation technique with different types of convolution, the requirements of the face recognition task and the lightweight requirements of embedded devices can be met simultaneously; compared with the prior art, higher recognition accuracy can be achieved with fewer parameters.
In this embodiment of the present disclosure, the attention circulation module 503 may be further configured to flatten a first dimension and a second dimension of the input feature map to obtain a first intermediate feature map; obtain a second intermediate feature map according to the first intermediate feature map and a first learnable parameter matrix; obtain a spatial attention feature map according to the product of the second intermediate feature map and the input feature map; obtain a channel attention feature map according to a second learnable parameter matrix, a third learnable parameter matrix and the spatial attention feature map, wherein a first dimension of the second learnable parameter matrix is equal to a second dimension of the third learnable parameter matrix, and a first dimension of the third learnable parameter matrix is equal to a second dimension of the second learnable parameter matrix; and obtain an attention circulation feature map according to the spatial attention feature map and the channel attention feature map.
In the technical solution of the embodiment of the present disclosure, a first product of the first intermediate feature map and its logistic regression (softmax) function value may be obtained, and the second intermediate feature map may then be obtained from the mean of the first product. Specifically, the first intermediate feature map may be right-multiplied by the first learnable parameter matrix to obtain a tensor; the Hadamard product of that tensor and its softmax function value is then computed to obtain a matrix, and the matrix is averaged along a certain dimension to obtain the second intermediate feature map.
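The computation just described can be sketched numerically as follows, with illustrative tensor sizes (the value of r, the number of spatial linear transformation results, is an assumption):

```python
import torch
import torch.nn.functional as F

N, C, H, W = 2, 4, 8, 8
r = 6
f1 = torch.randn(N, C, H, W).flatten(2)   # first intermediate map: (N, C, H*W)
Q1 = torch.randn(H * W, r)                # first learnable parameter matrix
t = f1 @ Q1                               # right-multiply by Q1: (N, C, r)
first_product = F.softmax(t, dim=-1) * t  # Hadamard product with its softmax
second = first_product.mean(dim=-1)       # average over the last dimension
assert second.shape == (N, C)             # second intermediate feature map
```

Broadcasting the (N, C) result over the spatial dimensions of the input then yields the spatial attention feature map described above.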
Specifically, the spatial attention feature map is a feature map with fused spatial attention. The first learnable parameter matrix may learn attention flow information in a spatial dimension. The second learnable parameter matrix and the third learnable parameter matrix can learn attention circulation information on channel dimensions, and the weight of each channel is learned by capturing feature relations among different channels, so that features are more discriminative for information of each channel. According to the attention circulation feature map obtained from the space attention feature map and the channel attention feature map, attention circulation information in the space dimension and the channel dimension can be learned, and therefore accuracy of attention circulation in the space dimension and the channel dimension can be enhanced.
In this embodiment of the present disclosure, the attention diverting module 503 may be further configured to perform a nonlinear mapping process on the spatial attention feature map to obtain a third intermediate feature map; obtaining a fourth intermediate feature map according to the product of the third intermediate feature map and the channel attention feature map; and carrying out nonlinear mapping processing on the fourth intermediate characteristic diagram to obtain an attention circulation characteristic diagram.
In the disclosed embodiment, applying such a nonlinear mapping makes it possible to learn more complex relationships in the data. Learning the mapping values depth by depth, i.e. performing channel-independent weight learning, can be regarded as a way of learning attention between different channels, and it enhances the accuracy of attention circulation between channels. In addition, with the channel-by-channel mapping, the nonlinear mapping gradually becomes more "nonlinear" as depth increases: the model tends to preserve information in the shallow layers and enhance discriminative power in the deep layers. This matches the common observation that low-level feature maps have high resolution and rich spatial information but weak semantics, while high-level feature maps have low resolution but strong semantic information.
In the embodiment of the present disclosure, the attention circulation module 503 may be further configured to obtain a first product of the first intermediate feature map and its logistic regression function value, and to obtain the second intermediate feature map according to the mean of the first product.
In this embodiment of the present disclosure, the attention circulation module 503 may be further configured to right-multiply the spatial attention feature map by the second learnable parameter matrix to obtain a second product, and to sparsify the second product and right-multiply it by the third learnable parameter matrix to obtain the channel attention feature map.
In the disclosed embodiment, the first learnable parameter matrix Q_1 is introduced to compute r spatial linear transformation results, from which representative spatial feature combinations can be extracted. In the extracted face feature map, although every spatial pixel has the same receptive field, those receptive fields map to different regions of the original image and contribute differently to the final recognition task, so different pixels should be given different weights. Using the first learnable parameter matrix Q_1, attention over the H × W dimensions of the features can be learned, so that attention circulates in the spatial dimension and a fusion of multiple feature combinations is obtained. The second learnable parameter matrix Q_2 and the third learnable parameter matrix Q_3 are introduced to learn attention circulation information in the channel dimension; this design focuses on the feature relationships between channels, and by capturing the relationships between different channels the weight of each channel is learned, making the features more discriminative with respect to each channel's information.
In this embodiment of the disclosure, the hybrid processing module 504 may be further configured such that the channel-increasing convolution processing includes: performing a convolution that increases the number of channels by a factor of N on the input feature map, and performing batch normalization on the convolution result, where N is a natural number; and the channel-reducing convolution processing includes: performing a convolution that reduces the number of channels to 1/N on the input feature map, and performing batch normalization on the convolution result.
In this embodiment of the present disclosure, the convolution module 502 may be further configured to perform depth-by-depth convolution processing on the first feature map, and perform batch normalization processing on the depth-by-depth convolution result to obtain a second feature map.
The embodiment of the disclosure provides a lightweight attention circulation module designed specifically for face recognition. Its convolution design and its linear and nonlinear mappings follow two principles: first, reduce network parameters, save computation and increase running speed; second, perform more effective feature fusion in the spatial and channel dimensions, enhance feature expression capability, and extract more discriminative face features.
The base channel number of the attention circulation module in the embodiment of the present disclosure can be set to 64; the tensor dimensions of the module are thus very low, the convolution cost on low-dimensional tensors is very small, and a relatively high overall running speed can be achieved. However, if the whole network extracted features only in a low-dimensional space, the information could be incomplete and the features unreliable. In the embodiment of the disclosure, the number of channels is therefore expanded by a set expansion coefficient in the intermediate convolution, which improves the feature extraction capability of the whole module and strikes a delicate balance between computation and feature expression capability.
In the embodiment of the disclosure, through the combination of different types of convolution, expansion and compression of the channel number, and the attention circulation technique, the attention circulation module lets the attention flow relevant to the face recognition task circulate and transform between space and channels, making feature fusion more efficient and finally focusing the feature map effectively on the regions of interest for face recognition. In addition, the module has few parameters, a small computation cost and high speed.
The technical solution of the embodiment of the disclosure provides a general attention circulation technique that effectively captures attention in space and in channels respectively, improves feature discrimination through a channel-by-channel learnable nonlinear mapping, and can extract effective feature combinations, promoting the circulation of attention across multiple dimensions.
As each functional module of the face recognition apparatus in the exemplary embodiment of the present disclosure corresponds to the step of the exemplary embodiment of the face recognition method, please refer to the embodiment of the face recognition method in the present disclosure for details that are not disclosed in the embodiment of the apparatus in the present disclosure.
According to the face recognition apparatus of the embodiment of the disclosure, feature map processing for face recognition is performed through the combination of convolution processing and attention circulation processing, promoting the circulation of attention across multiple dimensions, so that the final feature map is more discriminative in every dimension and the recognition accuracy of the face recognition model is improved.
Fig. 6 is a schematic diagram of an electronic device 6 provided by an embodiment of the present disclosure. As shown in fig. 6, the electronic apparatus 6 of this embodiment includes: a processor 601, a memory 602, and a computer program 603 stored in the memory 602 and executable on the processor 601. The steps in the various method embodiments described above are implemented when the processor 601 executes the computer program 603. Alternatively, the processor 601 implements the functions of the respective modules in the above-described respective apparatus embodiments when executing the computer program 603.
The electronic device 6 may be a desktop computer, a notebook, a palm computer, a cloud server, or another electronic device. The electronic device 6 may include, but is not limited to, the processor 601 and the memory 602. Those skilled in the art will appreciate that fig. 6 is merely an example of the electronic device 6 and does not constitute a limitation of it; the device may include more or fewer components than those shown, or different components.
The Processor 601 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like.
The memory 602 may be an internal storage unit of the electronic device 6, for example, a hard disk or memory of the electronic device 6. The memory 602 may also be an external storage device of the electronic device 6, for example, a plug-in hard disk provided on the electronic device 6, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like. The memory 602 may also include both internal and external storage units of the electronic device 6. The memory 602 is used for storing computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods in the above embodiments through a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above method embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the contents of the computer-readable medium may be appropriately added to or deleted from according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (10)

1. A face recognition method, comprising:
acquiring a first feature map of a face image to be recognized;
carrying out depth-by-depth convolution processing on the first feature map to obtain a second feature map;
performing attention circulation processing on the second feature map to obtain a third feature map;
and sequentially performing convolution processing of increasing channels, attention circulation processing, convolution processing of reducing channels and attention circulation processing on the third feature map to obtain a target feature map corresponding to the first feature map.
2. The method of claim 1, wherein the attention diversion process comprises:
flattening a first dimension and a second dimension of the input feature map to obtain a first intermediate feature map;
acquiring a second intermediate feature map according to the first intermediate feature map and a first learnable parameter matrix;
acquiring a spatial attention feature map according to the product of the second intermediate feature map and the input feature map;
acquiring a channel attention feature map according to a second learnable parameter matrix, a third learnable parameter matrix and the spatial attention feature map, wherein a first dimension of the second learnable parameter matrix is equal to a second dimension of the third learnable parameter matrix, and the first dimension of the third learnable parameter matrix is equal to the second dimension of the second learnable parameter matrix;
and acquiring an attention circulation feature map according to the spatial attention feature map and the channel attention feature map.
3. The method of claim 2, wherein obtaining an attention flow profile from the spatial attention profile and the channel attention profile comprises:
carrying out nonlinear mapping processing on the spatial attention feature map to obtain a third intermediate feature map;
obtaining a fourth intermediate feature map according to the product of the third intermediate feature map and the channel attention feature map;
and carrying out the nonlinear mapping processing on the fourth intermediate feature map to obtain the attention circulation feature map.
4. The method of claim 2, wherein obtaining a second intermediate feature map from the first intermediate feature map and a first learnable parameter matrix comprises:
obtaining a first product of the first intermediate feature map and a logistic regression function value thereof;
and acquiring the second intermediate feature map according to the mean value of the first product.
5. The method of claim 2, wherein acquiring the channel attention feature map according to the second learnable parameter matrix, the third learnable parameter matrix and the spatial attention feature map comprises:
multiplying the spatial attention feature map by the second learnable parameter matrix to obtain a second product;
and performing sparsification processing on the second product and right-multiplying the result by the third learnable parameter matrix to obtain the channel attention feature map.
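For illustration only, claim 5 can be sketched as two matrix products with a sparsifying step between them. The claim does not define the "sparsification processing"; zeroing negative entries (ReLU) is assumed here. With W2 of shape (c, r) and W3 of shape (r, c), claim 2's dimension constraint is satisfied and the output keeps the input's channel width.

```python
import numpy as np

def sparsify(x):
    # assumed sparsification: zero out negative entries (ReLU-style)
    return np.maximum(x, 0.0)

def channel_attention(spatial_attn, w2, w3):
    # Claim 5: multiply by W2, sparsify, then right-multiply by W3.
    second_product = spatial_attn @ w2
    return sparsify(second_product) @ w3
```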
6. The method of claim 1, wherein
the channel-increasing convolution processing comprises: performing convolution processing that increases the number of channels by a factor of N on the input feature map, and performing batch normalization processing on the convolution result, where N is a natural number;
the channel-reducing convolution processing comprises: performing convolution processing that reduces the number of channels to 1/N on the input feature map, and performing batch normalization processing on the convolution result.
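For illustration only, claim 6's channel-increasing and channel-reducing convolutions can be sketched with 1x1 (pointwise) convolutions followed by batch normalization. The kernel size is not fixed by the claim, so 1x1 is an assumption; random weights stand in for learned parameters, and an inference-style batch normalization without learned scale/shift is used.

```python
import numpy as np

def conv1x1(x, weight):
    # x: (C_in, H, W); weight: (C_out, C_in). A 1x1 convolution is a
    # per-pixel linear map over the channel axis.
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(weight.shape[0], h, w)

def batch_norm(x, eps=1e-5):
    # per-channel normalization, inference-style, no learned scale/shift
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def expand_channels(x, n):
    # claim 6: channel count C -> N*C, then batch normalization
    c = x.shape[0]
    w = np.random.default_rng(0).standard_normal((c * n, c))
    return batch_norm(conv1x1(x, w))

def reduce_channels(x, n):
    # claim 6: channel count C -> C/N, then batch normalization
    c = x.shape[0]
    w = np.random.default_rng(1).standard_normal((c // n, c))
    return batch_norm(conv1x1(x, w))
```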
7. The method of claim 6, wherein obtaining the second feature map according to the depth-wise convolution processing of the first feature map comprises:
performing depth-wise convolution processing on the first feature map, and performing batch normalization processing on the depth-wise convolution result to obtain the second feature map.
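For illustration only, a depth-wise convolution applies one kernel per channel with no cross-channel mixing. This sketch uses 'valid' padding and the cross-correlation convention common in deep learning; real implementations usually pad to preserve spatial size.

```python
import numpy as np

def depthwise_conv(x, kernels):
    # x: (C, H, W); kernels: (C, k, k) -- one kernel per channel,
    # no cross-channel mixing ('valid' padding for brevity).
    c, h, w = x.shape
    k = kernels.shape[-1]
    out = np.zeros((c, h - k + 1, w - k + 1))
    for ch in range(c):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[ch, i, j] = np.sum(x[ch, i:i + k, j:j + k] * kernels[ch])
    return out
```

Per claim 7, the result would then be batch-normalized to obtain the second feature map.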
8. An apparatus for face recognition, the apparatus comprising:
an acquisition module, configured to acquire a first feature map of a face image to be recognized;
a convolution module, configured to perform depth-wise convolution processing on the first feature map to obtain a second feature map;
an attention flow module, configured to perform attention flow processing on the second feature map to obtain a third feature map;
and a mixed processing module, configured to sequentially perform channel-increasing convolution processing, attention flow processing, channel-reducing convolution processing and attention flow processing on the third feature map to obtain a target feature map corresponding to the first feature map.
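For illustration only, the module pipeline of claim 8 can be sketched as a class whose stages are injected as callables. The class and parameter names are illustrative, not from the patent; each stand-in would be replaced by the learned convolutions and the attention flow processing of claims 2 to 7.

```python
class FaceFeatureExtractor:
    # Sketch of the claimed apparatus; each module is an injected
    # callable stand-in. Names here are illustrative only.
    def __init__(self, depthwise, attention_flow, expand_conv, reduce_conv):
        self.depthwise = depthwise            # convolution module
        self.attention_flow = attention_flow  # attention flow module
        self.expand_conv = expand_conv        # channel-increasing convolution
        self.reduce_conv = reduce_conv        # channel-reducing convolution

    def forward(self, first_feature_map):
        second = self.depthwise(first_feature_map)   # convolution module
        third = self.attention_flow(second)          # attention flow module
        # mixed processing module: expand -> attention -> reduce -> attention
        x = self.expand_conv(third)
        x = self.attention_flow(x)
        x = self.reduce_conv(x)
        return self.attention_flow(x)                # target feature map
```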
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202211055844.7A 2022-08-31 2022-08-31 Face recognition method and device Pending CN115588218A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211055844.7A CN115588218A (en) 2022-08-31 2022-08-31 Face recognition method and device
PCT/CN2022/129343 WO2024045320A1 (en) 2022-08-31 2022-11-02 Facial recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211055844.7A CN115588218A (en) 2022-08-31 2022-08-31 Face recognition method and device

Publications (1)

Publication Number Publication Date
CN115588218A true CN115588218A (en) 2023-01-10

Family

ID=84772610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211055844.7A Pending CN115588218A (en) 2022-08-31 2022-08-31 Face recognition method and device

Country Status (2)

Country Link
CN (1) CN115588218A (en)
WO (1) WO2024045320A1 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679085B2 (en) * 2017-10-31 2020-06-09 University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image
CN111582044B (en) * 2020-04-15 2023-06-20 华南理工大学 Face recognition method based on convolutional neural network and attention model
CN112766279B (en) * 2020-12-31 2023-04-07 中国船舶重工集团公司第七0九研究所 Image feature extraction method based on combined attention mechanism
CN114782403A (en) * 2022-05-17 2022-07-22 河南大学 Pneumonia image detection method and device based on mixed space and inter-channel attention

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894058A (en) * 2024-03-14 2024-04-16 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement
CN117894058B (en) * 2024-03-14 2024-05-24 山东远桥信息科技有限公司 Smart city camera face recognition method based on attention enhancement

Also Published As

Publication number Publication date
WO2024045320A1 (en) 2024-03-07

Similar Documents

Publication Publication Date Title
CN110263909B (en) Image recognition method and device
CN111369427B (en) Image processing method, image processing device, readable medium and electronic equipment
CN109214543B (en) Data processing method and device
CN111539353A (en) Image scene recognition method and device, computer equipment and storage medium
CN110490295B (en) Data processing method and processing device
CN115588218A (en) Face recognition method and device
CN114692085B (en) Feature extraction method and device, storage medium and electronic equipment
CN113033580A (en) Image processing method, image processing device, storage medium and electronic equipment
CN110717405B (en) Face feature point positioning method, device, medium and electronic equipment
CN114612531B (en) Image processing method and device, electronic equipment and storage medium
US20230281956A1 (en) Method for generating objective function, apparatus, electronic device and computer readable medium
CN113284206A (en) Information acquisition method and device, computer readable storage medium and electronic equipment
CN117373064A (en) Human body posture estimation method based on self-adaptive cross-dimension weighting, computer equipment and storage medium
CN112862095A (en) Self-distillation learning method and device based on characteristic analysis and readable storage medium
CN112330671A (en) Method and device for analyzing cell distribution state, computer equipment and storage medium
CN114700957B (en) Robot control method and device with low computational power requirement of model
CN115953803A (en) Training method and device for human body recognition model
CN115965520A (en) Special effect prop, special effect image generation method, device, equipment and storage medium
CN113139490B (en) Image feature matching method and device, computer equipment and storage medium
US20210224632A1 (en) Methods, devices, chips, electronic apparatuses, and storage media for processing data
CN116310615A (en) Image processing method, device, equipment and medium
CN116912631B (en) Target identification method, device, electronic equipment and storage medium
CN115630182A (en) Method and device for determining representative pictures in picture set
CN116910566B (en) Target recognition model training method and device
CN114708625A (en) Face recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination