CN116563909A - Human face recognition method of visual semantic interaction module based on fusion attention mechanism


Info

Publication number
CN116563909A (application CN202310243882.3A)
Authority
CN
China
Prior art keywords: face recognition, network model, visual, module, data set
Prior art date: 2023-03-15
Legal status
Pending
Application number
CN202310243882.3A
Other languages
Chinese (zh)
Inventor
庞志刚
王波
杨巨成
王伟
国英龙
陈燕
贾智洋
孙笑
徐振宇
魏峰
赵婷婷
王嫄
潘旭冉
Current Assignee
Baotou Yihui Information Technology Co ltd
Original Assignee
Baotou Yihui Information Technology Co ltd
Priority date: 2023-03-15
Filing date: 2023-03-15
Publication date: 2023-08-08
Application filed by Baotou Yihui Information Technology Co ltd
Priority to CN202310243882.3A
Publication of CN116563909A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a face recognition method based on a visual semantic interaction module with a fused attention mechanism, comprising the following steps. S1: acquire a face dataset containing text descriptions. S2: construct a knowledge-guided face recognition network model using a visual semantic interaction module that fuses an attention mechanism. S3: initialize the face recognition network model built in step S2, select an optimizer, and set the network training parameters. S4: optimize the face recognition network model with the loss functions and save it. S5: load the optimal face recognition network model produced during training, acquire a test dataset, feed it into the network model, and generate the corresponding face recognition results. The method extracts visual features and text features separately, extracts visual knowledge more effectively through visual semantic interaction, and improves the accuracy of face recognition.

Description

Human face recognition method of visual semantic interaction module based on fusion attention mechanism
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a face recognition method based on a visual semantic interaction module with a fused attention mechanism.
Background
In today's information age, accurately verifying personal identity and protecting information security has become a pressing social problem. Traditional identity credentials are easily forged or lost and increasingly fail to meet social needs; biometric recognition is currently the most convenient and secure solution. Compared with other biometric technologies such as fingerprint and iris recognition, face recognition is intuitive, contactless, easy to capture, and highly interactive and extensible, which has made it a very active research field.
Deep-learning-based face recognition is currently a flourishing research area. The pipeline mainly comprises face preprocessing, feature learning, and feature comparison. Feature learning is the key to face recognition, and recognition accuracy depends mainly on the network architecture and loss function adopted. Mainstream architectures include VGGNet, GoogLeNet, ResNet, and the like, and choosing a suitable loss function helps separate face images of different classes in feature space, improving recognition accuracy. Many scholars have therefore studied loss functions for face recognition, for example the ArcFace loss proposed by Deng et al. in 2019, the GroupFace loss proposed by Yonghyun Kim et al. in 2020, and the MagFace loss proposed by Meng et al. in 2021, which have significantly improved face recognition performance.
Although deep-learning-based face recognition has achieved significant improvements, challenges remain. First, performance under various extreme conditions is still unsatisfactory, for example under large age spans, large pose variation, extreme illumination, and low resolution. Second, training CNNs effectively requires large amounts of data, and once trained, a deep neural network model becomes an end-to-end regularized mapping function lacking interpretability and transparency. Driven by data alone and lacking the guidance of 'knowledge', a deep learning model cannot expose its decision-making reasoning during recognition, and its effectiveness and robustness still need improvement.
Introducing knowledge can effectively alleviate the problems faced by data-driven models, increasing model interpretability, reducing dependence on data volume, and improving robustness. Knowledge divides into explicit and implicit knowledge. Explicit knowledge generally includes word-sense interpretations, semantic relations, and the like from knowledge bases such as semantic dictionaries, semantic networks, and knowledge graphs, and is commonly called common-sense or world knowledge. Implicit knowledge refers to knowledge that is hard to state explicitly but benefits model understanding, such as sample properties, contextual information, causal relations, and mined features used during training; it generally takes the form of domain regularities and must be generated and applied in a specific scenario.
Some research has explored knowledge-guided deep learning models. For example, an approach proposed in 2017 combined images with text information, using encyclopedic text as external auxiliary information: the visual stream was a standard deep convolutional neural network, the text stream extracted text features through a combination of CNN and RNN and matched them against the visual features, and the two streams jointly drove fine-grained image classification. Face recognition has been a research hot spot at home and abroad in recent years; although researchers have proposed a large number of face recognition methods, knowledge-guided face recognition remains little studied and requires continued exploration and refinement.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a face recognition method based on a visual semantic interaction module with a fused attention mechanism, so as to improve the accuracy and robustness of face recognition systems.
To achieve the above purpose, the technical solution of the invention is realized as follows:
A face recognition method based on a visual semantic interaction module with a fused attention mechanism comprises the following steps:
S1: acquire a face dataset containing text descriptions;
S2: construct a knowledge-guided face recognition network model using a visual semantic interaction module that fuses an attention mechanism, the model comprising three parts: text feature extraction, visual feature extraction, and visual semantic interaction;
S3: initialize the face recognition network model built in step S2, select an optimizer, and set the network training parameters;
S4: optimize the face recognition network model with the loss functions and save it;
S5: load the optimal face recognition network model produced during training, acquire a test dataset, feed it into the network model, and generate the corresponding face recognition results.
Further, in step S1 a training set and a test set are divided from the large-scale face image dataset Multi-Modal-CelebA-HQ; this dataset selects high-resolution face images from the CelebA dataset, each image having a corresponding text description.
Further, in step S2 the text feature extraction uses a long short-term memory (LSTM) network, the visual feature extraction uses a residual network (ResNet), and the visual semantic interaction consists of a modality transfer and several attention modules.
Further, step S3 builds the network model with the PyTorch framework, initializes the network model weights, selects a stochastic gradient descent (SGD) optimizer for training, and initializes the learning rate.
Further, in step S4 the visual feature extraction part adopts the MagFace loss function and the text feature extraction part adopts the cross-entropy loss function, the model being jointly optimized through the two loss functions.
The invention also provides a face recognition device based on a visual semantic interaction module with a fused attention mechanism, comprising:
a dataset module, which acquires a face dataset containing text descriptions;
a construction module, which constructs a knowledge-guided face recognition network model using a visual semantic interaction module that fuses an attention mechanism, the model comprising a text feature extraction module, a visual feature extraction module, and a visual semantic interaction module;
an initialization module, which initializes the constructed face recognition network model, selects an optimizer, and sets the network training parameters;
a loss function module, which optimizes the face recognition network model with the loss functions and saves it;
a recognition module, which loads the optimal face recognition network model produced during training, acquires a test dataset, feeds it into the network model, and generates the corresponding face recognition results.
Further, the dataset module divides a training set and a test set from the large-scale face image dataset Multi-Modal-CelebA-HQ; this dataset selects high-resolution face images from the CelebA dataset, each image having a corresponding text description.
Further, the text feature extraction module adopts a long short-term memory (LSTM) network, the visual feature extraction module adopts a residual network (ResNet), and the visual semantic interaction module consists of a modality transfer module and several attention modules.
Further, the initialization module builds the network model with the PyTorch framework, initializes the network model weights, selects a stochastic gradient descent (SGD) optimizer for training, and initializes the learning rate.
Further, the loss function module adopts the MagFace loss function in the visual feature extraction part and the cross-entropy loss function in the text feature extraction part, the model being jointly optimized through the two loss functions.
Compared with the prior art, the invention has the following beneficial effects:
1. The proposed face recognition method based on a visual semantic interaction module with a fused attention mechanism extracts visual features and text features separately, extracts visual knowledge more effectively through visual semantic interaction, and improves the accuracy of face recognition.
2. The proposed visual semantic interaction is based on the attention mechanism: visual knowledge is extracted piecewise by multiple attention units, and the outputs are averaged to obtain text-guided visual information, which further improves the accuracy and robustness of the face recognition system.
Drawings
Fig. 1 is a flowchart of a face recognition method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the overall structure of face recognition according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an attention network of an embodiment of the present invention.
Detailed Description
It should be noted that, in the absence of conflict, the embodiments of the invention and the features in the embodiments may be combined with each other.
To make the objects and features of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings. Note that the drawings are highly simplified and drawn to imprecise proportions, serving only to illustrate the embodiments conveniently and clearly.
Fig. 1 shows the implementation flow of the face recognition method based on a visual semantic interaction module with a fused attention mechanism, which comprises the following steps:
step 1: a face dataset is obtained that contains a textual description.
A dataset Multi-Modal-CelebA-HQ was used, which was 30000 high resolution face images selected from the CelebA dataset, each image having a corresponding textual description. The training set and the test set are divided according to the ratio of 7:3, wherein the training set comprises 21000 face images, the test set comprises 9000 face images, and 10 text descriptions are selected for each image.
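As a minimal sketch of this split (the in-memory layout and file names are assumptions, since the patent does not specify a storage format):

```python
import random

random.seed(0)

# assumed in-memory index: one image path plus its 10 caption strings
# (illustrative stand-ins; the real dataset ships images and caption files)
samples = [(f"img_{i:05d}.jpg", [f"caption {j} of image {i}" for j in range(10)])
           for i in range(30000)]

random.shuffle(samples)
train_set, test_set = samples[:21000], samples[21000:]  # 7:3 split
print(len(train_set), len(test_set))                    # 21000 9000
```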
Step 2: construct a knowledge-guided face recognition model using a visual semantic interaction module that fuses an attention mechanism.
As shown in Fig. 2, the knowledge-guided face recognition model is constructed from three main parts: a text feature extraction module, a visual feature extraction module, and a visual semantic interaction module. The text feature extraction module uses a long short-term memory (LSTM) network to extract text features; such a network can learn long-range dependencies and handles sequential tasks well. A discard rate r is then defined as the probability of randomly discarding real samples, which lets the network randomly drop some information during training and improves generalization.
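A minimal sketch of such a text encoder (the vocabulary size, embedding dimension, and hidden dimension are assumptions, and realizing the discard rate r as dropout is an interpretation for illustration):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512, r=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.drop = nn.Dropout(p=r)  # discard rate r: randomly drop information

    def forward(self, tokens):                # tokens: (batch, seq_len) int64
        h, _ = self.lstm(self.embed(tokens))  # per-token hidden states
        return self.drop(h)                   # text feature matrix S

S = TextEncoder()(torch.randint(0, 10000, (4, 20)))
print(S.shape)  # torch.Size([4, 20, 512])
```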
The visual feature extraction module first performs face detection and face alignment on the input image, then extracts visual features with a residual network (ResNet-100). The very deep structure improves the representational capacity of the network, while the residual blocks prevent vanishing or exploding gradients.
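A sketch of the visual backbone; torchvision provides no ResNet-100, so resnet50 stands in here purely as an assumption for illustration:

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()      # strip the classifier head, keep features

aligned = torch.randn(8, 3, 112, 112)  # a batch of detected and aligned faces
V_global = backbone(aligned)           # global visual features, shape (8, 2048)
```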
The visual semantic interaction module consists of a modality transfer and several attention modules. The modality transfer function is implemented by a two-layer fully connected network with trainable parameters; given the global visual features of an image, the function approximately produces a simulated text encoding of the image content, enabling asynchronous training and testing behavior. Once the model is trained, if only an image is input, the text encoding of the image content can be simulated through the modality transfer function, so recognition can proceed regardless of text availability.
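A sketch of this modality transfer function as described, a two-layer fully connected network (the dimensions are assumptions):

```python
import torch.nn as nn

class ModalityTransfer(nn.Module):
    """Maps a global visual feature to a simulated text encoding,
    so the model can run image-only at test time."""
    def __init__(self, vis_dim=2048, hidden_dim=1024, txt_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, hidden_dim),   # first trainable layer
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, txt_dim),   # second trainable layer
        )

    def forward(self, v_global):
        return self.net(v_global)             # simulated text encoding
```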
The scaled dot-product attention mechanism is defined as equation (1):

$$A(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

where Q is the query matrix (Query), K is the key matrix (Key), K^T is the transpose of K, V is the value matrix (Value), d_k is the dimension of the key vectors, and the scaling factor 1/sqrt(d_k) prevents the products from becoming too large.
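Equation (1) is the standard scaled dot-product attention and translates directly into code; a minimal sketch:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V           # weights times values

A = scaled_dot_product_attention(torch.randn(4, 8, 64),
                                 torch.randn(4, 8, 64),
                                 torch.randn(4, 8, 64))
print(A.shape)  # torch.Size([4, 8, 64])
```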
The attention-unit network of the visual semantic interaction module is shown in Fig. 3. Inspired by the scaled dot-product attention mechanism, it is defined as equation (2):

$$f = A\left(S^{T}W_q,\; V^{T}W_k,\; V^{T}W_v\right)W_f \tag{2}$$

where V is the matrix of visual features of a face image and V^T its transpose; S is the matrix of text features of the corresponding sentence description and S^T its transpose; and W_q, W_k, W_v are learnable parameter matrices that map S and V into three different vector spaces Q, K, V, respectively. The query matrix Q and the key matrix K undergo the scaled dot-product operation to produce scores; a softmax over the scores yields a weight-distribution matrix, which is multiplied by the value matrix V to obtain the output matrix A. W_f is a learnable parameter matrix that maps the output A back to the original dimension, producing the fused feature f. Passing the inputs through n attention networks yields n fused features f_1, ..., f_n (n is an empirically chosen factor), and these n fused features are then averaged to obtain the fused feature vector F, defined as equation (3):

$$F = \delta\left(\frac{1}{n}\sum_{i=1}^{n} f_i\right) \tag{3}$$

where f_1, ..., f_n are the outputs of the n attention units and δ denotes the global average pooling operation, finally yielding a fused feature vector F of dimension 2.
After the fused feature vector F is obtained, it is input into a classifier formed by a two-layer fully connected neural network to obtain the final output result.
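Putting equations (2) and (3) together, a sketch of one attention unit and the n-way fusion, reusing the scaled_dot_product_attention sketch above (all dimensions, the value of n, and the output size of 2 are assumptions, with δ taken as averaging as described):

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """One unit of equation (2): f = A(S^T Wq, V^T Wk, V^T Wv) Wf."""
    def __init__(self, txt_dim=512, vis_dim=2048, d=256):
        super().__init__()
        self.Wq = nn.Linear(txt_dim, d, bias=False)
        self.Wk = nn.Linear(vis_dim, d, bias=False)
        self.Wv = nn.Linear(vis_dim, d, bias=False)
        self.Wf = nn.Linear(d, vis_dim, bias=False)  # map back to original dim

    def forward(self, S, V):       # S: (b, ls, txt_dim), V: (b, lv, vis_dim)
        A = scaled_dot_product_attention(self.Wq(S), self.Wk(V), self.Wv(V))
        return self.Wf(A)          # fused feature f

class VisualSemanticInteraction(nn.Module):
    def __init__(self, n=4, num_classes=2):     # n is chosen empirically
        super().__init__()
        self.units = nn.ModuleList(AttentionUnit() for _ in range(n))
        self.classifier = nn.Sequential(         # two-layer FC classifier
            nn.Linear(2048, 512), nn.ReLU(inplace=True),
            nn.Linear(512, num_classes))

    def forward(self, S, V):
        F_fused = torch.stack([u(S, V) for u in self.units]).mean(0)  # eq. (3)
        return self.classifier(F_fused.mean(dim=1))  # pool tokens, classify
```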
Step 3: initialize the network model, select an optimizer, and set the network training parameters.
The network model is built with the PyTorch framework and its weights are initialized; a stochastic gradient descent (SGD) optimizer is selected for training, the mini-batch size is set to 64, and the learning rate is decayed dynamically from 0.001 to 0.00015.
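A sketch of this setup; the momentum value and the exact decay schedule are assumptions, since the patent gives only the endpoints 0.001 and 0.00015:

```python
import torch

model = VisualSemanticInteraction()          # from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# decay the learning rate from 0.001 toward 0.00015 over (assumed) 100 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1.5e-4)
```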
Step 4: optimize and save the network model using the loss functions.
The visual feature extraction module adopts the MagFace loss function, which introduces an adaptive mechanism: pulling high-quality samples toward the class center and pushing low-quality samples away strengthens intra-class compactness and thereby improves face recognition capability. The MagFace loss is defined as equation (4):

$$L_{Mag} = \frac{1}{N}\sum_{i=1}^{N}\left(-\log\frac{e^{s\cos(\theta_{y_i}+m(a_i))}}{e^{s\cos(\theta_{y_i}+m(a_i))}+\sum_{j\neq y_i} e^{s\cos\theta_j}} + \lambda_g\, g(a_i)\right) \tag{4}$$

where N is the number of face samples in a training batch and the hyperparameter λ_g balances the classification loss against the regularization loss. In the classification loss, s is the scaling parameter, θ_{y_i} is the angle between the weight W_{y_i} and the feature x_i, and m(a_i) is an angular-margin penalty function whose boundary adjusts dynamically with the feature magnitude. g(a_i) is a regularization function that pushes low-quality samples toward the boundary of the feasible region and pulls high-quality samples toward the class center.
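A minimal sketch of this loss, following the published MagFace formulation (Meng et al., 2021); the hyperparameter values below come from that paper, not the patent, and are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagFaceLoss(nn.Module):
    def __init__(self, feat_dim=512, num_classes=1000, s=64.0,
                 l_a=10.0, u_a=110.0, l_m=0.45, u_m=0.8, lambda_g=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.l_a, self.u_a = s, l_a, u_a
        self.l_m, self.u_m, self.lambda_g = l_m, u_m, lambda_g

    def forward(self, x, labels):
        # a_i: feature magnitude, clamped to the feasible interval [l_a, u_a]
        a = x.norm(dim=1, keepdim=True).clamp(self.l_a, self.u_a)
        # m(a_i): angular margin growing linearly with the magnitude
        m = (self.u_m - self.l_m) / (self.u_a - self.l_a) * (a - self.l_a) + self.l_m
        # g(a_i): regularizer pulling high-quality (large-magnitude) samples inward
        g = 1.0 / a + a / (self.u_a ** 2)
        # cos(theta_j) between the normalized feature and every class weight
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, self.weight.size(0)).float()
        # add the margin m(a_i) only on the target-class angle
        logits = self.s * (one_hot * torch.cos(theta + m) + (1 - one_hot) * cos)
        return F.cross_entropy(logits, labels) + self.lambda_g * g.mean()

loss = MagFaceLoss()(torch.randn(8, 512), torch.randint(0, 1000, (8,)))
```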
The text feature extraction module adopts the cross-entropy loss function, defined as equation (5):

$$L_{PE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\log p_i + (1-y_i)\log(1-p_i)\,\right] \tag{5}$$

where N is the number of samples, y_i is the label of sample i (1 for the positive class, 0 for the negative class), and p_i is the predicted probability that sample i is positive.
This step optimizes the model shown in Fig. 2 through the two loss functions jointly; the total loss is defined as equation (6):

$$L = L_{Mag} + \lambda L_{PE} \tag{6}$$

where λ is a hyperparameter balancing the two terms.
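A sketch of one joint optimization step for equation (6), assuming the modules sketched above and a hypothetical value of λ:

```python
import torch
import torch.nn.functional as F

lam = 0.5                   # hyperparameter lambda of eq. (6); assumed value
mag_loss = MagFaceLoss()    # from the sketch above

def training_step(v_feats, labels, p_text, y_text):
    l_mag = mag_loss(v_feats, labels)              # visual branch, eq. (4)
    l_pe = F.binary_cross_entropy(p_text, y_text)  # text branch, eq. (5)
    return l_mag + lam * l_pe                      # total loss, eq. (6)
```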
Step 5: load the optimal network model produced during training, acquire the test dataset, feed it into the network model, and generate the corresponding face recognition results.
The test dataset comprises the last 9000 images and their texts in the Multi-Modal-CelebA-HQ dataset. During testing, either an image alone or an image-text pair can be input to obtain the corresponding face recognition result.
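A sketch of image-only inference through the modality transfer function (module names follow the sketches above; the input tensor is a stand-in for detected and aligned test faces):

```python
import torch

mt = ModalityTransfer()
interaction = VisualSemanticInteraction()
backbone.eval(); mt.eval(); interaction.eval()

aligned_faces = torch.randn(8, 3, 112, 112)  # stand-in test batch
with torch.no_grad():
    v = backbone(aligned_faces)              # global visual features (8, 2048)
    s_hat = mt(v)                            # simulated text encoding
    # feed the simulated encoding exactly as a real text feature matrix
    logits = interaction(s_hat.unsqueeze(1), v.unsqueeze(1))
```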
Step 6: calculate evaluation indicators to assess the performance of the network model.
From the face recognition results generated in step 5, the evaluation indicators false acceptance rate (FAR) and false rejection rate (FRR) and the ROC curve are calculated to evaluate the performance of the network model.
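A sketch of computing these indicators from verification scores with scikit-learn (the toy labels and scores are placeholders; the pairing protocol is an assumption):

```python
import numpy as np
from sklearn.metrics import roc_curve

# y: 1 if a test pair shares an identity, 0 otherwise; scores: model similarity
y = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1])

fpr, tpr, thresholds = roc_curve(y, scores)  # points of the ROC curve
far = fpr                                    # false acceptance rate (FAR)
frr = 1.0 - tpr                              # false rejection rate (FRR)
eer = far[np.nanargmin(np.abs(far - frr))]   # equal error rate, where FAR ≈ FRR
```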
The foregoing description covers only preferred embodiments of the application and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its protection scope.

Claims (10)

1. A face recognition method based on a visual semantic interaction module with a fused attention mechanism, characterized by comprising the following steps:
S1: acquiring a face dataset containing text descriptions;
S2: constructing a knowledge-guided face recognition network model using a visual semantic interaction module that fuses an attention mechanism, the model comprising three parts: text feature extraction, visual feature extraction, and visual semantic interaction;
S3: initializing the face recognition network model built in step S2, selecting an optimizer, and setting the network training parameters;
S4: optimizing the face recognition network model with the loss functions and saving it;
S5: loading the optimal face recognition network model produced during training, acquiring a test dataset, feeding it into the network model, and generating the corresponding face recognition results.
2. The face recognition method according to claim 1, wherein step S1 divides a training set and a test set from the large-scale face image dataset Multi-Modal-CelebA-HQ, the dataset selecting high-resolution face images from the CelebA dataset, each image having a corresponding text description.
3. The face recognition method according to claim 1, wherein in step S2 the text feature extraction adopts a long short-term memory (LSTM) network, the visual feature extraction adopts a residual network (ResNet), and the visual semantic interaction consists of a modality transfer and several attention modules.
4. The face recognition method according to claim 1, wherein step S3 builds the network model with the PyTorch framework, initializes the network model weights, selects a stochastic gradient descent (SGD) optimizer for training, and initializes the learning rate.
5. The face recognition method according to claim 1, wherein step S4 adopts the MagFace loss function in the visual feature extraction part and the cross-entropy loss function in the text feature extraction part, the model being jointly optimized through the two loss functions.
6. A face recognition device based on a visual semantic interaction module with a fused attention mechanism, characterized by comprising:
a dataset module, which acquires a face dataset containing text descriptions;
a construction module, which constructs a knowledge-guided face recognition network model using a visual semantic interaction module that fuses an attention mechanism, the model comprising a text feature extraction module, a visual feature extraction module, and a visual semantic interaction module;
an initialization module, which initializes the constructed face recognition network model, selects an optimizer, and sets the network training parameters;
a loss function module, which optimizes the face recognition network model with the loss functions and saves it;
a recognition module, which loads the optimal face recognition network model produced during training, acquires a test dataset, feeds it into the network model, and generates the corresponding face recognition results.
7. The face recognition device according to claim 6, wherein the dataset module divides a training set and a test set from the large-scale face image dataset Multi-Modal-CelebA-HQ, the dataset selecting high-resolution face images from the CelebA dataset, each image having a corresponding text description.
8. The face recognition device according to claim 6, wherein the text feature extraction module adopts a long short-term memory (LSTM) network, the visual feature extraction module adopts a residual network (ResNet), and the visual semantic interaction module consists of a modality transfer module and several attention modules.
9. The face recognition device according to claim 6, wherein the initialization module builds the network model with the PyTorch framework, initializes the network model weights, selects a stochastic gradient descent (SGD) optimizer for training, and initializes the learning rate.
10. The face recognition device according to claim 6, wherein the loss function module adopts the MagFace loss function in the visual feature extraction part and the cross-entropy loss function in the text feature extraction part, the model being jointly optimized through the two loss functions.
CN202310243882.3A, filed 2023-03-15 (priority 2023-03-15): Human face recognition method of visual semantic interaction module based on fusion attention mechanism. Status: pending (CN116563909A).

Priority Applications (1)

CN202310243882.3A (priority date 2023-03-15, filing date 2023-03-15): Human face recognition method of visual semantic interaction module based on fusion attention mechanism


Publications (1)

CN116563909A (en), published 2023-08-08

Family

ID=87490502

Family Applications (1)

CN202310243882.3A: CN116563909A (en), pending; Human face recognition method of visual semantic interaction module based on fusion attention mechanism

Country Status (1)

CN: CN116563909A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814706A (en) * 2020-07-14 2020-10-23 电子科技大学 Face recognition and attribute classification method based on multitask convolutional neural network
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN113128369A (en) * 2021-04-01 2021-07-16 重庆邮电大学 Lightweight network facial expression recognition method fusing balance loss
WO2022151535A1 (en) * 2021-01-15 2022-07-21 苏州大学 Deep learning-based face feature point detection method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination