CN117292443A - Multi-mode recognition system and method for fusing human face and palm vein - Google Patents


Info

Publication number
CN117292443A
Authority
CN
China
Prior art keywords
palm vein
feature vector
blood vessel
face
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311241558.4A
Other languages
Chinese (zh)
Other versions
CN117292443B (en)
Inventor
金泽
赵天明
周伟
李臣明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Mingguang Microelectronics Technology Co ltd
Original Assignee
Hangzhou Mingguang Microelectronics Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Mingguang Microelectronics Technology Co ltd filed Critical Hangzhou Mingguang Microelectronics Technology Co ltd
Priority to CN202311241558.4A priority Critical patent/CN117292443B/en
Publication of CN117292443A publication Critical patent/CN117292443A/en
Application granted granted Critical
Publication of CN117292443B publication Critical patent/CN117292443B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12 Fingerprints or palmprints
    • G06V40/1365 Matching; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification


Abstract

A multi-modal recognition system and method fusing the human face and the palm vein are disclosed. First, a face image and a palm vein blood vessel distribution image of a user to be verified are acquired; image features are then extracted from both images to obtain a face feature vector and a palm vein blood vessel distribution feature vector; feature interaction is performed on the two vectors to obtain a multi-modal biometric vector; finally, whether the user to be verified is an authorized user is determined based on the multi-modal biometric vector. In this way, the complementarity of the two biometric traits is fully exploited for multi-modal recognition, avoiding recognition failures caused by environmental or individual changes affecting a single biometric trait.

Description

Multi-mode recognition system and method for fusing human face and palm vein
Technical Field
The present disclosure relates to the field of intelligent door locks and, more particularly, to a multi-modal recognition system fusing the human face and the palm vein, and a method thereof.
Background
Unlike traditional mechanical locks, intelligent door locks are safer, more convenient, and more advanced. Intelligent locks currently on the market include non-biometric types, such as magnetic-card and radio-frequency-card locks, as well as biometric types based on fingerprint, face, or iris recognition; the biometric types offer higher security and greater convenience.
However, intelligent door locks on the market rely on a single biometric trait. Once a face, fingerprint, or other feature has been enrolled, the lock can only be opened against that enrollment; in unexpected situations, such as facial makeup or an injured finger causing the biometric comparison to fail, the lock cannot be opened. An optimized multi-modal recognition scheme is therefore desired.
Disclosure of Invention
In view of this, the present disclosure provides a multi-modal recognition system and method fusing the human face and the palm vein, which make full use of the complementarity of the two biometric traits to perform multi-modal recognition and thereby avoid recognition failures caused by environmental or individual changes affecting a single biometric trait.
According to one aspect of the present disclosure, there is provided a multi-modal recognition method fusing a human face and a palm vein, including:
acquiring a face image and a palm vein blood vessel distribution image of a user to be verified;
extracting image features from the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector;
performing feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-modal biometric vector; and
determining whether the user to be verified is an authorized user based on the multi-modal biometric vector.
According to another aspect of the present disclosure, there is provided a multi-modal recognition system fusing a human face and a palm vein, including:
an image acquisition module for acquiring a face image and a palm vein blood vessel distribution image of a user to be verified;
an image feature extraction module for extracting image features from the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector;
a feature interaction module for performing feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-modal biometric vector; and
a user verification module for determining whether the user to be verified is an authorized user based on the multi-modal biometric vector.
According to embodiments of the present disclosure, a face image and a palm vein blood vessel distribution image of a user to be verified are first acquired; image features are then extracted from both images to obtain a face feature vector and a palm vein blood vessel distribution feature vector; feature interaction is then performed on the two vectors to obtain a multi-modal biometric vector; finally, whether the user to be verified is an authorized user is determined based on the multi-modal biometric vector. In this way, the complementarity of the two biometric traits can be fully exploited for multi-modal recognition, avoiding recognition failures caused by environmental or individual changes affecting a single biometric trait.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a multi-modality recognition method of fusing a human face and a palm vein according to an embodiment of the present disclosure.
Fig. 2 shows an architectural diagram of a multi-modality recognition method of fusing a human face and a palm vein according to an embodiment of the present disclosure.
Fig. 3 shows a flowchart of sub-step S120 of a multi-modality recognition method of fusing a human face and a palm vein according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of sub-step S122 of the multimodal recognition method of merging a face and a palm vein according to an embodiment of the present disclosure.
Fig. 5 shows a flowchart of sub-step S130 of a multi-modality recognition method of fusing a human face and a palm vein according to an embodiment of the present disclosure.
Fig. 6 shows a flowchart of sub-step S131 of the multimodal recognition method of merging a face and a palm vein according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of a multimodal recognition system that fuses a face and a palm vein in accordance with an embodiment of the present disclosure.
Fig. 8 illustrates an application scenario diagram of a multimodal recognition method fusing a human face and a palm vein according to an embodiment of the present disclosure.
Fig. 9 shows a schematic diagram of an identification flow based on a palmar vein identification instruction according to an embodiment of the present disclosure.
Fig. 10 illustrates a schematic diagram of a recognition flow based on face recognition instructions according to an embodiment of the present disclosure.
Fig. 11 illustrates a schematic diagram of a hybrid recognition instruction-based recognition flow according to an embodiment of the present disclosure.
Detailed Description
The following description of the embodiments of the present disclosure is made clearly and completely with reference to the accompanying drawings; the described embodiments are only some, not all, of the embodiments of the disclosure. All other embodiments obtained by one of ordinary skill in the art without undue burden based on the embodiments of the present disclosure also fall within the scope of the present disclosure.
As used in this disclosure and in the claims, the terms "a," "an," and/or "the" do not denote the singular and may include the plural unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
To address the above technical problem, the technical concept of the present disclosure is as follows: based on the user's face information and palm vein information, fully exploit the complementarity of the two biometric traits to perform multi-modal recognition, avoiding recognition failures caused by environmental or individual changes affecting a single biometric trait.
Based on this, fig. 1 shows a flowchart of a multi-modal recognition method fusing a face and a palm vein according to an embodiment of the present disclosure, and fig. 2 shows an architectural diagram of the method. As shown in figs. 1 and 2, the multi-modal recognition method fusing a human face and a palm vein according to an embodiment of the present disclosure includes the steps of: S110, acquiring a face image and a palm vein blood vessel distribution image of a user to be verified; S120, extracting image features from the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector; S130, performing feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-modal biometric vector; and S140, determining whether the user to be verified is an authorized user based on the multi-modal biometric vector.
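The four steps S110 to S140 can be sketched as a minimal NumPy pipeline. All function names, image sizes, and the trivial stand-in extractors and decision rule below are illustrative assumptions, not the patent's actual implementation:

```python
# Hypothetical sketch of the four-step pipeline (S110-S140).
import numpy as np

rng = np.random.default_rng(0)

def acquire_images():
    # S110: stand-ins for a captured face image and palm-vein image.
    face_img = rng.random((112, 112))
    vein_img = rng.random((112, 112))
    return face_img, vein_img

def extract_features(face_img, vein_img):
    # S120: placeholder extractors producing fixed-length feature vectors.
    face_vec = face_img.mean(axis=0)   # (112,)
    vein_vec = vein_img.mean(axis=1)   # (112,)
    return face_vec, vein_vec

def fuse_features(face_vec, vein_vec):
    # S130: trivial interaction stand-in; the patent uses bidirectional attention.
    return np.concatenate([face_vec, vein_vec])

def verify(fused, threshold=0.0):
    # S140: placeholder decision; a real system would use a trained classifier.
    score = float(fused.mean())
    return score > threshold

face_img, vein_img = acquire_images()
face_vec, vein_vec = extract_features(face_img, vein_img)
fused = fuse_features(face_vec, vein_vec)
authorized = verify(fused)
```

The later sub-steps (S121/S122, S1311-S1317) refine the extractor and fusion stand-ins above.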
Specifically, in the technical scheme of the present disclosure, a face image and a palm vein blood vessel distribution image of the user to be verified are first acquired. Image features are then extracted from the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector.
It should be appreciated that both the face image and the palm vein blood vessel distribution image carry unique image features. Specifically, the face image contains feature information about the facial contour, including the shape and position of the forehead, eyebrows, eyes, nose, mouth, and other parts; it also contains eye features, including eye size, shape, position, orbit, and corners. The palm vein blood vessel distribution image contains feature information about the vascular network.
In a specific example of the present disclosure, as shown in fig. 3, the encoding process of extracting image features from the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector includes: S121, passing the face image through a face feature extractor based on a convolutional neural network model to obtain the face feature vector; and S122, passing the palm vein blood vessel distribution image through a palm vein blood vessel distribution feature extractor based on a ViT model to obtain the palm vein blood vessel distribution feature vector. The purpose of step S121 is to process the face image with the CNN-based face feature extractor to obtain a face feature vector. A convolutional neural network (CNN) is a deep learning model specialized for image processing tasks. By passing image data through convolution layers, pooling layers, and fully connected layers, a CNN learns to extract features in the image, such as edges, textures, and shapes. In this step, the face feature extractor learns features in the face image and generates a feature vector representing the face. This feature vector captures the uniqueness and important characteristics of the face and is used for subsequent face recognition and face comparison tasks. The purpose of step S122 is to process the palm vein blood vessel distribution image with the ViT-based palm vein blood vessel distribution feature extractor to obtain a palm vein blood vessel distribution feature vector. ViT (Vision Transformer) is an image processing model based on the Transformer architecture. A conventional convolutional neural network processes an image by dividing it into small blocks and processing each block.
The ViT model, by contrast, takes the entire image as an input sequence and uses a self-attention mechanism to learn features in the image. In this step, the palm vein blood vessel distribution feature extractor learns features in the palm vein blood vessel distribution image and generates a feature vector representing the palm vein blood vessel distribution. This feature vector can be used for palm vein recognition, palm vein comparison, and similar tasks. Through these two steps, the face image and the palm vein blood vessel distribution image are converted into corresponding feature vectors, which serve as unique identifiers of the face and the palm veins for subsequent applications such as identity verification and recognition.
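The CNN-style feature extraction of step S121 can be illustrated with a single convolution-ReLU-pooling stage in plain NumPy. The image size, the Sobel-like kernel, and the single-layer depth are illustrative assumptions; a real face feature extractor would stack many learned layers:

```python
# Minimal sketch of one CNN stage: convolution + ReLU + max pooling + flatten.
import numpy as np

def conv2d(image, kernel):
    # Valid-mode 2-D convolution (cross-correlation, as in CNN frameworks).
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    # Non-overlapping max pooling.
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(1)
face = rng.random((8, 8))                     # stand-in face image
edge_kernel = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])       # Sobel-like edge detector
fmap = np.maximum(conv2d(face, edge_kernel), 0.0)   # convolution + ReLU
feature_vector = max_pool(fmap).ravel()             # pooling + flatten
```

Stacking such stages and ending with a fully connected layer yields the face feature vector described above.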
More specifically, in an embodiment of the present disclosure, as shown in fig. 4, passing the palm vein blood vessel distribution image through the ViT-based palm vein blood vessel distribution feature extractor to obtain the palm vein blood vessel distribution feature vector includes: S1221, performing image blocking on the palm vein blood vessel distribution image to obtain a sequence of palm vein blood vessel distribution image blocks; S1222, using the embedding layer of the ViT-based palm vein blood vessel distribution feature extractor to embed-encode each palm vein blood vessel distribution image block in the sequence, so as to obtain a sequence of palm vein blood vessel distribution image block embedding vectors; and S1223, inputting the sequence of palm vein blood vessel distribution image block embedding vectors into the Transformer module of the ViT-based palm vein blood vessel distribution feature extractor to obtain the palm vein blood vessel distribution feature vector. The purpose of step S1221 is to perform image blocking on the palm vein blood vessel distribution image, dividing the whole image into a sequence of small blocks. This converts a large image into image blocks of a size suitable for ViT processing: the ViT model places certain limits on the size of its input, and blocking the image ensures that the image blocks meet those requirements. The purpose of step S1222 is to use the embedding layer of the ViT-based feature extractor to embed-encode each image block in the sequence, obtaining a sequence of embedding vectors for the palm vein blood vessel distribution image blocks.
An embedding vector is a vector representation of an image block that captures the block's feature information. In this step, the embedding layer converts each image block into a corresponding embedding vector, which is used for subsequent feature extraction and processing. The purpose of step S1223 is to input the sequence of palm vein blood vessel distribution image block embedding vectors into the Transformer module of the ViT-based feature extractor to obtain the final palm vein blood vessel distribution feature vector. The Transformer module is the core of the ViT model; it processes and integrates the input sequence of embedding vectors to extract the relationships between image blocks and global features. Through this step, the extractor generates a feature vector representing the entire palm vein blood vessel distribution, which can be used for subsequent tasks such as palm vein identification and verification.
More specifically, in step S1223, inputting the sequence of palm vein blood vessel distribution image block embedding vectors into the Transformer module of the ViT-based palm vein blood vessel distribution feature extractor to obtain the palm vein blood vessel distribution feature vector includes: inputting the sequence of embedding vectors into the Transformer module to obtain a plurality of palm vein blood vessel local distribution feature vectors; and cascading the plurality of palm vein blood vessel local distribution feature vectors to obtain the palm vein blood vessel distribution feature vector.
It is worth noting that ViT (Vision Transformer) is an image processing model based on the Transformer architecture. The Transformer was originally a model for natural language processing tasks but is applied in ViT to the field of image processing. A conventional convolutional neural network (CNN) divides an image into small blocks and processes each block; the ViT model instead takes the entire image as an input sequence and uses a self-attention mechanism to learn features in the image. The core of the ViT model is the encoder part of the Transformer, which is composed of multiple self-attention layers and feed-forward neural network layers. In a self-attention layer, the model can learn the relationships between different locations in the image and capture global context information. By stacking multiple self-attention layers, the ViT model gradually extracts features from the image and forms a feature representation. For the ViT model to process image data, the image must be partitioned, with each image block becoming an element of the sequence. Each image block is first embed-encoded by a linear projection layer and then, combined with a position encoding, passed as input to the ViT encoder; the position encoding provides the position of each image block within the sequence. Compared with conventional convolutional neural networks, the ViT model has advantages in processing large images and capturing global context information.
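Steps S1221 to S1223 can be sketched in NumPy: patch extraction, linear embedding with position encoding, and a single self-attention layer standing in for the Transformer module, followed by the cascading of the local vectors. The image size (16x16), patch size (4x4), embedding dimension, and random weights are all illustrative assumptions:

```python
# Hypothetical single-layer sketch of the ViT-style extractor (S1221-S1223).
import numpy as np

rng = np.random.default_rng(2)

# S1221: split a 16x16 palm-vein image into 4x4 patches -> sequence of 16 patches.
img = rng.random((16, 16))
P = 4
patches = img.reshape(4, P, 4, P).transpose(0, 2, 1, 3).reshape(16, P * P)

# S1222: embed each patch via a linear projection and add position encodings.
d_model = 8
W_embed = rng.standard_normal((P * P, d_model)) * 0.1
pos = rng.standard_normal((16, d_model)) * 0.01
tokens = patches @ W_embed + pos              # (16, 8)

# S1223: one self-attention layer standing in for the Transformer encoder.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model))    # each row: attention over patches
local_vectors = attn @ v                      # per-patch local distribution vectors

# Cascade (concatenate) the local vectors into one palm-vein feature vector.
vein_feature = local_vectors.ravel()          # (16 * 8 = 128,)
```

A real ViT stacks many such layers and learns all projections end to end; this sketch only mirrors the data flow described above.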
Then, feature interaction is performed on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-modal biometric vector. That is, the limitations of each modality are remedied through feature interaction, so as to fully exploit the complementarity of the two biometric traits.
In a specific example of the present disclosure, as shown in fig. 5, the encoding process of performing feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-modal biometric vector includes: S131, passing the face feature vector and the palm vein blood vessel distribution feature vector through a feature fusion module based on a bidirectional attention mechanism to obtain a palm vein blood vessel distribution feature vector containing facial features and a face feature vector containing blood vessel distribution features; and S132, fusing the palm vein blood vessel distribution feature vector containing facial features and the face feature vector containing blood vessel distribution features to obtain the multi-modal biometric vector. It should be understood that the feature fusion module in step S131 uses a bidirectional attention mechanism to perform feature interaction between the face feature vector and the palm vein blood vessel distribution feature vector, obtaining vector representations that fuse the two kinds of features. With the bidirectional attention mechanism, the module can simultaneously consider the relationship between the face feature vector and the palm vein blood vessel distribution feature vector. Specifically, the feature fusion module computes the attention weights of the face feature vector with respect to the palm vein blood vessel distribution feature vector, and the attention weights of the palm vein blood vessel distribution feature vector with respect to the face feature vector. As a result, the fused palm vein blood vessel distribution feature vector contains facial feature information, and the fused face feature vector contains blood vessel distribution feature information.
In step S132, the palm vein blood vessel distribution feature vector containing facial features and the face feature vector containing blood vessel distribution features are fused to obtain the multi-modal biometric vector. By fusing the two feature vectors, the information of the face and the palm veins can be fully exploited, improving the expressiveness and discriminability of the biometric features. The multi-modal biometric vector can then be used for subsequent tasks such as biometric recognition and verification, improving the accuracy and security of the system.
More specifically, in step S131, as shown in fig. 6, passing the face feature vector and the palm vein blood vessel distribution feature vector through the feature fusion module based on a bidirectional attention mechanism to obtain a palm vein blood vessel distribution feature vector containing facial features and a face feature vector containing blood vessel distribution features includes: S1311, concatenating the face feature vector and the palm vein blood vessel distribution feature vector to obtain a first concatenated vector; S1312, passing the first concatenated vector through a Softmax layer to obtain a first attention weight vector; S1313, multiplying the face feature vector by the first attention weight vector to obtain a first fused vector; S1314, passing the first fused vector through a fully connected layer to obtain the face feature vector containing blood vessel distribution features; S1315, passing the palm vein blood vessel distribution feature vector through a Softmax layer to obtain a second attention weight vector; S1316, multiplying the second attention weight vector by the palm vein blood vessel distribution feature vector to obtain a second fused vector; and S1317, passing the second fused vector through a fully connected layer to obtain the palm vein blood vessel distribution feature vector containing facial features.
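The fusion steps above can be sketched in NumPy. The patent leaves the tensor shapes implicit, so this sketch makes two labeled assumptions: both feature vectors share dimension d, and the concatenated 2d vector is projected back to d before the Softmax so that the element-wise product in S1313/S1316 is well defined. For symmetry, both attention branches are driven from the concatenated vector here, which smooths over the asymmetry in the S1315 wording:

```python
# Hypothetical sketch of the bidirectional-attention fusion (S1311-S1317, S132).
import numpy as np

rng = np.random.default_rng(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 6
face_vec = rng.random(d)          # face feature vector
vein_vec = rng.random(d)          # palm-vein distribution feature vector

def attend(query_vec, context_vec, W_attn, W_fc):
    # The 2d -> d projection before Softmax is an assumption (shapes are
    # implicit in the patent text).
    weights = softmax(context_vec @ W_attn)   # Softmax layer -> attention weights
    fused = query_vec * weights               # element-wise vector multiplication
    return np.tanh(fused @ W_fc)              # fully connected layer

concat = np.concatenate([face_vec, vein_vec])               # S1311
W_attn1 = rng.standard_normal((2 * d, d)) * 0.1
W_fc1 = rng.standard_normal((d, d)) * 0.1
face_with_vein = attend(face_vec, concat, W_attn1, W_fc1)   # S1312-S1314

W_attn2 = rng.standard_normal((2 * d, d)) * 0.1
W_fc2 = rng.standard_normal((d, d)) * 0.1
vein_with_face = attend(vein_vec, concat, W_attn2, W_fc2)   # S1315-S1317

# S132: fuse the two cross-attended vectors into one multimodal vector.
multimodal = np.concatenate([face_with_vein, vein_with_face])
```

In practice all projection matrices would be learned jointly with the extractors during the training step described later.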
The multi-modal biometric vector is then passed through a classifier to obtain a classification result, the classification result indicating whether the user to be verified is an authorized user. Accordingly, determining whether the user to be verified is an authorized user based on the multi-modal biometric vector includes: passing the multi-modal biometric vector through a classifier to obtain a classification result, where the classification result indicates whether the user to be verified is an authorized user.
More specifically, passing the multi-modal biometric vector through the classifier to obtain the classification result indicating whether the user to be verified is an authorized user includes: performing fully connected encoding on the multi-modal biometric vector using a fully connected layer of the classifier to obtain an encoded classification feature vector; and inputting the encoded classification feature vector into the Softmax classification function of the classifier to obtain the classification result.
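The classifier head (fully connected encoding followed by a two-way Softmax) can be sketched as follows. The dimensions and random weights are illustrative assumptions; p1 and p2 are the probabilities under the two classification labels and sum to one:

```python
# Hypothetical sketch of the classifier head: FC encoding + 2-way Softmax.
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 12
multimodal_vec = rng.random(d)    # fused multi-modal biometric vector

W_enc = rng.standard_normal((d, 8)) * 0.1   # fully connected encoding layer
W_out = rng.standard_normal((8, 2)) * 0.1   # output layer: 2 labels
encoded = np.tanh(multimodal_vec @ W_enc)
probs = softmax(encoded @ W_out)  # probs[0] = p1 (authorized), probs[1] = p2
authorized = bool(probs[0] > probs[1])
```

The decision simply picks the label with the higher probability; a deployed lock might instead require p1 to exceed a stricter threshold.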
That is, in the technical solution of the present disclosure, the labels of the classifier are "the user to be verified is an authorized user" (first label) and "the user to be verified is not an authorized user" (second label), and the classifier determines, via the Softmax function, which classification label the multi-modal biometric vector belongs to. It should be noted that the first label p1 and the second label p2 do not carry a manually assigned concept; during training, the computer model has no notion of "whether the user to be verified is an authorized user". There are simply two classification labels and the probabilities of the output feature under them, with the sum of p1 and p2 equal to one. The classification result of whether the user to be verified is an authorized user is thus converted, through the classification labels, into a classification probability distribution conforming to natural law; what is used is essentially the physical meaning of the labels' natural probability distribution rather than the linguistic meaning of "whether the user to be verified is an authorized user".
It should be appreciated that the role of the classifier is to learn classification rules from training data with given, known classes, and then to classify (or predict) unknown data. Logistic regression, SVMs, and the like are commonly used for binary classification; for multi-class classification they can also be used, but multiple binary classifiers must be composed, which is error-prone and inefficient. The commonly used multi-class method is the Softmax classification function.
Further, in the technical scheme of the present disclosure, the multi-modal recognition method fusing the face and the palm vein further includes a training step: training the CNN-based face feature extractor, the ViT-based palm vein blood vessel distribution feature extractor, the bidirectional-attention-based feature fusion module, and the classifier. The training step serves to learn the parameters of the overall multi-modal recognition system so that it can learn effective feature representations and classification decision capability from training data. Specifically, the training step covers the face feature extractor, the palm vein blood vessel distribution feature extractor, the feature fusion module, and the classifier. For the CNN-based face feature extractor, training adjusts the network parameters by back-propagation according to the face images and corresponding labels in the training data, so that the extractor can accurately extract features from face images and map them into a high-dimensional feature vector space. For the ViT-based palm vein blood vessel distribution feature extractor, training likewise adjusts the network parameters by back-propagation according to the palm vein blood vessel distribution images and corresponding labels, so that the extractor can accurately extract features from the blood vessel distribution images and map them into a high-dimensional feature vector space.
For the bidirectional-attention-based feature fusion module, training adjusts the module's parameters by back-propagation according to the face feature vectors, palm vein blood vessel distribution feature vectors, and corresponding labels in the training data, so that the module learns an effective mode of feature interaction and fuses the facial features with the blood vessel distribution features. Finally, for the classifier, training adjusts its parameters by back-propagation according to the multi-modal biometric vectors and corresponding labels, so that the classifier can make accurate classification decisions based on the fused feature vectors. Through the training step, the entire multi-modal recognition system learns to represent and fuse features from different modalities, improving the accuracy and robustness of biometric recognition.
Wherein, more specifically, the training step comprises: acquiring training data, wherein the training data comprises training face images and training palm vein vascular distribution images of users to be verified, and whether the users to be verified are real values of authorized users or not; the training face image passes through the face feature extractor based on the convolutional neural network model to obtain a training face feature vector; passing the training palm vein blood vessel distribution image through the ViT model-based palm vein blood vessel distribution feature extractor to obtain a training palm vein blood vessel distribution feature vector; the training human face feature vector and the training palm vein blood vessel distribution feature vector pass through the feature fusion module based on the bidirectional attention mechanism to obtain a training palm vein blood vessel distribution feature vector containing human face features and a training human face feature vector containing blood vessel distribution features; fusing the training palm vein blood vessel distribution feature vector containing the face features and the training face feature vector containing the blood vessel distribution features to obtain a training multi-mode biological feature vector; passing the training multi-modal biological feature vector through a classifier to obtain a classification loss function value; and training the face feature extractor based on the convolutional neural network model, the palm vein distribution feature extractor based on the ViT model, the feature fusion module based on the bidirectional attention mechanism and the classifier by using the classification loss function value, wherein in each round of iteration of training, the training multi-mode biological feature vector is subjected to weight space exploration constraint iteration based on regularization of a class matrix.
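The training step above reduces, at the classifier, to minimizing a classification loss by back-propagation. The following is a minimal NumPy stand-in for that last stage only; the synthetic "fused" features, labels, learning rate, and iteration count are illustrative assumptions, not the disclosed extractors or fusion module:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-in for fused multi-modal feature vectors and authorized/unauthorized labels
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic ground-truth "authorized" flag

W = np.zeros((8, 2))
b = np.zeros(2)
for _ in range(200):
    logits = X @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)          # softmax probabilities
    # gradient of the cross-entropy classification loss w.r.t. the logits
    g = p.copy()
    g[np.arange(len(y)), y] -= 1
    g /= len(y)
    # gradient-descent parameter update (back-propagation through the linear head)
    W -= 0.5 * (X.T @ g)
    b -= 0.5 * g.sum(axis=0)

acc = ((X @ W + b).argmax(axis=1) == y).mean()    # training accuracy after fitting
```

In the disclosed system the same loss gradient would additionally flow back through the fusion module and both feature extractors.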
In the technical scheme of the present disclosure, the training face feature vector and the training palm vein blood vessel distribution feature vector respectively express image semantic local association features of the training face image and the training palm vein blood vessel distribution image under different association dimensions. Therefore, after the two vectors pass through the feature fusion module based on the bidirectional attention mechanism to obtain the training palm vein blood vessel distribution feature vector containing face features and the training face feature vector containing blood vessel distribution features, and these are fused to obtain the training multi-mode biological feature vector, the training multi-mode biological feature vector simultaneously contains image semantic feature representations across feature extraction dimensions and across source image semantic modes. Its feature representation is thus diversified under different image distribution dimensions, so that when it is subjected to classification regression through the classifier, this diversified feature distribution adversely affects the convergence of the weight matrix of the classifier during iteration and thereby reduces the training efficiency.
Based on the above, when classifying the training multi-mode biological feature vector by the classifier, the applicant of the present disclosure performs, at each iteration of the weight matrix, a weight space exploration constraint based on class matrix regularization on the training multi-mode biological feature vector.
Accordingly, in one specific example, performing, in each round of iteration of the training, the weight space exploration constraint iteration based on class matrix regularization on the training multi-mode biological feature vector comprises: in each round of iteration of the training, carrying out the weight space exploration constraint iteration based on class matrix regularization on the training multi-mode biological feature vector with the following optimization formula to obtain an optimized training multi-mode biological feature vector; wherein, the optimization formula is:

V′ = V^T ⊗ M^T ⊗ A,  M′ = M ⊗ A

wherein V is the training multi-mode biological feature vector, expressed as a column vector, V′ is the optimized training multi-mode biological feature vector, expressed as a row vector, M is the weight matrix of the last iteration, A is a learnable domain transfer matrix, which can, for example, be initialized as the diagonal matrix formed by the diagonal elements of the weight matrix of the last iteration, ⊗ represents matrix multiplication, (·)^T represents a transpose operation, and M′ represents the weight matrix after iteration.
Here, considering the domain gap between the weight space domain of the weight matrix M and the probability distribution domain of the classification result of the training multi-mode biological feature vector V, the class matrix regularized representation of the weight matrix M relative to the training multi-mode biological feature vector V is used as an inter-domain transferring agent to transfer the valuable label-constrained probability distribution into the weight space. This avoids over-exploitation of the weight distribution in the weight space by the richly labeled probability distribution domain during weight-space-based classification, which improves the convergence effect of the weight matrix and thereby also improves the training effect of the classification regression of the training multi-mode biological feature vector through the classifier.
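The formula images in the published text survive only as symbol descriptions (column vector V, row-vector result V′, last-iteration weight matrix M, learnable domain transfer matrix A initialized from the diagonal of M, matrix multiplication, transpose, updated matrix M′). Under the dimensionally consistent reading V′ = V^T ⊗ M^T ⊗ A with M′ = M ⊗ A, one iteration might be sketched as follows; this is an assumption about the lost formula, not a verified transcription:

```python
import numpy as np

def weight_space_constraint_step(V, M, A=None):
    """One class-matrix-regularized constraint iteration, assuming a square
    weight matrix M (n x n) and a feature column vector V (length n)."""
    if A is None:
        # the text suggests initializing A from the diagonal of the last weight matrix
        A = np.diag(np.diag(M))
    V_opt = V.T @ M.T @ A   # optimized feature vector (row form in the text)
    M_next = M @ A          # weight matrix after the iteration
    return V_opt, M_next

rng = np.random.default_rng(2)
n = 6
V = rng.normal(size=(n,))
M = rng.normal(size=(n, n))
V_opt, M_next = weight_space_constraint_step(V, M)
```

The shapes are consistent: a length-n vector in, a length-n optimized vector and an n x n updated weight matrix out.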
In summary, the multi-mode identification method for fusing the human face and the palm vein according to the embodiments of the present disclosure can fully utilize the complementarity of the two biological features for multi-modal identification, avoiding identification failure caused by the influence of the environment or individual changes on a single biological feature.
Fig. 7 shows a block diagram of a multimodal recognition system 100 that fuses a human face and a palm vein in accordance with an embodiment of the present disclosure. As shown in fig. 7, a multi-modality recognition system 100 for fusing a human face and a palm vein according to an embodiment of the present disclosure includes: an image acquisition module 110, configured to acquire a face image and a palmar vein distribution image of a user to be authenticated; the image feature extraction module 120 is configured to perform image feature extraction on the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector; the feature interaction module 130 is configured to perform feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-mode biological feature vector; and a user authentication module 140, configured to determine, based on the multimodal biometric vector, whether the user to be authenticated is an authorized user.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described multi-modality recognition system 100 for merging a human face and a palm vein have been described in detail in the above description of the multi-modality recognition method for merging a human face and a palm vein with reference to fig. 1 to 6, and thus, repetitive descriptions thereof will be omitted.
As described above, the multi-modal identification system 100 fusing a human face and a palm vein according to the embodiment of the present disclosure may be implemented in various wireless terminals, such as a server or the like having a multi-modal identification algorithm fusing a human face and a palm vein. In one possible implementation, the multi-modality recognition system 100 that fuses a human face and a palm vein according to embodiments of the present disclosure may be integrated into a wireless terminal as one software module and/or hardware module. For example, the multi-modal identification system 100 that merges facial and palm veins may be a software module in the operating system of the wireless terminal, or may be an application developed for the wireless terminal; of course, the multi-modality recognition system 100 that merges the facial and palm veins can also be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the multi-modality recognition system 100 fusing the human face and the palm vein and the wireless terminal may be separate devices, and the multi-modality recognition system 100 may be connected to the wireless terminal through a wired and/or wireless network and exchange interactive information in an agreed data format.
Fig. 8 illustrates an application scenario diagram of a multimodal recognition method fusing a human face and a palm vein according to an embodiment of the present disclosure. As shown in fig. 8, in this application scenario, first, a face image (e.g., D1 illustrated in fig. 8) and a palm vein distribution image (e.g., D2 illustrated in fig. 8) of a user to be authenticated are acquired, and then the face image and the palm vein distribution image are input to a server (e.g., S illustrated in fig. 8) in which a multimodal recognition algorithm that fuses a face and a palm vein is deployed, wherein the server is capable of processing the face image and the palm vein distribution image using the multimodal recognition algorithm that fuses a face and a palm vein to obtain a classification result for indicating whether the user to be authenticated is an authorized user.
Furthermore, it should be noted that the technical solution of the present disclosure relies on face and palm vein algorithm recognition libraries. After the two recognition libraries are ported to the device side, a private protocol with the receiving device needs to be customized to complete content transmission. For face-only operation, the receiving end sends an instruction for identifying or enrolling a single face; the instruction is issued to the recognition device, which parses it and then enrolls or identifies the face. For palm-vein-only operation, the receiving end sends an identification or enrollment instruction distinct from the single-face instruction, and the palm vein is enrolled or identified after the instruction is issued to the recognition device. For the hybrid face-and-palm-vein mode, an instruction distinct from the single-palm and single-face instructions is used: the user can select which modality to enroll, hybrid identification can be performed, and under the same identification instruction either the palm or the face can be identified preferentially. Accordingly, Figs. 9-11 show schematic diagrams of the recognition flows based on the palm vein recognition instruction, the face recognition instruction, and the hybrid recognition instruction of the present disclosure, respectively.
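The instruction dispatch just described might be sketched as follows. The opcodes, names, and routing logic here are purely hypothetical placeholders, since the private protocol is device-specific and not disclosed:

```python
from enum import Enum

class Command(Enum):
    # hypothetical opcodes; the actual private protocol defines its own values
    ENROLL_FACE = 0x01
    VERIFY_FACE = 0x02
    ENROLL_PALM = 0x11
    VERIFY_PALM = 0x12
    VERIFY_HYBRID = 0x21  # device decides whether palm or face is tried first

def dispatch(cmd: Command) -> str:
    """Route a received instruction to the corresponding recognition flow."""
    if cmd in (Command.ENROLL_FACE, Command.VERIFY_FACE):
        return "face"
    if cmd in (Command.ENROLL_PALM, Command.VERIFY_PALM):
        return "palm"
    return "hybrid"
```

The single-face, single-palm, and hybrid instructions are deliberately distinct, matching the requirement that the receiving end send different commands per mode.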
Specifically, regarding palm vein recognition judgment, the recognition end needs to complete sensor configuration and adjust the exposure so that the palm veins are visible and clear feature images can be extracted at a suitable distance. First, the palm rectangle detection of the palm vein algorithm detects the image and extracts the size of the rectangle occupied by the palm in the current image. Second, key information is extracted from the rectangular range, mainly vein information and a small amount of palm print information, and the image quality, including parameters such as the exposure value, the palm distance and the palm inclination angle, is judged against the algorithm's expectations; a palm liveness threshold is then evaluated to ensure that the image comes from a real palm. After the above detection steps are completed, the key biological feature information of the palm is extracted. This information can be stored for comparison with feature information extracted later, and palms whose comparison score is higher than a certain threshold are regarded as the palm of the same person.
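The palm pipeline above (detect the palm rectangle, check quality and liveness, extract features, compare against stored templates with a score threshold) can be sketched schematically. The stub detectors and toy similarity below stand in for the actual palm vein algorithm library, and the quality/liveness checks are elided:

```python
def detect_palm_rect(img):
    # stub: pretend the whole frame is the palm whenever a frame is present
    return (0, 0, len(img), len(img[0])) if img else None

def extract_vein_features(img, rect):
    # stub feature: normalized row sums stand in for vein/palm-print descriptors
    total = sum(sum(row) for row in img) or 1
    return [sum(row) / total for row in img]

def match_score(f1, f2):
    # cosine similarity between two stored/probe feature lists
    num = sum(a * b for a, b in zip(f1, f2))
    den = (sum(a * a for a in f1) ** 0.5) * (sum(b * b for b in f2) ** 0.5) or 1
    return num / den

def recognize_palm(img, gallery, threshold=0.9):
    rect = detect_palm_rect(img)
    if rect is None:
        return None  # no palm in the frame
    feats = extract_vein_features(img, rect)
    # scores above the threshold are treated as the same person's palm
    best = max(gallery, key=lambda name: match_score(feats, gallery[name]), default=None)
    if best is not None and match_score(feats, gallery[best]) >= threshold:
        return best
    return None

img = [[1, 2], [3, 4]]
gallery = {"alice": extract_vein_features(img, (0, 0, 2, 2))}
result = recognize_palm(img, gallery)
```

Enrollment is simply storing the extracted feature list in the gallery; recognition is the thresholded comparison.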
Regarding face recognition judgment, face management also needs to complete sensor configuration. Face exposure adjustment must be distinguished from palm vein exposure: the face can be recognized at a greater distance, and a suitable exposure is adjusted so that a clear face image is obtained. Based on the clear face image, the face algorithm library is run: the image is first detected to obtain the number of faces and the rectangular coordinates where each face is located, with the face closer to the sensor conveniently singled out for identification, and face quality judgment and liveness judgment are completed after the key points of the current face are extracted. The key feature value information of the face is then extracted. This information can be stored for comparison with the information extracted at each subsequent face recognition, and faces whose comparison score is higher than a certain threshold are considered the same face.
Regarding the hybrid recognition judgment of the face and the palm, hybrid recognition depends on fine control of the image exposure. Based on the face library and the palm vein library, the image exposure is first adjusted for the palm; the palm detection interface is run, and the palm recognition step is executed immediately once a palm is detected. If there is no palm, the exposure value is adjusted to suit the face, and face detection is performed. After a round of judgment is completed, if a palm is present the palm recognition step is executed and the face recognition step is no longer executed; if there is no palm, face judgment is performed, and the face recognition step is executed once a face is present. When neither is recognized, the exposure is switched back from face exposure to palm exposure for another round of palm recognition judgment, with the image exposure finely adjusted during the process so that the exposure value in the second round differs from the previous one. The above steps are repeated, and the process stops after a certain time if neither the palm nor the face is recognized.
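The alternating-exposure loop above can be sketched as a small state machine. The frame source, detector callbacks, and round limit are hypothetical placeholders for the actual libraries and the timeout:

```python
def hybrid_recognize(frames, detect_palm, detect_face, max_rounds=3):
    """Alternate between palm-tuned and face-tuned exposure: try the palm first;
    if absent, switch exposure and try the face; repeat with adjusted exposure,
    giving up after a bounded number of rounds (the timeout in the text)."""
    exposure = "palm"  # hybrid mode starts with palm-suited exposure
    for _ in range(max_rounds):
        for frame in frames:
            if exposure == "palm" and detect_palm(frame):
                return "palm", frame   # palm found: run palm recognition, skip face
            if exposure == "face" and detect_face(frame):
                return "face", frame   # face found: run face recognition
            # nothing found at this exposure: switch modality for the next attempt
            exposure = "face" if exposure == "palm" else "palm"
    return None, None  # neither modality recognized before the round limit

mode, _ = hybrid_recognize(["face_frame"],
                           detect_palm=lambda f: "palm" in f,
                           detect_face=lambda f: "face" in f)
```

A real implementation would additionally perturb the exposure value between rounds, as the text requires, rather than reusing the previous setting.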
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A multi-mode identification method for fusing a human face and a palm vein is characterized by comprising the following steps:
acquiring a face image and a palm vein blood vessel distribution image of a user to be verified;
extracting image features of the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector;
performing feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multi-mode biological feature vector; and
and determining whether the user to be verified is an authorized user based on the multi-modal biometric vector.
2. The multi-modal identification method of merging a human face and a palm vein according to claim 1, wherein performing image feature extraction on the human face image and the palm vein blood vessel distribution image to obtain a human face feature vector and a palm vein blood vessel distribution feature vector, comprises:
the face image is passed through a face feature extractor based on a convolutional neural network model to obtain the face feature vector; and
the palm vein blood vessel distribution image is passed through a palm vein blood vessel distribution feature extractor based on ViT model to obtain the palm vein blood vessel distribution feature vector.
3. The method of claim 2, wherein passing the palm vein blood vessel distribution image through a ViT model-based palm vein blood vessel distribution feature extractor to obtain the palm vein blood vessel distribution feature vector comprises:
performing image blocking processing on the palm vein blood vessel distribution image to obtain a sequence of palm vein blood vessel distribution image blocks;
using an embedding layer of the palm vein distribution feature extractor based on the ViT model to respectively carry out embedding encoding on each palm vein distribution image block in the sequence of the palm vein distribution image blocks so as to obtain a sequence of palm vein distribution image block embedding vectors; and
Inputting the sequence of the palm vein distribution image block embedded vectors into a converter module of the palm vein distribution feature extractor based on the ViT model to obtain the palm vein distribution feature vectors.
4. A method of multimodal recognition of fused faces and palm veins as claimed in claim 3, wherein inputting the sequence of palm vein blood vessel distribution image block embedded vectors into the converter module of the ViT model-based palm vein blood vessel distribution feature extractor to obtain the palm vein blood vessel distribution feature vector comprises:
inputting the sequence of the palm vein blood vessel distribution image block embedded vectors into a converter module of the palm vein blood vessel distribution feature extractor based on the ViT model to obtain a plurality of palm vein blood vessel local distribution feature vectors; and
cascading the plurality of palm vein blood vessel local distribution feature vectors to obtain the palm vein blood vessel distribution feature vector.
5. The method for multimodal recognition of fused face and palm vein according to claim 4, wherein performing feature interaction on the face feature vector and the palm vein blood vessel distribution feature vector to obtain a multimodal biometric vector comprises:
The facial feature vector and the palm vein blood vessel distribution feature vector pass through a feature fusion module based on a bidirectional attention mechanism to obtain a palm vein blood vessel distribution feature vector containing facial features and a facial feature vector containing blood vessel distribution features; and
and fusing the palm vein blood vessel distribution feature vector containing the human face features and the human face feature vector containing the blood vessel distribution features to obtain the multi-mode biological feature vector.
6. The method for multimodal recognition of a fused face and palmar vein according to claim 5, wherein passing the face feature vector and the palmar vein blood vessel distribution feature vector through a feature fusion module based on a bidirectional attention mechanism to obtain a palmar vein blood vessel distribution feature vector containing a face feature and a face feature vector containing a blood vessel distribution feature, comprises:
the face feature vector and the palm vein blood vessel distribution feature vector are connected in series to obtain a first serial vector;
passing the first series of vectors through a Softmax layer to obtain a first attention weight vector;
vector multiplication is carried out on the face feature vector and the first attention weight vector to obtain a first fusion vector;
The first fusion vector passes through a full connection layer to obtain the face feature vector containing the blood vessel distribution features;
passing the palm vein vascular distribution feature vector through a Softmax layer to obtain a second attention weight vector;
vector multiplication is carried out on the second attention weight vector and the palm vein blood vessel distribution feature vector to obtain a second fusion vector; and
and passing the second fusion vector through a full connection layer to obtain the palm vein blood vessel distribution feature vector containing the face features.
7. The method for multimodal recognition of fused face and palm vein of claim 6 wherein determining whether the user to be authenticated is an authorized user based on the multimodal biometric vector comprises:
and the multi-mode biological feature vector passes through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the user to be verified is an authorized user or not.
8. The method for multimodal recognition of fused face and palm veins of claim 7, further comprising the training step of: training the face feature extractor based on the convolutional neural network model, the palm vein vascular distribution feature extractor based on the ViT model, the feature fusion module based on the bidirectional attention mechanism and the classifier;
Wherein the training step comprises:
acquiring training data, wherein the training data comprises training face images and training palm vein vascular distribution images of users to be verified, and whether the users to be verified are real values of authorized users or not;
the training face image passes through the face feature extractor based on the convolutional neural network model to obtain a training face feature vector;
passing the training palm vein blood vessel distribution image through the ViT model-based palm vein blood vessel distribution feature extractor to obtain a training palm vein blood vessel distribution feature vector;
the training human face feature vector and the training palm vein blood vessel distribution feature vector pass through the feature fusion module based on the bidirectional attention mechanism to obtain a training palm vein blood vessel distribution feature vector containing human face features and a training human face feature vector containing blood vessel distribution features;
fusing the training palm vein blood vessel distribution feature vector containing the face features and the training face feature vector containing the blood vessel distribution features to obtain a training multi-mode biological feature vector;
passing the training multi-modal biological feature vector through a classifier to obtain a classification loss function value; and
Training the face feature extractor based on the convolutional neural network model, the palm vein vascular distribution feature extractor based on the ViT model, the feature fusion module based on the bidirectional attention mechanism and the classifier by using the classification loss function value, wherein in each round of iteration of training, the training multi-mode biological feature vector is subjected to weight space exploration constraint iteration based on class matrix regularization.
9. The method of claim 8, wherein in each iteration of the training, performing a weight space exploration constraint iteration based on a class matrix regularization on the training multi-modal biometric vector comprises:
in each round of iteration of the training, carrying out weight space exploration constraint iteration based on class matrix regularization on the training multi-mode biological feature vector by using the following optimization formula to obtain an optimized training multi-mode biological feature vector;
wherein, the optimization formula is:

V′ = V^T ⊗ M^T ⊗ A,  M′ = M ⊗ A

wherein V is the training multi-mode biological feature vector, expressed as a column vector, V′ is the optimized training multi-mode biological feature vector, expressed as a row vector, M is the weight matrix of the last iteration, A is a learnable domain transfer matrix, ⊗ represents matrix multiplication, (·)^T represents a transpose operation, and M′ represents the weight matrix after iteration.
10. A multi-modal recognition system for fusing a human face and a palm vein, characterized by comprising:
the image acquisition module is used for acquiring a face image and a palm vein blood vessel distribution image of a user to be verified;
the image feature extraction module is used for extracting image features of the face image and the palm vein blood vessel distribution image to obtain a face feature vector and a palm vein blood vessel distribution feature vector;
the characteristic interaction module is used for carrying out characteristic interaction on the face characteristic vector and the palm vein blood vessel distribution characteristic vector to obtain a multi-mode biological characteristic vector; and
and the user verification module is used for determining whether the user to be verified is an authorized user or not based on the multi-mode biological feature vector.
CN202311241558.4A 2023-09-25 2023-09-25 Multi-mode recognition system and method for fusing human face and palm vein Active CN117292443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311241558.4A CN117292443B (en) 2023-09-25 2023-09-25 Multi-mode recognition system and method for fusing human face and palm vein


Publications (2)

Publication Number Publication Date
CN117292443A true CN117292443A (en) 2023-12-26
CN117292443B CN117292443B (en) 2024-06-07

Family

ID=89243851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311241558.4A Active CN117292443B (en) 2023-09-25 2023-09-25 Multi-mode recognition system and method for fusing human face and palm vein

Country Status (1)

Country Link
CN (1) CN117292443B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321871A (en) * 2019-07-11 2019-10-11 电子科技大学成都学院 A kind of palm vein identification system and method based on LSTM
CN113793326A (en) * 2021-09-18 2021-12-14 北京致远慧图科技有限公司 Disease identification method and device based on image
CN114218380A (en) * 2021-12-03 2022-03-22 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114549946A (en) * 2022-02-21 2022-05-27 中山大学 Cross-modal attention mechanism-based multi-modal personality identification method and system
CN114663957A (en) * 2022-03-30 2022-06-24 杭州网易智企科技有限公司 Face detection method, and training method and device of face detection model
CN114707530A (en) * 2020-12-17 2022-07-05 南京理工大学 Bimodal emotion recognition method and system based on multi-source signal and neural network
CN115797948A (en) * 2022-12-15 2023-03-14 上海高德威智能交通***有限公司 Character recognition method, device and equipment
CN115830686A (en) * 2022-12-13 2023-03-21 云指智能科技(广州)有限公司 Biological recognition method, system, device and storage medium based on feature fusion
CN116363712A (en) * 2023-03-21 2023-06-30 中国矿业大学 Palmprint palm vein recognition method based on modal informativity evaluation strategy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Weibin et al.: "Dual-modal biometric recognition method based on feature fusion", Journal of Tianjin University of Science and Technology, vol. 37, no. 4, 31 August 2022 (2022-08-31), pages 44-48 *
自监督搬砖: "Detailed steps for extracting image classification features in ViT", pages 1-3, Retrieved from the Internet <URL:https://blog.csdn.net/weixin_42089699/article/details/123277453> *

Also Published As

Publication number Publication date
CN117292443B (en) 2024-06-07

Similar Documents

Publication Publication Date Title
KR102439938B1 (en) Multi-modal fusion method for user authentification and user authentification method
Lim et al. Efficient iris recognition through improvement of feature vector and classifier
EP2091021A1 (en) Face authentication device
KR20170046448A (en) Method and device for complex authentication
KR102483650B1 (en) User verification device and method
Kouamo et al. Fingerprint recognition with artificial neural networks: application to e-learning
Syazana-Itqan et al. A MATLAB-based convolutional neural network approach for face recognition system
Menezes et al. Automatic attendance management system based on deep one-shot learning
Nafi et al. An advanced door lock security system using palmtop recognition system
Al-Ghrairi et al. Face detection and recognition with 180 degree rotation based on principal component analysis algorithm
KR102215535B1 (en) Partial face image based identity authentication method using neural network and system for the method
Rose et al. Deep learning based estimation of facial attributes on challenging mobile phone face datasets
CN117292443B (en) Multi-mode recognition system and method for fusing human face and palm vein
Sharma et al. Multimodal classification using feature level fusion and SVM
CN113505716B (en) Training method of vein recognition model, and recognition method and device of vein image
JP5279007B2 (en) Verification system, verification method, program, and recording medium
Mohite et al. Deep learning based card-less ATM using fingerprint and face recognition techniques
CN111428670B (en) Face detection method, face detection device, storage medium and equipment
Sharma et al. Multimodal biometric system fusion using fingerprint and face with fuzzy logic
Paul et al. A smart home security system in low computing IOT environment
TW202203055A (en) Biometric identification system and identification method thereof
CN113190819A (en) Authentication method and device based on living body fusion characteristics and computer equipment
Baaqeel et al. A self-adapting face authentication system with deep learning
Charishma et al. Smart Attendance System with and Without Mask using Face Recognition
Mishra et al. A robust two quadrant sparse classifier for partially occluded face image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant