CN112016523A - Cross-modal face recognition method, device, equipment and storage medium


Info

Publication number
CN112016523A
Authority
CN
China
Prior art keywords
modal
cross
face recognition
mode
training
Prior art date: 2020-09-25
Legal status
Granted
Application number
CN202011027564.6A
Other languages
Chinese (zh)
Other versions
CN112016523B (en)
Inventor
Tian Fei (田飞)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2020-12-01
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011027564.6A
Publication of CN112016523A
Application granted
Publication of CN112016523B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification


Abstract

The application discloses a cross-modal face recognition method, apparatus, device, and storage medium, relating to the field of artificial intelligence and in particular to deep learning and computer vision. The specific implementation scheme is as follows: for each of at least two modalities, single-modal training is performed on a face recognition model using a sample user's face image data of that modality; cross-modal training is then performed on the face recognition model using the sample user's face image data of the at least two modalities, with the single-modal and cross-modal training processes sharing model parameters; and the face recognition model after the single-modal and cross-modal training is taken as the cross-modal face recognition model to be used. The method and apparatus can improve the accuracy of cross-modal face recognition.

Description

Cross-modal face recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, in particular to deep learning and computer vision, and specifically to a cross-modal face recognition method, apparatus, device, and storage medium.
Background
Like other human biometric characteristics such as fingerprints and irises, the human face is unique and difficult to copy, so identity verification, i.e., face recognition, can be performed based on facial feature information. Face recognition technology has been widely applied in many areas of work and daily life.
Face images of different modalities can be collected by different types of image collectors, and how to recognize face images across these modalities is an important problem in the industry.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for cross-modal face recognition.
According to an aspect of the present disclosure, a method of cross-modal face recognition is provided, including:
for each of at least two modalities, performing single-modal training on a face recognition model using a sample user's face image data of that modality;
performing cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities, wherein the single-modal and cross-modal training processes share model parameters; and
taking the face recognition model after the single-modal training and the cross-modal training as a cross-modal face recognition model to be used.
According to another aspect of the present disclosure, there is provided an apparatus for cross-modal face recognition, including:
a single-modal training module, configured to perform, for each of at least two modalities, single-modal training on a face recognition model using a sample user's face image data of that modality;
a cross-modal training module, configured to perform cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities, wherein the single-modal and cross-modal training processes share model parameters; and
a model determining module, configured to take the face recognition model after the single-modal training and the cross-modal training as a cross-modal face recognition model to be used.
According to a third aspect, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of cross-modality face recognition as described in any one of the embodiments of the present application.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method of cross-modal face recognition as described in any one of the embodiments of the present application.
According to the technology of the present application, the accuracy of cross-modal face recognition can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
Fig. 1a is a schematic flowchart of a cross-modal face recognition method according to an embodiment of the present application;
Fig. 1b is a schematic structural diagram of a face recognition model according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of another cross-modal face recognition method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of another cross-modal face recognition method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an apparatus for cross-modal face recognition according to an embodiment of the present application;
Fig. 5 is a block diagram of an electronic device for implementing a cross-modal face recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1a is a schematic flowchart of a cross-modal face recognition method according to an embodiment of the present application. This embodiment is applicable to constructing a cross-modal face recognition model for recognizing face images of different modalities. The cross-modal face recognition method disclosed in this embodiment may be executed by an electronic device, and specifically by an apparatus for cross-modal face recognition, which may be implemented in software and/or hardware and configured in the electronic device. Referring to Fig. 1a, the cross-modal face recognition method provided in this embodiment includes:
and S110, aiming at each mode of at least two modes, performing single-mode training on the face recognition model by adopting the face image data of the mode of the sample user.
In this embodiment of the application, sample face image data of at least two modalities may be available, with the face images of different modalities collected by different types of image collectors. The number and types of the modalities are not specifically limited in this embodiment.
Specifically, for each modality, the sample data of that modality, i.e., the sample user's face image data of that modality, is used to perform single-modal training on the face recognition model: a loss function for that modality is constructed during single-modal training, and the model parameters of the face recognition model are adjusted according to that loss function. It should be noted that the single-modal training loss functions of different modalities are independent of each other and do not interfere with one another.
Through single-modal training in each modality, the trained face recognition model achieves good face recognition performance in each modality. Taking RGB (Red Green Blue) and NIR (Near Infrared) as the at least two modalities as an example, single-modal training is performed on the face recognition model using the RGB sample data and, separately, using the NIR sample data; that is, a loss function for the RGB single modality and a loss function for the NIR single modality are constructed, and these loss functions are independent of each other.
S120: perform cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities.
Specifically, the face image data of at least two modalities are used to train the face recognition model simultaneously: the face image data of the at least two modalities serve as inputs to the face recognition model, a cross-modal loss function covering the data of the at least two modalities is constructed, and the model parameters of the face recognition model are adjusted based on this cross-modal loss function. Because the cross-modal loss function contains information from at least two modalities, it can fit the feature distributions of the different modalities so that they become similar, and it can also mitigate the model overfitting caused by imbalanced sample data across modalities.
In this embodiment of the application, model parameters are shared across the single-modal training of different modalities, and between the single-modal and cross-modal training processes; that is, the model parameters are shared across all training processes. Fig. 1b is a schematic structural diagram of a face recognition model according to an embodiment of the present application. Referring to Fig. 1b, the face recognition model may include an input layer, a feature extraction layer, and a multitask output layer. The input layer may include an input unit for each single modality and a cross-modal input unit; the feature extraction layer is multiplexed (shared); and the multitask output layer may include an output unit for each single-modal face recognition task and a cross-modal output unit. During single-modal training of each modality, the sample data of that modality are fed into the corresponding single-modal input unit, and the output information of the single-modal task is obtained after feature extraction through the shared feature extraction layer. During cross-modal training, sample data of different modalities are fed into the cross-modal input unit, and the output information of the cross-modal task is obtained after passing through the shared feature extraction layer.
Correspondingly, during each single-modal training process, a loss function for that single modality is constructed from the label information and output information of that modality; during cross-modal training, a cross-modal loss function is constructed from the label information and output information of the at least two modalities. The different loss functions are independent and do not interfere with one another. A sketch of this shared-parameter multitask structure follows.
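To make this shared-parameter multitask structure concrete, the following is a minimal PyTorch sketch, assuming a toy backbone, a 512-dimensional feature, identity-classification heads, and 3-channel input for both modalities; the class name, layer sizes, and head layout are illustrative assumptions, not the patent's actual implementation.

import torch
import torch.nn as nn

class CrossModalFaceModel(nn.Module):
    # Shared (multiplexed) feature extraction layer with one output head
    # per single-modal task plus a cross-modal head, as in Fig. 1b.
    def __init__(self, feat_dim=512, num_ids=1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.heads = nn.ModuleDict({
            "rgb": nn.Linear(feat_dim, num_ids),
            "nir": nn.Linear(feat_dim, num_ids),
            "cross": nn.Linear(feat_dim, num_ids),
        })

    def forward(self, images, task):
        features = self.backbone(images)  # shared parameters for every task
        return features, self.heads[task](features)

# Each task builds its own loss from its own labels and outputs; the losses
# stay independent, but all of them update the shared backbone.
criterion = nn.CrossEntropyLoss()

Because every task's gradient flows through the same backbone, the single-modal and cross-modal training processes share model parameters exactly as described above.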
S130: take the face recognition model after the single-modal training and the cross-modal training as the cross-modal face recognition model to be used.
Because the cross-modal face recognition model undergoes both single-modal training and cross-modal training, the at least two single-modal loss functions give it good performance in each individual modality, and the cross-modal loss function gives it good performance across modalities; that is, the constructed cross-modal face recognition model can recognize face images of at least two modalities.
In an alternative embodiment, the sample data volumes of the at least two modalities differ. When the sample data of different modalities differ greatly in volume, performing only cross-modal training (which may be called hybrid training) on the face recognition model with the sample data of the at least two modalities easily leads to overfitting, resulting in low face recognition accuracy for the modality with fewer samples. By combining single-modal training with cross-modal training, this embodiment avoids the data sampling balance problem and can meet face recognition accuracy requirements in each modality.
According to the above technical solution, performing single-modal training on the face recognition model with each modality's sample data gives the cross-modal face recognition model good performance in each modality, while performing cross-modal training with the sample data of at least two modalities reduces the feature distribution differences between modalities and further improves face recognition performance. In particular, when the sample volumes of different modalities are imbalanced, model overfitting can be avoided and face recognition accuracy improved.
Fig. 2 is a schematic flowchart of another cross-modal face recognition method according to an embodiment of the present application. This embodiment is an alternative proposed on the basis of the above embodiments. Referring to Fig. 2, the cross-modal face recognition method provided in this embodiment includes:
s210, aiming at each mode of at least two modes, adopting the face image data of the mode of the sample user to carry out single-mode training on the face recognition model.
For convenience of description, the two modalities RGB and NIR are used as an example. Specifically, during training, the sample data of each batch may be divided into three parts: RGB samples, NIR samples, and mixed RGB and NIR samples. RGB single-modal training is performed on the face recognition model using the RGB samples, and NIR single-modal training is performed on the face recognition model using the NIR samples; a split along these lines is sketched below.
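As a rough illustration of that batch layout, here is a hedged sketch; the per-sample modality tag convention (0 = RGB, 1 = NIR) and the choice to use the whole batch as the mixed part are assumptions, since the patent does not specify the exact split.

import torch

def split_batch(images, labels, modality):
    # Single-modality parts, selected by the per-sample modality tag.
    rgb_part = (images[modality == 0], labels[modality == 0])
    nir_part = (images[modality == 1], labels[modality == 1])
    # Mixed RGB+NIR part, used for the cross-modal task.
    mixed_part = (images, labels)
    return rgb_part, nir_part, mixed_part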
S220: interpolate the sample user's face image data of the at least two modalities to obtain processed face image data for each modality.
Specifically, an association is established between the samples of each modality belonging to the same sample user in the mixed sample, and interpolation is applied to the sample user's face image data of the at least two modalities, so that the interpolated face image data of a single modality contains both that modality's own face image data and face image data of the other modalities. Still taking the RGB and NIR modalities as an example, an association is established between a sample user's RGB sample and NIR sample; the sample user's interpolated RGB and NIR samples are then determined from the original RGB and NIR samples, so that the interpolated RGB sample carries the information of both the original RGB sample and the original NIR sample, as does the interpolated NIR sample. Because the interpolated single-modal face image data fuses the information of other modalities with its own, performing cross-modal training on the interpolated data brings the feature distributions of different modalities closer together; that is, this image smoothing makes the feature distributions of different modalities more similar, thereby improving the accuracy of cross-modal face recognition.
In an alternative embodiment, S220 includes: generating an interpolation coefficient for the sample user, wherein the mean value of the interpolation coefficients is a fixed value; and determining the processed face image data of each modality according to the sample user's face image data of that modality, the interpolation coefficient, and the face image data of the other modalities.
The interpolation coefficient is the weight with which the original modality is retained during interpolation. Specifically, each modality's original face image data is fused with the original face image data of the other modalities according to the interpolation coefficient, yielding new, interpolated face image data for each modality.
Still taking the RGB and NIR modalities as an example, the interpolation can be performed with the following formulas:
RGB_IMG = rgb_img × gamma + nir_img × (1 - gamma)
NIR_IMG = nir_img × gamma + rgb_img × (1 - gamma)
where RGB_IMG and NIR_IMG are the pixel values of the interpolated RGB and NIR face images respectively, rgb_img and nir_img are the pixel values of the original RGB and NIR face images before interpolation, and gamma is the interpolation coefficient.
In this embodiment of the application, the interpolation coefficients may be random floating-point numbers whose mean is a fixed value between 0.9 and 1, so that the interpolation preserves most of each original modality's information. A minimal sketch follows.
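A minimal sketch of this interpolation, directly implementing the two formulas above; the Gaussian sampling of gamma (mean 0.95, clamped to [0, 1]) is an illustrative assumption, since the text only fixes the mean to a value between 0.9 and 1, and the RGB and NIR batches are assumed to be aligned per sample user according to the association described in S220.

import torch

def interpolate_pair(rgb_img, nir_img, gamma_mean=0.95, gamma_std=0.02):
    # One interpolation coefficient per sample; a mean near 1 lets each
    # interpolated image keep most of its own modality's information.
    gamma = torch.normal(gamma_mean, gamma_std,
                         size=(rgb_img.size(0), 1, 1, 1)).clamp(0.0, 1.0)
    rgb_interp = rgb_img * gamma + nir_img * (1 - gamma)  # RGB_IMG
    nir_interp = nir_img * gamma + rgb_img * (1 - gamma)  # NIR_IMG
    return rgb_interp, nir_interp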
S230: perform cross-modal training on the face recognition model using the processed face image data of the at least two modalities.
The model parameters are shared across the single-modal training of the different modalities, and between the single-modal training process and the cross-modal training process. Specifically, the processed face image data of the at least two modalities serve as inputs to the face recognition model, a cross-modal loss function is constructed, and the model parameters are adjusted based on the cross-modal loss function.
S240: take the face recognition model after the single-modal training and the cross-modal training as the cross-modal face recognition model to be used.
According to the technical solution of this embodiment, introducing image smoothing during cross-modal training further reduces the feature distribution differences between modalities, thereby further improving cross-modal face recognition performance.
Fig. 3 is a schematic flowchart of another cross-modal face recognition method according to an embodiment of the present application. This embodiment is an alternative proposed on the basis of the above embodiments. Referring to Fig. 3, the cross-modal face recognition method provided in this embodiment includes:
and S310, aiming at each mode of at least two modes, performing single-mode training on the face recognition model by adopting the face image data of the mode of the sample user.
S320: during cross-modal training, use the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model.
Specifically, the face image data of the at least two modalities are used as inputs to the face recognition model to obtain at least two modality features; for example, the output of the feature extraction layer in the face recognition model can be used directly as the modality features.
S330: based on a pre-constructed feature classifier, use the at least two modality features as inputs to the feature classifier to obtain the cross-modal feature distance output by the feature classifier.
The feature classifier is used to classify the features of different modalities; that is, features of different modalities fed into the feature classifier yield different outputs. Taking the RGB and NIR modalities as an example, the feature classifier may output 1 when an RGB feature is input and -1 when an NIR feature is input.
Specifically, the at least two modality features can be used as inputs to the feature classifier, a loss function between the different modality features is constructed, and the value of this loss function is taken as the distance between the modality features, i.e., the cross-modal feature distance. Still taking the RGB and NIR modalities as an example, the loss function between the modality features may be -E(D(NIR_feature)) + E(D(RGB_feature)), where D(NIR_feature) is the feature classifier's output for an NIR feature, D(RGB_feature) is its output for an RGB feature, and E denotes averaging, i.e., the NIR feature outputs are averaged over a batch of samples, as are the RGB feature outputs. A direct sketch follows.
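Written out directly, and with all names illustrative assumptions, the distance might look as follows; D is the pre-constructed feature classifier, and the batch mean plays the role of E. The form happens to resemble a Wasserstein-style critic objective, though the patent does not name it as such.

def cross_modal_distance(classifier, rgb_feat, nir_feat):
    # -E(D(NIR_feature)) + E(D(RGB_feature)), averaged over the batch.
    return -classifier(nir_feat).mean() + classifier(rgb_feat).mean()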
In an alternative embodiment, the method further includes: using the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model; and training the feature classifier on the at least two modality features, so that the feature classifier classifies the features of different modalities.
By pre-training the feature classifier, it learns to classify the features of different modalities, i.e., to separate the original feature distributions of the modalities. This embodiment of the application does not specifically limit the structure of the feature classifier; to improve training and inference efficiency, the feature classifier may be a lightweight network with a small number of layers, for example a MobileNet. A pre-training sketch follows.
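A hedged sketch of this pre-training step, using a small MLP in place of the MobileNet mentioned above and a squared-error pull toward the +1/-1 targets from the RGB/NIR example; the architecture, optimizer, and loss form are all assumptions.

import torch
import torch.nn as nn

feature_classifier = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1)
)
clf_opt = torch.optim.Adam(feature_classifier.parameters(), lr=1e-4)

def classifier_step(rgb_feat, nir_feat):
    # detach(): only the classifier is trained here, not the recognition model.
    score_rgb = feature_classifier(rgb_feat.detach())
    score_nir = feature_classifier(nir_feat.detach())
    # Pull RGB scores toward +1 and NIR scores toward -1.
    loss = ((score_rgb - 1.0) ** 2).mean() + ((score_nir + 1.0) ** 2).mean()
    clf_opt.zero_grad()
    loss.backward()
    clf_opt.step()
    return loss.item()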
S340: train the face recognition model according to the cross-modal feature distance.
The model parameters are shared across the single-modal training of the different modalities, and between the single-modal training process and the cross-modal training process.
Training the face recognition model according to the cross-modal feature distance adjusts the modality features so that the adjusted features of different modalities become similar and the feature classifier can no longer distinguish them. That is, the cross-modal feature distance is introduced into the training of the face recognition model until the feature classifier cannot classify the adjusted modality features, for example until the classifier's output hovers near 0. Training the cross-modal face recognition model with the cross-modal feature distance adjusts the distribution of each modality's features, further reducing the differences between the modality feature distributions and further improving the accuracy of cross-modal face recognition. An update-step sketch follows.
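Reusing CrossModalFaceModel and feature_classifier from the sketches above, this update step might look as follows; freezing the classifier during the model update and the choice of Adam are assumptions.

import torch

model = CrossModalFaceModel()
model_opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def align_step(rgb_imgs, nir_imgs):
    rgb_feat, _ = model(rgb_imgs, task="cross")
    nir_feat, _ = model(nir_imgs, task="cross")
    # Freeze the classifier so gradients only adjust the recognition model.
    for p in feature_classifier.parameters():
        p.requires_grad_(False)
    distance = (-feature_classifier(nir_feat).mean()
                + feature_classifier(rgb_feat).mean())
    model_opt.zero_grad()
    distance.backward()
    model_opt.step()
    for p in feature_classifier.parameters():
        p.requires_grad_(True)
    return distance.item()

Alternating align_step with the classifier_step sketched earlier drives the two modality score distributions together until the classifier's output hovers near 0, as described above.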
S350: take the face recognition model after the single-modal training and the cross-modal training as the cross-modal face recognition model to be used.
It should be noted that S220 and S230 above narrow the gap between the feature distributions of different modalities by introducing image smoothing, while S320-S340 adjust the modality feature distributions by introducing the cross-modal feature distance (a loss function between modality features) until the adjusted distributions are similar enough that the feature classifier cannot distinguish them. The image smoothing and the loss function between modality features can also both be introduced.
According to the above technical solution, during cross-modal training, the face images of at least two modalities are used as inputs to the face recognition model to obtain at least two modality features; these modality features are used as inputs to the pre-trained feature classifier; the cross-modal feature distance is determined from the output of the feature classifier; and the model parameters of the face recognition model are adjusted according to this distance. In other words, the modality feature distributions are adjusted until the feature classifier can no longer distinguish them, i.e., until the adjusted distributions are similar, which further improves cross-modal face recognition performance and achieves cross-modal feature domain adaptation. By pulling the feature distribution of the modality with fewer samples toward that of the modality with more samples, the modality with abundant data can be leveraged to improve the recognition of the modality with scarce data.
Fig. 4 is a schematic structural diagram of an apparatus for cross-modal face recognition according to an embodiment of the present application. Referring to Fig. 4, the apparatus 400 for cross-modal face recognition provided in this embodiment of the application may include:
a single-modal training module 401, configured to perform, for each of at least two modalities, single-modal training on the face recognition model using the sample user's face image data of that modality;
a cross-modal training module 402, configured to perform cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities, wherein the single-modal and cross-modal training processes share model parameters; and
a model determining module 403, configured to take the face recognition model after the single-modal training and the cross-modal training as the cross-modal face recognition model to be used.
In an alternative embodiment, the cross-modality training module 402 includes:
an image interpolation unit, configured to interpolate the sample user's face image data of the at least two modalities to obtain processed face image data for each modality; and
a cross-modal training unit, configured to perform cross-modal training on the face recognition model using the processed face image data of the at least two modalities.
In an alternative embodiment, the image interpolation unit comprises:
an interpolation coefficient subunit, configured to generate an interpolation coefficient for the sample user, wherein the mean value of the interpolation coefficients is a fixed value; and
an image interpolation subunit, configured to determine the processed face image data of each modality according to the sample user's face image data of that modality, the interpolation coefficient, and the face image data of the other modalities.
In an alternative embodiment, the cross-modality training module 402 includes:
a modality feature determining unit, configured to, during cross-modal training, use the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model;
a modality feature distance unit, configured to, based on a pre-constructed feature classifier, use at least two modality features as inputs to the feature classifier to obtain the cross-modal feature distance output by the feature classifier; and
a model training unit, configured to train the face recognition model according to the cross-modal feature distance.
In an alternative embodiment, the apparatus 400 further comprises a classifier training module, the classifier training module comprising:
a modality feature output unit, configured to use the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model; and
a classifier training unit, configured to train the feature classifier on at least two modality features, so that the feature classifier classifies the features of different modalities.
In an alternative embodiment, the sample data volumes of the at least two modalities differ.
According to the above technical solution, the combination of single-modal training and cross-modal training preserves face recognition performance within each single modality while reducing the feature distribution differences between modalities; in particular, introducing image smoothing and the cross-modal feature distance between modality features further reduces those differences and improves face recognition accuracy.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 5 is a block diagram of an electronic device for implementing the cross-modal face recognition method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in Fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other ways as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In Fig. 5, one processor 501 is taken as an example.
The memory 502 is a non-transitory computer readable storage medium provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the cross-modal face recognition method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the cross-modal face recognition method provided herein.
The memory 502, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the cross-modal face recognition method in the embodiments of the present application (e.g., the single-modal training module 401, the cross-modal training module 402, and the model determining module 403 shown in Fig. 4). By running the non-transitory software programs, instructions, and modules stored in the memory 502, the processor 501 executes the various functional applications of the server and performs cross-modal face recognition, i.e., implements the cross-modal face recognition method of the above method embodiments.
The memory 502 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created from the use of the electronic device for cross-modal face recognition, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 502 may optionally include memory located remotely from the processor 501, which may be connected via a network to the electronic device for cross-modal face recognition. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the cross-modal face recognition method may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or in other ways; in Fig. 5, connection by a bus is taken as an example.
The input device 503 may receive input numeric or character information and generate key signal inputs related to the user settings and function control of the electronic device for cross-modal face recognition, for example a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, track ball, or joystick. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method of cross-modal face recognition, comprising:
for each of at least two modalities, performing single-modal training on a face recognition model using a sample user's face image data of that modality;
performing cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities, wherein the single-modal and cross-modal training processes share model parameters; and
taking the face recognition model after the single-modal training and the cross-modal training as a cross-modal face recognition model to be used.
2. The method of claim 1, wherein the performing cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities comprises:
interpolating the sample user's face image data of the at least two modalities to obtain processed face image data for each modality; and
performing cross-modal training on the face recognition model using the processed face image data of the at least two modalities.
3. The method of claim 2, wherein the interpolating the sample user's face image data of the at least two modalities to obtain processed face image data for each modality comprises:
generating an interpolation coefficient for the sample user, wherein the mean value of the interpolation coefficients is a fixed value; and
determining the processed face image data of each modality according to the sample user's face image data of that modality, the interpolation coefficient, and the face image data of the other modalities.
4. The method of claim 1, wherein the performing cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities comprises:
during cross-modal training, using the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model;
based on a pre-constructed feature classifier, using at least two modality features as inputs to the feature classifier to obtain the cross-modal feature distance output by the feature classifier; and
training the face recognition model according to the cross-modal feature distance.
5. The method of claim 4, further comprising:
using the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model; and
training the feature classifier on at least two modality features, so that the feature classifier classifies the features of different modalities.
6. The method of any one of claims 1-5, wherein the sample data volumes of the at least two modalities differ.
7. An apparatus for cross-modal face recognition, comprising:
a single-modal training module, configured to perform, for each of at least two modalities, single-modal training on a face recognition model using a sample user's face image data of that modality;
a cross-modal training module, configured to perform cross-modal training on the face recognition model using the sample user's face image data of the at least two modalities, wherein the single-modal and cross-modal training processes share model parameters; and
a model determining module, configured to take the face recognition model after the single-modal training and the cross-modal training as a cross-modal face recognition model to be used.
8. The apparatus of claim 7, wherein the cross-modal training module comprises:
an image interpolation unit, configured to interpolate the sample user's face image data of the at least two modalities to obtain processed face image data for each modality; and
a cross-modal training unit, configured to perform cross-modal training on the face recognition model using the processed face image data of the at least two modalities.
9. The apparatus of claim 8, wherein the image interpolation unit comprises:
an interpolation coefficient subunit, configured to generate an interpolation coefficient for the sample user, wherein the mean value of the interpolation coefficients is a fixed value; and
an image interpolation subunit, configured to determine the processed face image data of each modality according to the sample user's face image data of that modality, the interpolation coefficient, and the face image data of the other modalities.
10. The apparatus of claim 7, wherein the cross-modal training module comprises:
a modality feature determining unit, configured to, during cross-modal training, use the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model;
a modality feature distance unit, configured to, based on a pre-constructed feature classifier, use at least two modality features as inputs to the feature classifier to obtain the cross-modal feature distance output by the feature classifier; and
a model training unit, configured to train the face recognition model according to the cross-modal feature distance.
11. The apparatus of claim 10, further comprising a classifier training module, the classifier training module comprising:
a modality feature output unit, configured to use the face image data of each modality as input to the face recognition model to obtain the modality features output by the face recognition model; and
a classifier training unit, configured to train the feature classifier on at least two modality features, so that the feature classifier classifies the features of different modalities.
12. The apparatus of any one of claims 7-11, wherein the sample data volumes of the at least two modalities differ.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
CN202011027564.6A, filed 2020-09-25 (priority 2020-09-25): Cross-modal face recognition method, device, equipment and storage medium. Status: Active. Granted as CN112016523B.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011027564.6A | 2020-09-25 | 2020-09-25 | Cross-modal face recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011027564.6A | 2020-09-25 | 2020-09-25 | Cross-modal face recognition method, device, equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN112016523A | 2020-12-01
CN112016523B | 2023-08-29

Family

ID=73527394

Family Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011027564.6A | 2020-09-25 | 2020-09-25 | Cross-modal face recognition method, device, equipment and storage medium (Active; granted as CN112016523B)

Country Status (1)

Country | Document
CN | CN112016523B



Patent Citations (9)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Title
CN108664999A * | 2018-05-03 | 2018-10-16 | A kind of training method and its device, computer server of disaggregated model
WO2020037937A1 * | 2018-08-20 | 2020-02-27 | Facial recognition method and apparatus, terminal, and computer readable storage medium
CN111433810A * | 2018-12-04 | 2020-07-17 | Target image acquisition method, shooting device and unmanned aerial vehicle
CN109635728A * | 2018-12-12 | 2019-04-16 | A kind of isomery pedestrian recognition methods again based on asymmetric metric learning
CN110046551A * | 2019-03-18 | 2019-07-23 | A kind of generation method and equipment of human face recognition model
WO2020186886A1 * | 2019-03-18 | 2020-09-24 | Method and device for generating face recognition model
CN110647904A * | 2019-08-01 | 2020-01-03 | Cross-modal retrieval method and system based on unmarked data migration
CN110930367A * | 2019-10-31 | 2020-03-27 | Multi-modal ultrasound image classification method and breast cancer diagnosis device
CN111291740A * | 2020-05-09 | 2020-06-16 | Training method of face recognition model, face recognition method and hardware

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party

Title
Sanae Boutarfass et al.: "Using Visible+NIR information for CNN face recognition", 2018 7th European Workshop on Visual Information Processing (EUVIP)
Liu Fan, Liu Pengyuan, Zhang Junning, Xu Binbin: "RGB-D scene image fusion algorithm based on sparse atom fusion" (基于稀疏原子融合的RGB-D场景图像融合算法), Acta Optica Sinica (光学学报), no. 01
Chen Ying, Chen Huangkang: "Speaker recognition based on multimodal generative adversarial network and triplet loss" (基于多模态生成对抗网络和三元组损失的说话人识别), Journal of Electronics & Information Technology (电子与信息学报), no. 02

Cited By (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN115147679A * | 2022-06-30 | 2022-10-04 | Beijing Baidu Netcom Science and Technology Co Ltd | Multi-modal image recognition method and device and model training method and device
CN115147679B * | 2022-06-30 | 2023-11-14 | Beijing Baidu Netcom Science and Technology Co Ltd | Multi-mode image recognition method and device, model training method and device
CN115578797A * | 2022-09-30 | 2023-01-06 | Beijing Baidu Netcom Science and Technology Co Ltd | Model training method, image recognition device and electronic equipment
CN115578797B * | 2022-09-30 | 2023-08-29 | Beijing Baidu Netcom Science and Technology Co Ltd | Model training method, image recognition device and electronic equipment

Also Published As

Publication number | Publication date
CN112016523B | 2023-08-29

Similar Documents

Publication | Title
CN110991427B (en) Emotion recognition method and device for video and computer equipment
CN111523596A (en) Target recognition model training method, device, equipment and storage medium
CN110659600B (en) Object detection method, device and equipment
CN110705460A (en) Image category identification method and device
CN111968203B (en) Animation driving method, device, electronic equipment and storage medium
CN111860362A (en) Method and device for generating human face image correction model and correcting human face image
CN111783605A (en) Face image recognition method, device, equipment and storage medium
CN111753908A (en) Image classification method and device and style migration model training method and device
CN111709875B (en) Image processing method, device, electronic equipment and storage medium
CN111598164A (en) Method and device for identifying attribute of target object, electronic equipment and storage medium
CN111709873A (en) Training method and device of image conversion model generator
CN112001366A (en) Model training method, face recognition device, face recognition equipment and medium
CN112149741A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN112149634A (en) Training method, device and equipment of image generator and storage medium
CN112016523B (en) Cross-modal face recognition method, device, equipment and storage medium
CN112101552A (en) Method, apparatus, device and storage medium for training a model
CN111523467A (en) Face tracking method and device
CN112016524B (en) Model training method, face recognition device, equipment and medium
CN112382291B (en) Voice interaction processing method and device, electronic equipment and storage medium
CN112561059B (en) Method and apparatus for model distillation
CN111738325A (en) Image recognition method, device, equipment and storage medium
CN112529181A (en) Method and apparatus for model distillation
CN112270303A (en) Image recognition method and device and electronic equipment
CN112464009A (en) Method and device for generating pairing image, electronic equipment and storage medium
CN112560854A (en) Method, apparatus, device and storage medium for processing image

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant