CN115578797B - Model training method, image recognition device and electronic equipment - Google Patents

Info

Publication number
CN115578797B
Authority
CN
China
Prior art keywords
image
model
nir
target
rgb
Prior art date
Legal status
Active
Application number
CN202211209807.7A
Other languages
Chinese (zh)
Other versions
CN115578797A (en)
Inventor
王珂尧
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211209807.7A priority Critical patent/CN115578797B/en
Publication of CN115578797A publication Critical patent/CN115578797A/en
Application granted granted Critical
Publication of CN115578797B publication Critical patent/CN115578797B/en

Classifications

    • G06V40/45 Detection of the body part being alive (G06V40/40 Spoof detection, e.g. liveness detection)
    • G06N3/04 Architecture, e.g. interconnection topology (G06N3/02 Neural networks)
    • G06N3/08 Learning methods (G06N3/02 Neural networks)
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06V10/764 Classification, e.g. of video objects (G06V10/70 Recognition using pattern recognition or machine learning)
    • G06V10/82 Using neural networks (G06V10/70 Recognition using pattern recognition or machine learning)
    • G06V40/172 Classification, e.g. identification (G06V40/16 Human faces)
    • G06T2207/10024 Color image (image acquisition modality)
    • G06T2207/10048 Infrared image (image acquisition modality)
    • G06T2207/20081 Training; Learning (special algorithmic details)
    • G06T2207/20084 Artificial neural networks [ANN] (special algorithmic details)
    • G06T2207/30201 Face (subject of image: human being; person)

Abstract

The disclosure provides a model training method, an image recognition device and an electronic device, and relates to the technical field of artificial intelligence, in particular to the technical fields of image recognition and deep learning. The specific implementation scheme is as follows: acquiring a first RGB image and a first NIR image; inputting the first RGB image into a first model, acquiring a second NIR image output by the first model, and performing self-supervision training on the first model based on the second NIR image and the first NIR image; acquiring the encoder part of the trained first model and taking the encoder part as a second model; and performing two-class living body supervision training on the second model based on a second RGB image and a third NIR image to obtain a target model. The target model is used for recognizing an input image to be detected and outputting an image recognition result of the image to be detected, where the image to be detected is an RGB image or an NIR image.

Description

Model training method, image recognition device and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of image recognition and deep learning, and specifically relates to a model training method, an image recognition device and electronic equipment.
Background
Face living body detection refers to distinguishing whether an image was captured from a real person. It is a basic building block of a face recognition system and safeguards the security of that system, and face living body detection algorithms based on deep learning are the current mainstream approach. At present, face living body detection models usually judge whether an image was captured from a real person based on Red-Green-Blue (RGB) images, and such RGB-based models are highly sensitive to lighting conditions.
Disclosure of Invention
The disclosure provides a model training method, an image recognition device and electronic equipment.
According to an aspect of the present disclosure, there is provided a model training method including:
acquiring a first red, green and blue (RGB) image and a first Near Infrared (NIR) image;
inputting the first RGB image into a first model, acquiring a second NIR image output by the first model, and performing self-supervision training on the first model based on the second NIR image and the first NIR image;
acquiring an encoder part in the first model after training, and taking the encoder part as a second model;
performing two-classification living body supervision training on the second model based on the second RGB image and the third NIR image to obtain a target model;
The target model is used for identifying an input image to be detected, and outputting an image identification result of the image to be detected, wherein the image to be detected is an RGB image or an NIR image.
According to a second aspect of the present disclosure, there is provided an image recognition method including:
acquiring an image to be detected;
inputting the image to be detected into a target model, and acquiring an image recognition result output by the target model;
the target model is a model obtained by performing two-class living body supervision training on a second model based on a second RGB image and a third NIR image, the second model is an encoder part in a first model, the first model is a model obtained by performing self-supervision training based on the second NIR image and a first NIR image, and the second NIR image is an NIR image output by the first model after the first RGB image is input into the first model.
According to a third aspect of the present disclosure, there is provided a model training apparatus comprising:
the first acquisition module is used for acquiring a first red, green and blue (RGB) image and a first Near Infrared (NIR) image;
the first training module is used for inputting the first RGB image into a first model, acquiring a second NIR image output by the first model, and performing self-supervision training on the first model based on the second NIR image and the first NIR image;
A second acquisition module, configured to acquire an encoder portion in the first model after training, and take the encoder portion as a second model;
the second training module is used for performing two-class living body supervision training on the second model based on the second RGB image and the third NIR image to obtain a target model;
the target model is used for identifying an input image to be detected, and outputting an image identification result of the image to be detected, wherein the image to be detected is an RGB image or an NIR image.
According to a fourth aspect of the present disclosure, there is provided an image recognition apparatus including:
the third acquisition module is used for acquiring an image to be detected;
the identification module is used for inputting the image to be detected into a target model and acquiring an image identification result output by the target model;
the target model is a model obtained by performing two-class living body supervision training on a second model based on a second RGB image and a third NIR image, the second model is an encoder part in a first model, the first model is a model obtained by performing self-supervision training based on the second NIR image and a first NIR image, and the second NIR image is an NIR image output by the first model after the first RGB image is input into the first model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
In the embodiments of the disclosure, the first model is self-supervision trained based on the real NIR image and the NIR image generated by the first model, so that the trained first model has higher model precision. Since the second model is the encoder part of the trained first model, it likewise has higher model precision, and the target model obtained by training the second model therefore also has higher model precision, ensuring that the image recognition result output by the target model has higher accuracy.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a model training method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of another model training method provided by an embodiment of the present disclosure;
FIG. 3 is a flowchart of an image recognition method provided by an embodiment of the present disclosure;
FIG. 4 is a block diagram of a model training apparatus provided in an embodiment of the present disclosure;
fig. 5 is a block diagram of an image recognition apparatus provided in an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device used to implement a model training method or an image recognition method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Referring to fig. 1, fig. 1 is a flowchart of a model training method according to an embodiment of the disclosure, where the method may be applied to an electronic device such as a computer, a mobile phone, or the like. As shown in fig. 1, the method comprises the steps of:
step S101, acquiring a first RGB image and a first NIR image.
It should be noted that the first RGB image and the first Near Infrared (NIR) image in the embodiments of the present disclosure may be images containing a human face. The electronic device may obtain the first RGB image and the first NIR image in a plurality of ways; for example, they may be downloaded from a network or captured by the electronic device itself.
The first RGB image and the first NIR image may correspond to the same image scene, that is, the first RGB image and the first NIR image are images of two different modalities formed in the same image scene.
Step S102, inputting the first RGB image into a first model, obtaining a second NIR image output by the first model, and performing self-supervision training on the first model based on the second NIR image and the first NIR image.
The first model is used for generating and outputting an NIR image according to the input RGB image.
In this step, after the first RGB image is input into the first model, the first model can generate and output a corresponding second NIR image according to the first RGB image. It will be appreciated that the first RGB image and the first NIR image are images of two different modalities corresponding to the same image scene, and the second NIR image is generated by the first model from the input first RGB image, and the first model may be self-supervised trained based on the first NIR image and the generated second NIR image.
Specifically, a loss between the first NIR image and the second NIR image may be acquired, and the first model is self-supervised trained based on the loss, so that the NIR image output by the trained first model can be close to the real NIR image, and further, the model accuracy of the trained first model is improved.
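Illustratively, this self-supervision training step may be sketched in code as follows. This is a minimal PyTorch sketch; the toy encoder-decoder architecture, the L1 reconstruction loss and all hyperparameters are assumptions for illustration only, not the implementation prescribed by the disclosure.

```python
import torch
import torch.nn as nn

# Minimal sketch of the first model: an encoder-decoder that maps an RGB
# image to a generated NIR image. The exact architecture is an assumption;
# the disclosure only requires an encoder part and a decoder part.
class FirstModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

model = FirstModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
recon_loss = nn.L1Loss()  # assumed pixel-wise loss; the disclosure does not fix one

first_rgb = torch.rand(8, 3, 224, 224)   # batch of first RGB images
first_nir = torch.rand(8, 1, 224, 224)   # paired real NIR images (same scene)

second_nir = model(first_rgb)             # NIR image generated by the first model
loss = recon_loss(second_nir, first_nir)  # loss between generated and real NIR
optimizer.zero_grad()
loss.backward()
optimizer.step()
```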
Step S103, acquiring the trained encoder part in the first model, and taking the encoder part as a second model.
It should be noted that the first model may be an autoencoder (Auto-Encoder), i.e., a model comprising an encoder part and a decoder part. After the autoencoder is self-supervision trained based on the first NIR image and the generated second NIR image, the encoder part of the trained autoencoder is taken out and used as the second model.
It will be appreciated that the first model is capable of generating and outputting an NIR image based on an input RGB image, while the second model includes only the encoder portion of the first model, and thus the second model cannot perform the same function as the first model.
In the embodiment of the disclosure, the first model is self-supervised and trained based on the first NIR image and the second NIR image, and the trained first model has higher model precision, and then the second model obtained based on the encoder part of the first model has a certain model precision.
Optionally, the encoder part may use a ResNet18 backbone.
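A minimal sketch of taking the encoder part as the second model, assuming a torchvision ResNet18 backbone, is shown below. In the disclosed flow the encoder weights would be loaded from the trained first model; building a fresh backbone here is an assumption for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Illustrative assumption: a ResNet18 trunk serves as the encoder part.
backbone = resnet18(weights=None)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

second_model = encoder  # the second model is just the encoder part
features = second_model(torch.rand(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 512, 7, 7])
```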
Step S104, performing two-class living body supervision training on the second model based on the second RGB image and the third NIR image to obtain a target model.
The target model is used for recognizing an input image to be detected and outputting an image recognition result of the image to be detected. For example, the target model may be applied to face living body detection, that is, the target model recognizes the face in the input image to be detected and outputs an image recognition result indicating whether that face was captured from a real face.
The second RGB image may be different from the first RGB image, and the third NIR image may be different from the first NIR image.
In an embodiment of the disclosure, the second model is subjected to two-class living body supervision training based on the second RGB image and the third NIR image to obtain the target model. The two-class living body supervision training enables the trained target model to recognize an input image to be detected and output an image recognition result, for example, a result indicating whether the face in the image to be detected was captured from a real face.
Optionally, the second RGB image and the third NIR image are taken as inputs to the second model, a loss over the second RGB image and the third NIR image is obtained, and two-class living body supervision training of the second model is implemented based on the loss. For example, the training may use a two-class cross-entropy loss to obtain the trained second model, i.e., the target model. The target model can then perform image recognition on both input RGB images and input NIR images. For example, when applied to face living body detection, the target model can detect the face in an input RGB image or NIR image, i.e., whether that face was captured from a real face. Consequently, the image to be detected that is input into the target model may be either an RGB image or an NIR image, so that images of two different modalities can be detected by one model.
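Illustratively, the two-class living body supervision training on mixed-modality data may be sketched as follows. The global-average-pooling plus fully-connected head follows the third training step described later; replicating the single NIR channel to three channels is an assumption made here so that one backbone accepts both modalities.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Encoder mirrors the previous sketch; head = global average pooling + fc.
encoder = nn.Sequential(*list(resnet18(weights=None).children())[:-2])
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 2))
target_model = nn.Sequential(encoder, head)

criterion = nn.CrossEntropyLoss()  # two-class cross-entropy loss
optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)

second_rgb = torch.rand(8, 3, 224, 224)                    # RGB samples
third_nir = torch.rand(8, 1, 224, 224).repeat(1, 3, 1, 1)  # assumption: NIR replicated to 3 channels
images = torch.cat([second_rgb, third_nir])                # mixed-modality batch
labels = torch.randint(0, 2, (16,))                        # 1 = live face, 0 = spoof

loss = criterion(target_model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```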
According to the scheme provided by the embodiment of the disclosure, a first RGB image is input into a first model, a second NIR image output by the first model is obtained, self-supervision training is performed on the first model based on the second NIR image and the first NIR image, an encoder part in the first model after training is extracted, the encoder part is used as a second model, and then two-class living body supervision training is performed on the second model based on the second RGB image and a third NIR image, so that a target model is obtained, and further the target model can realize image identification of an image to be detected. It will be appreciated that since the target model is trained based on RGB images and NIR images, the image to be detected that is input to the target model may be an RGB image or an NIR image.
In the embodiments of the disclosure, the two-class living body supervision training of the model is performed on mixed-modality image data comprising both RGB images and NIR images, so that the trained model can recognize RGB images and NIR images at the same time. There is no need to train one model for recognizing RGB images and another for recognizing NIR images, which effectively saves model training investment; because a single model can be applied to recognizing both RGB images and NIR images, the memory resources of the computer on which the model is installed are also saved, which is more conducive to the popularization and application of the model.
In addition, in the embodiments of the disclosure, the first model is self-supervision trained based on the real NIR image and the NIR image generated by the first model to obtain the trained first model, so that the trained first model has higher model precision. Since the second model is the encoder part of the trained first model, the second model also has higher model precision, and the target model obtained by training the second model therefore has higher model precision as well, ensuring that the image recognition result output by the target model has higher accuracy.
Optionally, before the step S104, the method further includes:
and respectively carrying out different data enhancement processing on the first RGB image and the second NIR image, and carrying out self-supervision training on a second model based on the first RGB image after the data enhancement processing and the second NIR image after the data enhancement processing.
The second NIR image is generated by the first model based on the input first RGB image, and a corresponding relation exists between the first RGB image and the second NIR image.
In this embodiment of the present disclosure, different data enhancement processes are performed on the first RGB image and the second NIR image, for example, random rotation is used on the first RGB image, and random scaling is used on the second NIR image, so that the data richness of the image input into the second model can be effectively improved, which is more beneficial to improving the model precision of the trained second model.
Further, the second model is self-supervision trained based on the first RGB image and the second NIR image after the data enhancement processing. For example, the first RGB image and the second NIR image after the data enhancement processing are input into the second model, the loss between them is obtained, and the second model is trained based on this loss until convergence, so that the trained second model can have higher model precision. This ensures that the target model obtained based on the second model also has higher model precision, guaranteeing the accuracy of the image recognition result output by the target model.
Optionally, the performing different data enhancement processing on the first RGB image and the second NIR image respectively, performing self-supervision training on a second model based on the first RGB image and the second NIR image after the data enhancement processing, including:
performing first data enhancement processing on the first RGB image to obtain a first image feature;
performing second data enhancement processing on the second NIR image to obtain a second image feature, wherein the first data enhancement processing is different from the second data enhancement processing;
And performing self-supervision training on the second model based on the first image feature and the second image feature.
In this embodiment, the first data enhancement process is different from the second data enhancement process, for example, the first RGB image may be randomly rotated to obtain the first image feature, and the second NIR image may be color perturbed to obtain the second image feature. Optionally, the first data enhancement process and the second data enhancement process may be other data enhancement process modes, which is not specifically limited in this disclosure.
Further, the second model is self-supervision trained based on the first image feature and the second image feature. For example, the first image feature and the second image feature are input into the second model, the loss between them is calculated, and the second model is trained based on this loss until convergence, so that the trained second model has higher model precision.
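A minimal sketch of the two different data enhancement processes, assuming torchvision transforms, is given below. The specific transforms (random rotation for RGB, color perturbation for NIR) follow the examples above, while the exact parameters and the channel-replicated NIR tensor are illustrative assumptions.

```python
import torch
from torchvision import transforms

# First data enhancement for RGB, a different second one for NIR.
first_aug = transforms.RandomRotation(degrees=15)
second_aug = transforms.ColorJitter(brightness=0.2, contrast=0.2)

first_rgb = torch.rand(3, 224, 224)   # stand-in for a preprocessed first RGB image
second_nir = torch.rand(3, 224, 224)  # assumption: generated NIR replicated to 3 channels

rgb_view = first_aug(first_rgb)    # encoded later to give the first image feature
nir_view = second_aug(second_nir)  # encoded later to give the second image feature
```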
Optionally, the self-supervised training of the second model based on the first image feature and the second image feature includes:
inputting the first image feature and the second image feature into the second model;
And performing self-supervision training on the second model based on the contrast learning loss function, wherein the self-supervision training is used for enabling the similarity between the first image feature and the second image feature to be larger than a preset similarity.
It should be noted that, there is a correspondence between the first RGB image and the second NIR image, for example, one first RGB image and one second NIR image are used as a set of sample data of the second model, and the second NIR image in the set of sample data is generated by the first model based on the corresponding first RGB image, that is, the first RGB image and the second NIR image are in two mode forms of the same image scene. The first image feature is obtained after the first RGB image is subjected to the first data enhancement processing, and the second image feature is obtained after the second NIR image is subjected to the second data enhancement processing, and the first RGB image and the second NIR image have a corresponding relationship, so that the first image feature and the second image feature obtained through different data enhancement processing have similarity.
In the embodiments of the disclosure, the first image feature and the second image feature are input into the second model, and the second model is self-supervision trained using a contrastive learning loss function (for example, InfoNCE loss) according to the self-supervision principle, so that the first image feature and the second image feature become similar. In this way, different data enhancement processing is performed on the first RGB image and the second NIR image to obtain their corresponding image features, and the second model is self-supervision trained based on these image features to improve its model precision.
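Illustratively, the contrastive learning loss may be computed as in the following minimal InfoNCE sketch, in which paired RGB/NIR embeddings act as positives and the other samples in the batch act as negatives; the temperature and embedding size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Each RGB embedding is pulled toward its paired NIR embedding, while the
# remaining samples in the batch serve as negatives.
def info_nce(z_rgb, z_nir, temperature=0.07):
    z_rgb = F.normalize(z_rgb, dim=1)
    z_nir = F.normalize(z_nir, dim=1)
    logits = z_rgb @ z_nir.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(z_rgb.size(0))     # the i-th RGB matches the i-th NIR
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```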
After the trained second model is obtained, performing two-classification living body supervision training on the trained second model based on the second RGB image and the third NIR image to obtain a target model, so that the target model has the same model precision as the trained second model, and the output accuracy of the target model is guaranteed.
Optionally, the method may further include:
performing image processing on a target image, wherein the image processing comprises:
detecting key points of the target image;
performing alignment processing on a target object in the target image based on the detected key points, and performing image preprocessing on the target image after the alignment processing;
wherein the target image is at least one of the first RGB image, the second NIR image, and the third NIR image.
For better understanding, the image processing procedure is described below by taking as an example the case where the target image is the first RGB image and the target model is applied to a face living body detection scene.
For example, after the first RGB image containing the face is obtained, the face may be detected by using an RGB detection model, so as to obtain an approximate area of the face in the first RGB image. The detection model may be an existing face detection model, and is used for detecting a face position in an image.
Further, according to the detected face region in the first RGB image, face key point detection may be performed on the face in the first RGB image by using an RGB face key point detection model to obtain coordinate values of the face key points; face alignment may then be performed on the face (i.e., the target object) in the first RGB image based on these coordinate values, and image preprocessing, for example image normalization, may be performed on the first RGB image after the face alignment. The face key point detection model may be an existing model, and the face key points may be, for example, 72 predefined key points whose specific positions are not limited herein. In addition, the face alignment processing may refer to translating, rotating, or otherwise transforming the face so that the face region lies in a specified area; the specific process is not described in detail in the embodiments of the disclosure.
In the embodiment of the disclosure, the target object in the first RGB image is further enabled to be in a specific area by performing alignment processing on the target object in the first RGB image, so that the recognition and processing of the model on the first RGB image are facilitated.
The second RGB image, the second NIR image and the third NIR image may likewise undergo the above image processing before being input into their corresponding models, so as to further facilitate the models' recognition and processing of the images and improve the recognition and processing accuracy of the models.
Optionally, the aligning the target object in the target image based on the detected key point includes:
acquiring a first coordinate corresponding to the key point and a second coordinate corresponding to the key point of the reference object;
and determining an affine transformation matrix based on the first coordinates and the second coordinates, and performing alignment processing on a target object in the target image based on the affine transformation matrix.
For better understanding, the following takes the target image as the first RGB image and the target object as the face in the first RGB image.
Illustratively, after the face key points in the first RGB image are obtained based on the face key point detection model, the first coordinates of the face key points, that is, the coordinates of the face key points in the first RGB image, are obtained. For example, the first coordinates may be determined with the upper left corner of the first RGB image as the origin of coordinates. If the number of face key points is 72, the first coordinates are (x1, y1), ..., (x72, y72). In addition, the second coordinates corresponding to the reference face key points are obtained; similarly, these are (x1', y1'), ..., (x72', y72'). The 72 face key points in the first RGB image have a one-to-one correspondence with the 72 reference face key points. An affine transformation matrix can then be calculated from the 72 face key points in the first RGB image and the 72 reference face key points, and the first coordinates of the 72 face key points can be remapped to new coordinates according to the affine transformation matrix, thereby achieving face alignment of the face in the first RGB image.
The second RGB image, the second NIR image, and the third NIR image may also be aligned in the above manner.
In the embodiment of the disclosure, the affine transformation matrix is determined through the first coordinates of the key points of the target object and the second coordinates of the key points of the reference object, so that alignment processing of the target object is realized, further recognition and processing of the model on the image are facilitated, and the recognition accuracy of the model is improved.
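A minimal sketch of the keypoint-based alignment, assuming OpenCV, is given below; the random keypoint coordinates stand in for the outputs of the keypoint detection model and the reference object.

```python
import cv2
import numpy as np

# Estimate an affine transform mapping the detected first coordinates onto
# the reference second coordinates, then warp the image accordingly.
first_coords = np.random.rand(72, 2).astype(np.float32) * 224   # (x1, y1) ... (x72, y72)
second_coords = np.random.rand(72, 2).astype(np.float32) * 224  # reference keypoints

matrix, _ = cv2.estimateAffinePartial2D(first_coords, second_coords)
image = np.zeros((224, 224, 3), dtype=np.uint8)       # stand-in target image
aligned = cv2.warpAffine(image, matrix, (224, 224))   # face-aligned image
```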
Optionally, the image preprocessing of the target image after the alignment processing includes:
And carrying out pixel value preprocessing on each pixel in the target image after the alignment processing so that the pixel value of each pixel after the preprocessing is positioned in a preset threshold value interval.
The image preprocessing may be, for example, image normalization. For example, 128 is subtracted from the pixel value of each pixel in the target image after the alignment processing and the result is divided by 256, so that each pixel value lies in [-0.5, 0.5]. This ensures that the pixel value of each pixel in the target image lies within the preset threshold interval, which is more conducive to the model's recognition of the target image and avoids recognition errors caused by over-bright or over-dark pixel values.
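Illustratively, this normalization may be sketched as follows:

```python
import numpy as np

# Subtract 128 and divide by 256 so every pixel value lies in [-0.5, 0.5].
def normalize(image: np.ndarray) -> np.ndarray:
    return (image.astype(np.float32) - 128.0) / 256.0

pixels = np.array([0, 128, 255], dtype=np.uint8)
print(normalize(pixels))  # [-0.5, 0.0, 0.49609375]
```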
Furthermore, before the normalized target image is input into the model, random data enhancement processing such as random rotation, random scaling, color disturbance and the like can be performed on the normalized target image, so that the richness of the image data is improved, the model is helped to learn more image features, and model training is helped.
Referring to fig. 2, fig. 2 is a flowchart of another model training method according to an embodiment of the disclosure, as shown in fig. 2, the method includes three steps, specifically:
The first step: face detection is performed on an RGB face image and face alignment is performed on the detected face; image preprocessing is then performed on the aligned RGB image, and the preprocessed RGB image is input into a codec (encoder-decoder). The NIR face image generated by the codec is obtained, the loss between the real NIR face image and the generated NIR face image is computed, and the codec is trained based on this loss.
The second step: after the RGB face image undergoes face detection, face alignment, image preprocessing and the first data enhancement processing, it is input into an encoder, which is the encoder part of the codec; at the same time, after the NIR face image generated in the first step undergoes face detection, face alignment, image preprocessing and the second data enhancement processing, it is input into the same encoder. The contrastive learning loss between the input RGB image and the generated NIR image is obtained, and the encoder is trained based on this loss.
The third step: mixed-modality data comprising RGB face images and NIR face images undergoes face detection, face alignment, image preprocessing and data enhancement processing and is input into the trained encoder; the encoder's output is processed by a global average pooling layer and a fully connected layer to obtain a two-class cross-entropy loss, and the encoder is trained based on this loss.
It should be noted that, the specific implementation process and the related concept of the embodiment of the disclosure may refer to the specific description in the embodiment of fig. 1, and this embodiment can achieve the same technical effects as the embodiment of fig. 1, so that the repetition is avoided and will not be repeated here.
Referring to fig. 3, fig. 3 is a flowchart of an image recognition method according to an embodiment of the disclosure, as shown in fig. 3, the method includes the following steps:
step S301, obtaining an image to be detected;
step S302, inputting the image to be detected into a target model, and acquiring an image recognition result output by the target model.
The target model is a model obtained by performing two-class living body supervision training on a second model based on a second RGB image and a third NIR image, the second model is an encoder part in a first model, the first model is a model obtained by performing self-supervision training based on the second NIR image and a first NIR image, and the second NIR image is an NIR image output by the first model after the first RGB image is input into the first model. That is, the target model is a model trained based on the model training method.
The target model may be applied to face living body detection: after the input image to be detected is obtained, the target model recognizes the face in the image to be detected, identifies whether that face was captured from a real person, and outputs an image recognition result. It may be understood that the image recognition result may take a textual form indicating whether the face in the image to be detected was captured from a real person, or of course other possible forms, which the present disclosure does not specifically limit.
The target model may be used for other image recognition scenes, for example, for recognizing whether the image to be detected includes a target face, etc.
In the embodiment of the disclosure, the target model is obtained by training based on the model training method, so that the target model can identify the RGB image and the NIR image, and further can identify and detect two different mode images through one model, the application range of the target model is effectively improved, and the memory of the electronic equipment for installing the target model is effectively saved.
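Illustratively, inference with the target model may be sketched as follows; the model construction mirrors the training sketches above and is an assumption for illustration, not the architecture prescribed by the disclosure.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# The single target model accepts either a preprocessed RGB image or a
# preprocessed (channel-replicated) NIR image and outputs a live/spoof result.
encoder = nn.Sequential(*list(resnet18(weights=None).children())[:-2])
target_model = nn.Sequential(encoder, nn.AdaptiveAvgPool2d(1),
                             nn.Flatten(), nn.Linear(512, 2)).eval()

image_to_detect = torch.rand(1, 3, 224, 224)  # RGB or channel-replicated NIR input
with torch.no_grad():
    probs = target_model(image_to_detect).softmax(dim=1)
result = "real person" if probs[0, 1] > probs[0, 0] else "spoof"
print(result, probs.tolist())
```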
Referring to fig. 4, fig. 4 is a block diagram of a model training apparatus according to an embodiment of the disclosure, and as shown in fig. 4, a model training apparatus 400 includes:
a first acquisition module 401 for acquiring a first RGB image and a first NIR image;
a first training module 402, configured to input the first RGB image into a first model, obtain a second NIR image output by the first model, and perform self-supervision training on the first model based on the second NIR image and the first NIR image;
a second obtaining module 403, configured to obtain the trained encoder part in the first model, and take the encoder part as a second model;
A second training module 404, configured to perform two-class living body supervision training on the second model based on the second RGB image and the third NIR image, to obtain a target model;
the target model is used for identifying an input image to be detected, and outputting an image identification result of the image to be detected, wherein the image to be detected is an RGB image or an NIR image.
Optionally, the apparatus further comprises:
and the third training module is used for respectively carrying out different data enhancement processing on the first RGB image and the second NIR image, and carrying out self-supervision training on a second model based on the first RGB image after the data enhancement processing and the second NIR image after the data enhancement processing.
Optionally, the third training module is further configured to:
performing first data enhancement processing on the first RGB image to obtain a first image feature;
performing second data enhancement processing on the second NIR image to obtain a second image feature, wherein the first data enhancement processing is different from the second data enhancement processing;
and performing self-supervision training on the second model based on the first image feature and the second image feature.
Optionally, the third training module is further configured to:
Inputting the first image feature and the second image feature into the second model;
and performing self-supervision training on the second model based on the contrast learning loss function, wherein the self-supervision training is used for enabling the similarity between the first image feature and the second image feature to be larger than a preset similarity.
Optionally, the apparatus further comprises:
the image processing module is used for performing image processing on the target image, wherein the image processing module is specifically used for:
detecting key points of the target image;
performing alignment processing on a target object in the target image based on the detected key points, and performing image preprocessing on the target image after the alignment processing;
wherein the target image is at least one of the first RGB image, the second NIR image, and the third NIR image.
Optionally, the image processing module is further configured to:
acquiring a first coordinate corresponding to the key point and a second coordinate corresponding to the key point of the reference object;
and determining an affine transformation matrix based on the first coordinates and the second coordinates, and performing alignment processing on a target object in the target image based on the affine transformation matrix.
Optionally, the image processing module is further configured to:
and carrying out pixel value preprocessing on each pixel in the target image after the alignment processing so that the pixel value of each pixel after the preprocessing is positioned in a preset threshold value interval.
It should be noted that, the model training apparatus 400 provided in the embodiment of the present disclosure can implement all the technical processes of the model training method described in fig. 1 and achieve the same technical effects, and is not repeated here.
Referring to fig. 5, fig. 5 is a block diagram of an image recognition apparatus according to an embodiment of the disclosure, and as shown in fig. 5, an image recognition apparatus 500 includes:
a third acquiring module 501, configured to acquire an image to be detected;
the recognition module 502 is configured to input the image to be detected into a target model, and obtain an image recognition result output by the target model;
the target model is a model obtained by performing two-class living body supervision training on a second model based on a second RGB image and a third NIR image, the second model is an encoder part in a first model, the first model is a model obtained by performing self-supervision training based on the second NIR image and a first NIR image, and the second NIR image is an NIR image output by the first model after the first RGB image is input into the first model.
It should be noted that, the image recognition device 500 provided in the embodiment of the present disclosure can implement all the technical processes of the image recognition method described in fig. 3 and achieve the same technical effects, and for avoiding repetition, the description is omitted herein.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 801 performs the respective methods and processes described above, such as the model training method or the image recognition method described above. For example, in some embodiments, the model training method or the image recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the model training method or the image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a model training method or an image recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A model training method, comprising:
acquiring a first RGB image and a first NIR image, wherein the first RGB image and the first NIR image correspond to the same image scene;
inputting the first RGB image into a first model, acquiring a second NIR image output by the first model, and performing self-supervision training on the first model based on the second NIR image and the first NIR image;
acquiring an encoder part in the first model after training, and taking the encoder part as a second model;
Performing two-classification living body supervision training on the second model based on the second RGB image and the third NIR image to obtain a target model;
the target model is used for identifying an input image to be detected, and outputting an image identification result of the image to be detected, wherein the image to be detected is an RGB image or an NIR image.
2. The method of claim 1, wherein before the performing two-class living body supervision training on the second model based on the second RGB image and the third NIR image to obtain the target model, the method further comprises:
and respectively carrying out different data enhancement processing on the first RGB image and the second NIR image, and carrying out self-supervision training on a second model based on the first RGB image after the data enhancement processing and the second NIR image after the data enhancement processing.
3. The method of claim 2, wherein the performing different data enhancement processing on the first RGB image and the second NIR image, respectively, and performing self-supervised training on a second model based on the data enhanced first RGB image and the second NIR image, comprises:
performing first data enhancement processing on the first RGB image to obtain a first image feature;
Performing second data enhancement processing on the second NIR image to obtain a second image feature, wherein the first data enhancement processing is different from the second data enhancement processing;
and performing self-supervision training on the second model based on the first image feature and the second image feature.
4. A method according to claim 3, wherein the self-supervised training of the second model based on the first and second image features comprises:
inputting the first image feature and the second image feature into the second model;
and performing self-supervision training on the second model based on the contrast learning loss function, wherein the self-supervision training is used for enabling the similarity between the first image feature and the second image feature to be larger than a preset similarity.
5. The method of claim 1, further comprising:
performing image processing on a target image, wherein the image processing comprises:
detecting key points of the target image;
performing alignment processing on a target object in the target image based on the detected key points, and performing image preprocessing on the target image after the alignment processing;
Wherein the target image is at least one of the first RGB image, the second NIR image, and the third NIR image.
6. The method of claim 5, wherein the performing alignment processing on the target object in the target image based on the detected key points comprises:
acquiring first coordinates corresponding to the detected key points and second coordinates corresponding to key points of a reference object; and
determining an affine transformation matrix based on the first coordinates and the second coordinates, and performing the alignment processing on the target object in the target image based on the affine transformation matrix.
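
A sketch of the keypoint-based alignment of claims 5-6, using OpenCV; the five-point reference template and the 112x112 output size are common choices for face crops and are assumed here, not specified by the patent:

import cv2
import numpy as np

# "Second coordinates": key points of a reference object (an assumed
# standard 5-point face template for a 112x112 crop).
REFERENCE_POINTS = np.float32([
    [38.3, 51.7], [73.5, 51.5],   # left and right eye centers
    [56.0, 71.7],                 # nose tip
    [41.5, 92.4], [70.7, 92.2],   # left and right mouth corners
])

def align(image, detected_points):
    """Estimate an affine transform from the detected key points ("first
    coordinates") to the reference template, then warp the image with it."""
    matrix, _ = cv2.estimateAffinePartial2D(
        np.float32(detected_points), REFERENCE_POINTS)
    return cv2.warpAffine(image, matrix, (112, 112))
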
7. The method of claim 5, wherein the performing image preprocessing on the target image after the alignment processing comprises:
performing pixel-value preprocessing on each pixel in the aligned target image, so that the pixel value of each pixel after the preprocessing falls within a preset threshold interval.
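
The pixel-value preprocessing of claim 7 is typically a normalization into a fixed interval; a one-line sketch, with the interval [-0.5, 0.5) assumed for illustration:

import numpy as np

def normalize_pixels(aligned_image):
    """Map 8-bit pixel values from [0, 255] into roughly [-0.5, 0.5)."""
    return (aligned_image.astype(np.float32) - 128.0) / 256.0
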
8. An image recognition method, comprising:
acquiring an image to be detected;
inputting the image to be detected into a target model, and acquiring an image recognition result output by the target model;
wherein the target model is a model obtained by performing two-class liveness supervised training on a second model based on a second RGB image and a third NIR image; the second model is an encoder part of a first model; the first model is a model obtained by self-supervised training based on a second NIR image and a first NIR image; the second NIR image is the NIR image output by the first model after a first RGB image is input into the first model; and the first RGB image and the first NIR image correspond to the same image scene.
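
Inference per claim 8 reduces to a single forward pass through the fine-tuned target model; a sketch, with the class-index convention assumed rather than given by the patent:

import torch

@torch.no_grad()
def recognize(target_model, image_to_detect):
    """image_to_detect: a preprocessed RGB or NIR tensor of shape (1, C, H, W)."""
    target_model.eval()
    logits = target_model(image_to_detect)
    # Two-class liveness output; index 1 = live, index 0 = spoof (assumed order).
    return logits.softmax(dim=1)[0, 1].item()
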
9. A model training apparatus, comprising:
a first acquisition module configured to acquire a first RGB image and a first NIR image, wherein the first RGB image and the first NIR image correspond to the same image scene;
a first training module configured to input the first RGB image into a first model, acquire a second NIR image output by the first model, and perform self-supervised training of the first model based on the second NIR image and the first NIR image;
a second acquisition module configured to acquire an encoder part of the trained first model and take the encoder part as a second model; and
a second training module configured to perform two-class liveness supervised training on the second model based on a second RGB image and a third NIR image to obtain a target model;
wherein the target model is used for recognizing an input image to be detected and outputting an image recognition result of the image to be detected, and the image to be detected is an RGB image or an NIR image.
10. The apparatus of claim 9, further comprising:
a third training module configured to perform different data augmentation processing on the first RGB image and the second NIR image respectively, and to perform self-supervised training of the second model based on the data-augmented first RGB image and the data-augmented second NIR image.
11. The apparatus of claim 10, wherein the third training module is further configured to:
perform first data augmentation processing on the first RGB image to obtain a first image feature;
perform second data augmentation processing on the second NIR image to obtain a second image feature, wherein the first data augmentation processing is different from the second data augmentation processing; and
perform self-supervised training of the second model based on the first image feature and the second image feature.
12. The apparatus of claim 11, wherein the third training module is further configured to:
input the first image feature and the second image feature into the second model; and
perform self-supervised training of the second model based on a contrastive learning loss function, wherein the self-supervised training is used for making the similarity between the first image feature and the second image feature greater than a preset similarity.
13. The apparatus of claim 9, further comprising:
an image processing module configured to perform image processing on a target image, wherein the image processing module is specifically configured to:
detect key points of the target image; and
perform alignment processing on a target object in the target image based on the detected key points, and perform image preprocessing on the aligned target image;
wherein the target image is at least one of the first RGB image, the second NIR image, and the third NIR image.
14. The apparatus of claim 13, wherein the image processing module is further configured to:
acquire first coordinates corresponding to the detected key points and second coordinates corresponding to key points of a reference object; and
determine an affine transformation matrix based on the first coordinates and the second coordinates, and perform the alignment processing on the target object in the target image based on the affine transformation matrix.
15. The apparatus of claim 13, wherein the image processing module is further configured to:
perform pixel-value preprocessing on each pixel in the aligned target image, so that the pixel value of each pixel after the preprocessing falls within a preset threshold interval.
16. An image recognition apparatus, comprising:
a third acquisition module configured to acquire an image to be detected; and
a recognition module configured to input the image to be detected into a target model and acquire an image recognition result output by the target model;
wherein the target model is a model obtained by performing two-class liveness supervised training on a second model based on a second RGB image and a third NIR image; the second model is an encoder part of a first model; the first model is a model obtained by self-supervised training based on a second NIR image and a first NIR image; the second NIR image is the NIR image output by the first model after a first RGB image is input into the first model; and the first RGB image and the first NIR image correspond to the same image scene.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202211209807.7A 2022-09-30 2022-09-30 Model training method, image recognition device and electronic equipment Active CN115578797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211209807.7A CN115578797B (en) 2022-09-30 2022-09-30 Model training method, image recognition device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115578797A CN115578797A (en) 2023-01-06
CN115578797B (en) 2023-08-29

Family

ID=84582295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211209807.7A Active CN115578797B (en) 2022-09-30 2022-09-30 Model training method, image recognition device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115578797B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220114767A1 (en) * 2020-10-08 2022-04-14 Mirrori Co Ltd. Deep example-based facial makeup transfer system
CN112581463B (en) * 2020-12-25 2024-02-27 北京百度网讯科技有限公司 Image defect detection method and device, electronic equipment, storage medium and product

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020182710A1 (en) * 2019-03-12 2020-09-17 F. Hoffmann-La Roche Ag Multiple instance learner for prognostic tissue pattern identification
CN110633691A (en) * 2019-09-25 2019-12-31 北京紫睛科技有限公司 Binocular in-vivo detection method based on visible light and near-infrared camera
WO2021249053A1 (en) * 2020-06-12 2021-12-16 Oppo广东移动通信有限公司 Image processing method and related apparatus
CN114119378A (en) * 2020-08-31 2022-03-01 华为技术有限公司 Image fusion method, and training method and device of image fusion model
WO2022052475A1 (en) * 2020-09-14 2022-03-17 上海商汤智能科技有限公司 Image capture processing method, apparatus and device, storage medium, and program product
CN112016523A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112149634A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Training method, device and equipment of image generator and storage medium
CN113033465A (en) * 2021-04-13 2021-06-25 北京百度网讯科技有限公司 Living body detection model training method, device, equipment and storage medium
CN113177449A (en) * 2021-04-20 2021-07-27 北京百度网讯科技有限公司 Face recognition method and device, computer equipment and storage medium
CN113177451A (en) * 2021-04-21 2021-07-27 北京百度网讯科技有限公司 Training method and device of image processing model, electronic equipment and storage medium
CN112990374A (en) * 2021-04-28 2021-06-18 平安科技(深圳)有限公司 Image classification method, device, electronic equipment and medium
CN113435408A (en) * 2021-07-21 2021-09-24 北京百度网讯科技有限公司 Face living body detection method and device, electronic equipment and storage medium
JP2022133378A (en) * 2021-07-21 2022-09-13 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Face biological detection method, device, electronic apparatus, and storage medium
CN113901998A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and detection method
CN114022900A (en) * 2021-10-29 2022-02-08 北京百度网讯科技有限公司 Training method, detection method, device, equipment and medium for detection model
CN113989908A (en) * 2021-11-29 2022-01-28 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for identifying face image
CN114360074A (en) * 2022-01-10 2022-04-15 北京百度网讯科技有限公司 Training method of detection model, living body detection method, apparatus, device and medium
CN114419371A (en) * 2022-01-11 2022-04-29 众安在线财产保险股份有限公司 Image classification recognition model training method and device and computer equipment
CN114418069A (en) * 2022-01-19 2022-04-29 腾讯科技(深圳)有限公司 Method and device for training encoder and storage medium
CN114511758A (en) * 2022-01-28 2022-05-17 北京百度网讯科技有限公司 Image recognition method and device, electronic device and medium
CN114613017A (en) * 2022-03-22 2022-06-10 北京市商汤科技开发有限公司 Living body detection method and related equipment
CN114842066A (en) * 2022-05-13 2022-08-02 北京百度网讯科技有限公司 Image depth recognition model training method, image depth recognition method and device
CN115116111A (en) * 2022-06-24 2022-09-27 北京百度网讯科技有限公司 Anti-disturbance human face living body detection model training method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Robust Detection Algorithms for Face Spoofing Attacks; Jia Shan; China Doctoral Dissertations Full-text Database, Information Science and Technology (No. 6); pp. I138-42 *

Also Published As

Publication number Publication date
CN115578797A (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN113343826B (en) Training method of human face living body detection model, human face living body detection method and human face living body detection device
CN114186632B (en) Method, device, equipment and storage medium for training key point detection model
CN112949767B (en) Sample image increment, image detection model training and image detection method
CN112861885B (en) Image recognition method, device, electronic equipment and storage medium
CN113436100B (en) Method, apparatus, device, medium, and article for repairing video
CN113657289B (en) Training method and device of threshold estimation model and electronic equipment
CN113177449B (en) Face recognition method, device, computer equipment and storage medium
CN113221771B (en) Living body face recognition method, device, apparatus, storage medium and program product
CN113657395B (en) Text recognition method, training method and device for visual feature extraction model
CN113378712B (en) Training method of object detection model, image detection method and device thereof
CN113435408A (en) Face living body detection method and device, electronic equipment and storage medium
CN113705362B (en) Training method and device of image detection model, electronic equipment and storage medium
CN111862030B (en) Face synthetic image detection method and device, electronic equipment and storage medium
CN115578797B (en) Model training method, image recognition device and electronic equipment
CN114494782B (en) Image processing method, model training method, related device and electronic equipment
CN116402820A (en) Detection method, detection device, detection equipment and storage medium
CN116052288A (en) Living body detection model training method, living body detection device and electronic equipment
CN114549904B (en) Visual processing and model training method, device, storage medium and program product
CN113255512B (en) Method, apparatus, device and storage medium for living body identification
CN114973333A (en) Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN114677566A (en) Deep learning model training method, object recognition method and device
CN115205939B (en) Training method and device for human face living body detection model, electronic equipment and storage medium
CN114494818B (en) Image processing method, model training method, related device and electronic equipment
CN117437624B (en) Contraband detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant