CN110674748B - Image data processing method, apparatus, computer device, and readable storage medium - Google Patents


Info

Publication number
CN110674748B
Authority
CN
China
Prior art keywords: image, feature, sample, facial, acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910907706.9A
Other languages
Chinese (zh)
Other versions
CN110674748A (en)
Inventor
沈鹏程
李绍欣
吴佳祥
邰颖
李季檩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910907706.9A
Publication of CN110674748A
Application granted
Publication of CN110674748B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of this application provides an image data processing method, an image data processing apparatus, and a computer readable storage medium, belonging to the field of artificial intelligence. The method includes: acquiring a source image, and acquiring a first facial feature corresponding to the source image based on a recognition model; acquiring a sign state condition, acquiring a predicted image of the source image under the sign state condition based on a generation model, and acquiring a second facial feature corresponding to the predicted image based on the recognition model; performing feature fusion of the first facial feature and the second facial feature in the same dimensions to obtain a retrieval feature; and performing feature matching in the same dimensions between the retrieval feature and third facial features corresponding to the facial images contained in a facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result. By adopting the embodiments of this application, the accuracy of image recognition can be improved.

Description

Image data processing method, apparatus, computer device, and readable storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to an image data processing method, an image data processing device, and a computer readable storage medium.
Background
With the continuous development of face recognition technology, its fields of application are increasingly numerous. For example, in public-interest searches for missing persons, face recognition technology can be used to determine, from the household-registration photos stored in a related system, the photo that matches an abducted person, so as to find that person.
However, most abducted persons are abducted and trafficked in childhood, and their faces change greatly as the time since abduction increases. When a childhood photo is used for face recognition, the matching degree between the childhood photo and the household-registration photos is therefore too low; that is, it is difficult to find the household-registration photo that matches the childhood photo, so the accuracy of image recognition is too low.
Disclosure of Invention
The embodiment of the application provides an image data processing method, an image data processing device and a computer readable storage medium, which can improve the accuracy of image recognition.
In one aspect, an embodiment of the present application provides an image data processing method, including:
acquiring a source image, and acquiring a first facial feature corresponding to the source image based on an identification model;
acquiring a sign state condition, acquiring a predicted image of the source image under the sign state condition based on a generation model, and acquiring a second facial feature corresponding to the predicted image based on the identification model;
Carrying out feature fusion on the first facial feature and the second facial feature in the same dimension to obtain retrieval features;
and carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
The acquiring the source image, acquiring the first facial feature corresponding to the source image based on the recognition model, includes:
acquiring the source image, carrying out face detection on the source image, and determining a target face area as a preprocessing image corresponding to the source image when the existence of the target face area in the source image is detected;
inputting the preprocessed image into a recognition model, and extracting the first facial feature from the preprocessed image based on the recognition model.
The acquiring the source image, performing face detection on the source image, and determining the target face area as a preprocessed image corresponding to the source image when the target face area is detected to exist in the source image includes:
acquiring the source image, generating a candidate region corresponding to the source image, and performing face detection on the candidate region;
If the face contour exists in the candidate area, determining the candidate area with the face contour as a candidate face area;
and carrying out regression correction on the candidate face area to obtain the target face area, and determining the target face area as a preprocessed image corresponding to the source image.
The obtaining the sign state condition, obtaining a predicted image of the source image under the sign state condition based on a generating model, and obtaining a second facial feature corresponding to the predicted image based on the identifying model includes:
inputting the source image into an encoder of a generation model, and acquiring facial identification features corresponding to the source image based on the encoder;
acquiring the physical sign state conditions, and splicing the state features corresponding to the physical sign state conditions with the facial identification features to obtain joint features;
inputting the joint features to a decoder of the generation model, and generating a predicted image of the source image under the physical sign state condition based on the decoder;
inputting the predicted image into the recognition model, and extracting the second facial feature from the predicted image based on the recognition model.
The step of fusing the first facial feature and the second facial feature in the same dimension to obtain a retrieval feature includes:
acquiring a first weight parameter corresponding to the first facial feature and a second weight parameter corresponding to the second facial feature;
and carrying out feature fusion on the first facial feature and the second facial feature on the same dimension based on the first weight parameter and the second weight parameter to obtain retrieval features.
The step of matching the retrieval feature with a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result comprises the following steps:
based on the recognition model, acquiring third facial features corresponding to each facial image in the facial image library;
performing feature matching on the same dimension on the retrieval feature and the third facial feature to obtain matching values of the retrieval feature and the third facial feature in each dimension, and determining the matching result based on the matching values corresponding to each dimension;
And sorting each face image based on the matching result, and selecting the target image from the sorted face images according to the sorting order.
Wherein the dimensions include a key dimension and a conventional dimension; the matching values corresponding to the dimensions comprise key matching values and conventional matching values;
and performing feature matching on the same dimension on the search feature and the third facial feature to obtain matching values of the search feature and the third facial feature in each dimension, and determining the matching result based on the matching values corresponding to each dimension, including:
acquiring key matching values of the retrieval feature and the third facial feature in the same key dimension, and acquiring conventional matching values of the retrieval feature and the third facial feature in the same conventional dimension;
acquiring a first matching weight corresponding to the key dimension and a second matching weight corresponding to the conventional dimension;
and determining the matching result according to the first matching weight, the key matching value, the second matching weight and the conventional matching value.
Wherein the method further comprises:
determining relative object information associated with the source image, and acquiring relative biological characteristics corresponding to the relative object information;
And acquiring a target biological feature associated with the target image, and determining that the source image and the target image have the same identity information when the target biological feature and the related biological feature have an association relationship.
Wherein the method further comprises:
acquiring a first sample data set and a second sample data set; the first sample data set comprises a corresponding relation between a first sample image and first identity tag information, and the second sample data set comprises a corresponding relation between a second sample image, second identity tag information and sample sign states;
determining a first similarity curve corresponding to the first sample data set and a second similarity curve corresponding to the second sample data set based on an initial recognition model;
and correcting the network parameters of the initial recognition model according to the first similarity curve and the second similarity curve, and determining the initial recognition model containing the corrected network parameters as the recognition model.
Wherein the determining, based on the initial recognition model, a first similarity curve corresponding to the first sample data set and a second similarity curve corresponding to the second sample data set includes:
Pairing and combining the first sample images contained in the first sample data set to obtain a first sample pair set, and pairing and combining the second sample images contained in the second sample data set to obtain a second sample pair set;
acquiring a first similarity between sample images contained in each first sample pair in the first sample pair set and a second similarity between sample images contained in each second sample pair in the second sample pair set based on the initial recognition model;
determining the first similarity curve corresponding to the first sample data set based on the first similarity, and determining the second similarity curve corresponding to the second sample data set based on the second similarity.
Wherein the correcting the network parameters of the initial recognition model according to the first similarity curve and the second similarity curve, and determining the initial recognition model including the corrected network parameters as the recognition model includes:
determining a loss function corresponding to the initial recognition model based on the first similarity curve and the second similarity curve;
And correcting the network parameters of the initial recognition model according to the loss function, and determining the initial recognition model containing the corrected network parameters as the recognition model.
Wherein the method further comprises:
acquiring a second sample image contained in the second sample data set, and acquiring sample identification features corresponding to the second sample image based on an initial encoder in an initial generation model;
acquiring a sample sign state condition, inputting the sample sign state condition and the sample identification feature into an initial decoder of the initial generation model, and generating a sample prediction image of the second sample image under the sample sign state condition based on the initial decoder;
and training the initial generation model based on the second sample image and the sample prediction image, and determining the trained initial generation model as the generation model.
An aspect of an embodiment of the present application provides an image data processing apparatus, including:
the first acquisition module is used for acquiring a source image and acquiring a first facial feature corresponding to the source image based on the identification model;
the second acquisition module is used for acquiring physical sign state conditions, acquiring a predicted image of the source image under the physical sign state conditions based on a generation model and acquiring a second facial feature corresponding to the predicted image based on the identification model;
The feature fusion module is used for carrying out feature fusion on the first facial feature and the second facial feature in the same dimension to obtain retrieval features;
and the retrieval module is used for carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
Wherein, the first acquisition module includes:
the preprocessing unit is used for acquiring the source image, carrying out face detection on the source image, and determining the target face area as a preprocessed image corresponding to the source image when the target face area exists in the source image;
and the first feature extraction unit is used for inputting the preprocessed image into a recognition model, and extracting the first facial feature from the preprocessed image based on the recognition model.
Wherein the preprocessing unit includes:
a face detection subunit, configured to acquire the source image, generate a candidate region corresponding to the source image, and perform face detection on the candidate region;
a candidate region determination subunit, configured to determine, if a face contour is detected to exist in the candidate region, the candidate region in which the face contour exists as a candidate face region;
And the target area determining subunit is used for carrying out regression correction on the candidate face area to obtain the target face area, and determining the target face area as a preprocessed image corresponding to the source image.
Wherein the second acquisition module includes:
the encoding unit is used for inputting the source image into an encoder of the generation model, and acquiring facial discrimination characteristics corresponding to the source image based on the encoder;
the splicing unit is used for acquiring the physical sign state conditions, and splicing the state features corresponding to the physical sign state conditions with the facial identification features to obtain joint features;
a decoding unit for inputting the joint feature to a decoder of the generation model, and generating a predicted image of the source image under the condition of the sign state based on the decoder;
and a second feature extraction unit configured to input the predicted image into the recognition model, and extract the second facial feature from the predicted image based on the recognition model.
Wherein, the feature fusion module includes:
the parameter acquisition unit is used for acquiring a first weight parameter corresponding to the first facial feature and a second weight parameter corresponding to the second facial feature;
And the retrieval feature determining unit is used for carrying out feature fusion on the first facial feature and the second facial feature in the same dimension based on the first weight parameter and the second weight parameter to obtain retrieval features.
Wherein, the retrieval module includes:
a library feature acquisition unit, configured to acquire third facial features corresponding to each facial image in the facial image library, based on the recognition model;
the matching result determining unit is used for performing feature matching on the same dimension on the retrieval feature and the third facial feature to obtain matching values of the retrieval feature and the third facial feature in each dimension, and determining the matching result based on the matching values corresponding to each dimension;
and the target image selection unit is used for sorting each face image based on the matching result and selecting the target images from the sorted face images according to the sorting order.
Wherein the dimensions include a key dimension and a conventional dimension; the matching values corresponding to the dimensions comprise key matching values and conventional matching values;
the matching result determining unit is specifically configured to:
acquiring key matching values of the retrieval feature and the third facial feature in the same key dimension, and acquiring conventional matching values of the retrieval feature and the third facial feature in the same conventional dimension;
Acquiring a first matching weight corresponding to the key dimension and a second matching weight corresponding to the conventional dimension;
and determining the matching result according to the first matching weight, the key matching value, the second matching weight and the conventional matching value.
Wherein the apparatus further comprises:
the biological characteristic acquisition module is used for determining relative object information associated with the source image and acquiring relative biological characteristics corresponding to the relative object information;
and the biological feature comparison module is used for acquiring a target biological feature associated with the target image, and determining that the source image and the target image have the same identity information when the target biological feature and the related biological feature have an association relationship.
Wherein the apparatus further comprises:
a sample data acquisition module for acquiring a first sample data set and a second sample data set; the first sample data set comprises a corresponding relation between a first sample image and first identity tag information, and the second sample data set comprises a corresponding relation between a second sample image, second identity tag information and sample sign states;
The similarity curve determining module is used for determining a first similarity curve corresponding to the first sample data set and a second similarity curve corresponding to the second sample data set based on an initial identification model;
and the network parameter correction module is used for correcting the network parameters of the initial recognition model according to the first similarity curve and the second similarity curve, and determining the initial recognition model containing the corrected network parameters as the recognition model.
Wherein, the similarity curve determining module includes:
a combination unit, configured to perform pairwise combination on the first sample images included in the first sample data set to obtain a first sample pair set, and perform pairwise combination on the second sample images included in the second sample data set to obtain a second sample pair set;
a similarity determining unit, configured to obtain, based on the initial recognition model, a first similarity between sample images included in each first sample pair in the first sample pair set, and a second similarity between sample images included in each second sample pair in the second sample pair set;
and the curve determining unit is used for determining the first similarity curve corresponding to the first sample data set based on the first similarity and determining the second similarity curve corresponding to the second sample data set based on the second similarity.
Wherein, the network parameter correction module includes:
a loss function determining unit, configured to determine a loss function corresponding to the initial recognition model based on the first similarity curve and the second similarity curve;
and the correction unit is used for correcting the network parameters of the initial recognition model according to the loss function, and determining the initial recognition model containing the corrected network parameters as the recognition model.
Wherein the apparatus further comprises:
the sample characteristic acquisition module is used for acquiring a second sample image contained in the second sample data set, and acquiring a sample identification characteristic corresponding to the second sample image based on an initial encoder in an initial generation model;
a sample prediction image generating module, configured to obtain a sample sign state condition, input the sample sign state condition and the sample identification feature into an initial decoder of the initial generation model, and generate a sample prediction image of the second sample image under the sample sign state condition based on the initial decoder;
and the training module is used for training the initial generation model based on the second sample image and the sample prediction image, and determining the initial generation model after training as the generation model.
An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program is executed by the processor, so that the processor performs the steps of the method in an aspect of the embodiments of the present application.
An aspect of the present embodiments provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, perform the steps of a method as described in an aspect of the embodiments of the present application.
In the embodiments of this application, a first facial feature corresponding to a source image is obtained through a recognition model; a sign state condition is obtained, a predicted image of the source image under the sign state condition is obtained through a generation model, and a second facial feature corresponding to the predicted image is obtained through the recognition model; feature fusion in the same dimensions is performed on the first facial feature and the second facial feature to obtain a retrieval feature for retrieval and comparison, and retrieval is then performed in a facial image library based on the retrieval feature to determine the target image matched with the source image. Because a predicted image of the source image under the sign state condition is generated through the generation model, and the first facial feature acquired from the source image by the recognition model is fused with the second facial feature acquired from the predicted image, the fused feature can represent the information in the source image more accurately, and the accuracy of image recognition can therefore be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a network architecture diagram provided in an embodiment of the present application;
fig. 2 is a schematic view of a scenario of an image data processing method according to an embodiment of the present application;
fig. 3 is a flowchart of an image data processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an image processing flow provided in an embodiment of the present application;
FIG. 5 is a flowchart of another image data processing method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of training an initial recognition model provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a training initial decoder provided by an embodiment of the present application;
fig. 8 is a schematic structural view of an image data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence, that is, the study of the design principles and implementation methods of various intelligent machines, enables the machines to have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The solution provided by the embodiment of the application belongs to Computer Vision technology (CV) and Machine Learning (ML) which belong to the field of artificial intelligence.
Computer vision technology is a science that studies how to make a machine "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on a target, and further performs graphic processing so that the image becomes more suitable for human eyes to observe or for transmission to an instrument for detection. In particular, the present application relates to face recognition (Human Face Recognition) in computer vision, which is a biometric technology for performing identity recognition based on the facial feature information of a person. A series of related technologies, commonly referred to as image recognition or face recognition, use a camera to collect an image or video stream containing a face, automatically detect and track the face in the image or video stream, and then perform face recognition on the detected face.
Please refer to fig. 1, which is a network architecture diagram provided in an embodiment of the present application. The network architecture may include a server 200a and a plurality of terminal devices (as shown in fig. 1, specifically including a terminal device 100a, a terminal device 100b, and a terminal device 100 c), and the server 200a may perform data transmission with each terminal device through a network.
Taking the terminal device 100a as an example, when the terminal device 100a acquires a source image input by a user and a sign state condition selected by the user, the terminal device 100a may transmit the acquired source image and sign state condition to the server 200a. The server 200a may obtain a first facial feature corresponding to the source image based on the recognition model, predict a predicted image of the source image under the sign state condition based on the generation model, and obtain a second facial feature corresponding to the predicted image based on the recognition model. The server 200a may then perform feature fusion of the first facial feature and the second facial feature in the same dimensions to obtain a retrieval feature for retrieval and comparison, and compare the retrieval feature against each facial image feature in the facial image library to obtain matching results, so that the images corresponding to the M facial image features with the highest matching results (M is a positive integer, for example M = 5) are taken as the target images matched with the source image. After determining the target images, the server 200a may return them to the terminal device 100a, so that the terminal device 100a saves the returned target images.
Of course, if the terminal device 100a integrates the face recognition function and the image generation function, the terminal device 100a may directly extract the first facial feature corresponding to the source image, generate the predicted image of the source image under the condition of the sign state, further extract the second facial feature corresponding to the predicted image, and directly determine the target image matched with the source image from the facial image library based on the search feature after the first facial feature and the second facial feature are fused. The following will specifically describe how the terminal device 100a generates a predicted image of the source image under the condition of the sign state, and extracts a first facial feature corresponding to the source image and a second facial feature corresponding to the predicted image. The terminal device 100a, the terminal device 100b, the terminal device 100c, and the like may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a mobile internet device (mobile internet device, MID), a wearable device (e.g., a smart watch, a smart bracelet, and the like), and the like.
Fig. 2 is a schematic view of a scenario of an image data processing method according to an embodiment of the present application. Taking a cross-age missing-person search scenario as an example, suppose a missing person (hereinafter referred to as "Xiao A") was lost accidentally as a young child and the family has searched for many years without result. As shown in fig. 2, an image 20a taken before Xiao A went missing (the image 20a is a facial image containing Xiao A's complete face) is uploaded to a terminal device; after the image 20a is successfully uploaded, the terminal device can acquire the image 20a and use the image 20a as the source image.
The terminal device may acquire a recognition model 20b, where the recognition model 20b (which may also be referred to as a cross-age face recognition model) has already been trained on facial images of different age stages and has the capability of cross-age face recognition; based on the recognition model 20b, facial features in facial images may be extracted, and the extracted facial features are insensitive to age. The terminal device may input the image 20a into the recognition model 20b and, based on the recognition model 20b, extract the facial feature 20c from the image 20a. The facial feature 20c is attribute information describing the face contained in the image 20a; that is, the facial characteristics of Xiao A in the image 20a can be described through the facial feature 20c, and the facial feature 20c may include information such as Xiao A's eye size, mouth size, nose size, chin shape, and face contour shape.
Because Xiao A's face has changed greatly in the many years since going missing, the original photo alone is not well suited to the search. After uploading the image 20a to the terminal device, the user may also input Xiao A's current age, and the terminal device may generate a predicted image 20g of Xiao A at the current age based on the uploaded image 20a and the input current age. The specific generation process of the predicted image 20g can be expressed as follows: the terminal device acquires the generation model 20d, inputs the image 20a into the generation model 20d, and encodes the image 20a based on an encoder 20e in the generation model 20d to obtain the facial discrimination feature corresponding to the image 20a (the facial discrimination feature can be used to distinguish Xiao A's face in the image 20a from the faces in other images); the terminal device may acquire an age code corresponding to Xiao A's current age (which may also be referred to as the state feature corresponding to the sign state condition), splice the age code with the facial discrimination feature obtained by the encoder 20e, and input the spliced joint feature into the decoder 20f of the generation model 20d, which may upsample the input feature to generate the predicted image 20g of Xiao A at the current age.
The generation model 20d (which may also be referred to as a cross-age generation model) has already been trained based on a generative adversarial network and has the function of generating images under a specified age condition; that is, a facial image under any age condition can be generated based on the generation model 20d. The age code may be a one-hot code, that is, a vector in which only the position corresponding to the current age is 1 and the remaining positions are 0, where the dimension of the age code corresponds to the age range preset by the terminal device according to actual requirements. For example, if the age range set by the terminal device is 1-30 years old, the age code may be a 30-dimensional vector; when Xiao A's current age is 10 years old, the age code is a vector whose 10th position is 1 and whose remaining positions are 0, and based on the generation model 20d, the predicted image 20g of Xiao A at the age of 10 can be generated.
After obtaining the predicted image 20g based on the generation model 20d, the terminal device may input the predicted image 20g into the recognition model 20b, and based on the recognition model 20b, the facial feature 20h may be extracted from the predicted image 20g; the facial feature 20h may be used to describe the facial information of Xiao A in the predicted image 20g. Both the facial feature 20c and the facial feature 20h may be used to describe Xiao A's facial information, except that the facial feature 20c describes Xiao A's facial information before going missing, i.e., in childhood, while the facial feature 20h describes Xiao A's facial information at the current age.
The terminal device may fuse the facial feature 20c and the facial feature 20h to obtain a fused feature 20i (which may also be referred to as a retrieval feature); the fused feature 20i can describe Xiao A's facial characteristics more completely. Based on the fused feature 20i, retrieval and comparison can be performed in a student-status photo library 20j of age-appropriate students provided by the relevant department, and the set of face photos 20k with the highest matching results against the fused feature 20i can be determined. The terminal device may extract the facial feature corresponding to each photo in the photo library 20j based on the recognition model 20b; for example, if the photo library 20j contains the student-status photos of the age-appropriate students, represented as facial image 1, facial image 2, facial image 3, …, then facial feature 1 corresponding to facial image 1, facial feature 2 corresponding to facial image 2, …, can be extracted based on the recognition model 20b. By calculating the matching result between the fused feature 20i and the facial feature corresponding to each photo in the photo library 20j, the top 5 facial images with the highest matching results are taken as the face photo set 20k, and the person in each photo of the face photo set 20k can be regarded as a suspected target (i.e., possibly Xiao A). The suspected targets corresponding to the face photo set 20k are then further identified by the relevant departments, for example through biological feature detection, and the target image 20m is thereby determined. In other words, the person in the target image 20m is Xiao A.
Fig. 3 is a schematic flow chart of an image data processing method according to an embodiment of the present application. As shown in fig. 3, the image data processing method may include:
step S101, acquiring a source image, and acquiring a first facial feature corresponding to the source image based on an identification model;
specifically, the terminal device may acquire a source image (corresponding to the image 20a in the embodiment corresponding to fig. 2 above), input the source image into a trained recognition model (corresponding to the recognition model 20b in the embodiment corresponding to fig. 2 above), and extract the first facial feature (corresponding to the facial feature 20c in the embodiment corresponding to fig. 2 above) from the source image based on the recognition model. The source image may refer to an image containing a face, and when the source image is a face image, the recognition model may also be referred to as a face recognition model, that is, the recognition model may be used to extract a face feature, and the recognition model may be formed by a convolutional neural network. Of course, the source image may also refer to an animal face image, such as an image of an animal such as a cat, a dog, etc., and when the source image is a face image of a certain animal, the recognition model at this time may be used to extract facial feature information of the animal (including information of eyes, hair, mouth, etc. of the animal face).
Optionally, the terminal device may also perform preprocessing such as face detection and facial feature point positioning on the source image before inputting the source image into the recognition model. After the terminal device acquires the source image, candidate regions corresponding to the source image can be generated and face detection can be performed on the candidate regions; if a face contour is detected to exist in a candidate region, that candidate region is determined as a candidate face region; regression correction is then performed on the candidate face region to obtain the target face region in the source image, the target face region is determined as the preprocessed image corresponding to the source image, the preprocessed image is input into the recognition model, and the first facial feature is extracted from the preprocessed image based on the recognition model. Face detection and facial key point positioning may use models such as the Single Shot MultiBox Detector (SSD) or the multi-task convolutional neural network (MTCNN), which are not particularly limited herein.
Taking a face image as an example, an MTCNN network model can be used for face detection and feature point positioning of the source image. The MTCNN network model has a cascade structure and may comprise a three-stage convolutional neural network, where the first-stage convolutional network may be composed of convolution layers, and the second-stage and third-stage convolutional networks may be composed of convolution layers and fully connected layers. The terminal device may resize the source image at multiple scales to form an image pyramid, that is, the source image is resized into a plurality of images of different sizes, where the image at the top of the pyramid is the smallest, such as 13 x 13, and the image at the bottom of the pyramid is the largest, such as the size of the source image itself. Each image in the image pyramid can be input into the MTCNN network, which can quickly generate candidate windows (also referred to as candidate regions); candidate face windows (namely candidate windows containing a face, also referred to as candidate face regions) are then obtained from the candidate windows, and each candidate face window can be calibrated based on bounding-box regression, that is, the obtained candidate face window is finely adjusted so that it is closer to a given expected window. For example, if candidate face window 1 is represented by a four-dimensional vector (0.8, 0.8, 1.0, 1.0) and the expected window corresponding to candidate face window 1 is represented as (1.0, 1.0, 1.4, 1.4), candidate face window 1 may be calibrated based on bounding-box regression, and the calibrated window may be represented as (0.9, 0.9, 1.2, 1.2); the window (0.9, 0.9, 1.2, 1.2) calibrated by bounding-box regression is closer to the expected window (1.0, 1.0, 1.4, 1.4). Here, the first dimension of the four-dimensional vector represents the abscissa of the window center, the second dimension represents the ordinate of the window center, the third dimension represents the window width, and the fourth dimension represents the window height. After bounding-box regression calibration is performed on the candidate face windows, non-maximum suppression (NMS) may be used to screen the calibrated candidate face windows and remove redundant ones; for example, if the overlap between candidate face window 1 and candidate face window 2 is detected to be larger than a threshold, one of the two candidate face windows can be deleted. Based on the three-stage convolutional network in the MTCNN network, a target face region containing the whole face in the source image and the positions of 5 basic feature points in the source image can finally be obtained, and the target face region carrying the positions of the 5 basic feature points is determined as the preprocessed image.
In other words, the main purpose of face detection and feature point positioning is to check the source image for a usable face. If no face exists in the source image, or only a small part of a face is included (such as only the forehead, the mouth or the nose), or the included face is too small (for example, the face width is only 5% of the width of the source image, so that the facial features cannot be clearly distinguished), the source image may be determined to be an invalid image (i.e., an image in which no usable face exists) and subsequent processing of the source image is stopped; that is, the invalid image does not need to be input into the recognition model for feature extraction. If a face exists in the source image, bounding-box regression and non-maximum suppression can be performed on the candidate windows corresponding to the source image, finally obtaining the target face region in the source image and the positions of the 5 basic feature points (covering the mouth, nose and eyes) corresponding to the target face region, and the target face region carrying the 5 basic feature points can be determined as the preprocessed image corresponding to the source image. The faces contained in the preprocessed image are frontal faces located at the center of the image, so the preprocessed image may also be called a standardized face image.
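For illustration, the following is a minimal NumPy sketch of the candidate-window calibration and screening described above: bounding-box regression fine-tuning of a candidate face window followed by non-maximum suppression. The (center x, center y, width, height) window format, the exponential width/height update, the 0.5 overlap threshold and the function names are assumptions made for this sketch and are not taken from the patent.

```python
import numpy as np

def calibrate_window(window, offsets):
    # Fine-tune a candidate face window (cx, cy, w, h) toward the expected
    # window using predicted regression offsets (dx, dy, dw, dh).
    cx, cy, w, h = window
    dx, dy, dw, dh = offsets
    return np.array([cx + dx * w, cy + dy * h, w * np.exp(dw), h * np.exp(dh)])

def iou(a, b):
    # Overlap (intersection over union) of two windows given as (cx, cy, w, h).
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(windows, scores, overlap_threshold=0.5):
    # Keep the highest-scoring windows and drop any window whose overlap with
    # an already kept window exceeds the threshold (redundant candidate).
    order = np.argsort(scores)[::-1]
    kept = []
    for idx in order:
        if all(iou(windows[idx], windows[k]) <= overlap_threshold for k in kept):
            kept.append(idx)
    return kept

# Example: two heavily overlapping candidate face windows; the lower-scoring one is removed.
windows = np.array([[0.9, 0.9, 1.2, 1.2], [1.0, 1.0, 1.3, 1.3]])
scores = np.array([0.95, 0.90])
print(non_max_suppression(windows, scores))   # keeps only the first window's index
```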
The terminal device inputs the preprocessed image into the recognition model, and the first facial feature of the face in the source image can then be extracted. For example, if the source image is a facial image of user 1 in childhood, face detection and feature point positioning on the source image yield a standardized child face image, and inputting this standardized child face image into the recognition model extracts the facial feature of user 1 in childhood. The recognition model may include a plurality of convolution layers, and the convolution layers perform convolution operations on the input image (i.e., the preprocessed image) to obtain a feature map (i.e., the first facial feature) corresponding to the input image.
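As a purely illustrative sketch (not the recognition model described in this application), the following PyTorch snippet shows the general idea of a convolutional recognition model mapping a preprocessed face image to a facial feature vector; the layer configuration, the 112 x 112 input size and the 512-dimensional feature length are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SimpleRecognitionModel(nn.Module):
    # A toy stand-in for the recognition model: stacked convolution layers
    # followed by a fully connected layer that outputs a facial feature vector.
    def __init__(self, feature_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feature_dim)

    def forward(self, x):
        feat = self.backbone(x).flatten(1)
        return self.fc(feat)

# Example: extract a facial feature from a standardized 112 x 112 face image.
model = SimpleRecognitionModel()
preprocessed_image = torch.randn(1, 3, 112, 112)   # placeholder for the preprocessed image
first_facial_feature = model(preprocessed_image)   # shape: (1, 512)
```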
Step S102, acquiring a sign state condition, acquiring a predicted image of the source image under the sign state condition based on a generation model, and acquiring a second facial feature corresponding to the predicted image based on the identification model;
specifically, the terminal device may further input the source image into a generation model, acquire a facial recognition feature corresponding to the source image based on an encoder in the generation model, acquire a physical sign state condition corresponding to the source image, determine a state feature corresponding to the physical sign state condition, and splice the state feature and the facial recognition feature to obtain a joint feature; the joint feature is input into a decoder for generating a model, a predicted image of the source image under the condition of the sign state can be generated based on the decoder, the predicted image can be input into a recognition model, and the second facial feature is extracted from the predicted image based on the recognition model.
The first facial feature, the second facial feature, and the facial recognition feature in the embodiment of the present application are feature vectors for describing facial information of an image, but the facial features (i.e., the first facial feature and the second facial feature) are different from the facial recognition feature in terms of description of facial information, and the emphasis of tendency is different. The first facial feature and the second facial feature are obtained based on a recognition model, and the first facial feature and the second facial feature are more prone to describe inherent attribute information of faces contained in the image, such as the distance between pupils of two eyes, the size of eyes, the shape of five sense organs including mouth, lip thickness, nose size, eyebrow thickness and the like, and the shape of facial outline and the like, and have certain robustness to age change, namely the facial information represented by the first facial feature and the second facial feature cannot be changed greatly with age; the facial recognition feature is obtained based on an encoder in the generated model, and the facial recognition feature is more prone to describe facial information with discrimination in the image, such as facial global information of facial skin texture, facial contour and the like in the image, and local information of eyes, nose, mouth, eyebrows, chin and the like.
The following describes the generation process of the predicted image in detail. The terminal device obtains a generation model, which may comprise an encoder and a decoder; the source image is input into the encoder, and the encoder can downsample the input source image to obtain the facial discrimination feature corresponding to the source image. Of course, before the source image is input to the encoder, it may first be preprocessed, that is, the preprocessed image obtained through face detection and feature point positioning in step S101 above may be input to the encoder instead. The terminal device may further obtain an initial vector corresponding to the sign state and a sign state condition, determine the target sign dimension corresponding to the sign state condition, update the value corresponding to the target sign dimension in the initial vector, and determine the updated initial vector as the state feature corresponding to the sign state condition, where the initial vector includes values corresponding to a plurality of sign dimensions, and the plurality of sign dimensions naturally includes the target sign dimension. The sign state may include an age state, a stature state, a health state, and the like; the initial vector corresponding to the sign state is a vector whose elements are all zero, and the dimension of the initial vector is related to the type of the sign state.
When the sign state is an age state, the initial vector may be a 100-dimensional zero vector used to represent an age in the range of 1-100, and the sign state condition may be any age in that range. For example, when the sign state condition is 12, the target sign dimension corresponding to the sign state condition is the 12th position in the initial vector; the value 0 at the 12th position of the initial vector is updated to the value 1, and the updated initial vector is determined to be the state feature. In other words, the state feature is a vector whose 12th position is 1 and whose remaining positions are 0; it should be understood that the state feature is a vector in which only the position corresponding to the sign state condition is 1 and the remaining positions are 0. Optionally, the sign state may be a stature state, in which case the initial vector may be a 5-dimensional zero vector used to represent five stature conditions of different degrees ranging from fat to thin, and the sign state condition may be any one of these conditions. Optionally, the sign state may be a health state, in which case the initial vector may be a 5-dimensional zero vector used to represent different health states such as a flushed face, fever, normal, purple lips, or a sallow face, and the sign state condition may be any one of these states.
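A minimal sketch of constructing such a state feature as a one-hot vector, following the 100-dimensional age example above (age 12 sets the 12th position to 1); the function name and the NumPy representation are illustrative assumptions.

```python
import numpy as np

def build_state_feature(sign_state_condition, num_dims=100):
    # Start from an all-zero initial vector and set the target sign dimension
    # (here the position corresponding to the given age) to 1.
    state_feature = np.zeros(num_dims, dtype=np.float32)
    state_feature[sign_state_condition - 1] = 1.0   # 1-based age -> 0-based index
    return state_feature

age_code = build_state_feature(12)   # 12th position is 1, all other positions are 0
```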
After determining the state features corresponding to the physical sign state conditions, the terminal equipment can splice the state features and the facial discrimination features to obtain joint features, input the joint features into a decoder, and up-sample the input joint features based on a plurality of deconvolution layers in the decoder to obtain a predicted image of the source image under the physical sign state conditions. For example, when the source image is a face image of the user at age 2 and the sign condition is age 12, the predicted image is a predicted image of the user at age 12, that is, the predicted image and the source image represent the same person.
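The following PyTorch sketch illustrates the splicing and decoding step under simplifying assumptions: the facial discrimination feature and the state feature are treated as flat vectors, the joint feature is formed by concatenation, and the decoder upsamples it with deconvolution (transposed convolution) layers. The layer shapes, the 256-dimensional encoder output and the 64 x 64 output resolution are illustrative and not specified by the patent.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    # Maps the joint feature (facial discrimination feature + state feature)
    # to a predicted face image via deconvolution layers.
    def __init__(self, id_dim=256, state_dim=100):
        super().__init__()
        self.fc = nn.Linear(id_dim + state_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, facial_id_feature, state_feature):
        joint_feature = torch.cat([facial_id_feature, state_feature], dim=1)  # splicing
        x = self.fc(joint_feature).view(-1, 128, 8, 8)
        return self.deconv(x)   # predicted image, here 3 x 64 x 64

decoder = ToyDecoder()
facial_id_feature = torch.randn(1, 256)                           # placeholder encoder output
state_feature = torch.zeros(1, 100); state_feature[0, 11] = 1.0   # one-hot state feature for age 12
predicted_image = decoder(facial_id_feature, state_feature)
```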
The terminal device inputs the predicted image obtained by generating the model into the recognition model, and based on the recognition model, the second facial feature can be obtained from the predicted image, and the second facial feature extraction process can be described in S101 above with reference to the first facial feature extraction process, which is not described in detail herein.
Step S103, carrying out feature fusion on the first facial feature and the second facial feature in the same dimension to obtain retrieval features;
specifically, the terminal device may perform feature fusion on the first facial feature extracted from the source image and the second facial feature extracted from the predicted image to obtain a retrieval feature for image retrieval. Feature fusion refers to fusion in the same dimension of the features: a feature fusion function is acquired, a first weight parameter corresponding to the first facial feature and a second weight parameter corresponding to the second facial feature are determined, and the first facial feature and the second facial feature are fused according to the first weight parameter and the second weight parameter to obtain the retrieval feature. That is, the fusion may be realized by the following formula: fea3 = g(k×fea1, (1-k)×fea2), where g() denotes a manually designed feature fusion function, fea1 denotes the first facial feature, fea2 denotes the second facial feature, fea3 denotes the retrieval feature obtained after feature fusion, k denotes the first weight parameter corresponding to the first facial feature, and (1-k) denotes the second weight parameter corresponding to the second facial feature. For example, g() may be a normalization function for limiting the modulus length of the retrieval feature fea3 to 1, and k may be a weight parameter within the range [0, 1].
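A short sketch of this fusion formula, assuming that g() is realized as an element-wise weighted sum followed by L2 normalization (one plausible reading; the patent only requires g() to limit the modulus length to 1):

```python
import numpy as np

def fuse_features(fea1: np.ndarray, fea2: np.ndarray, k: float = 0.5) -> np.ndarray:
    """fea3 = g(k*fea1, (1-k)*fea2): same-dimension fusion, then normalization."""
    fea3 = k * fea1 + (1.0 - k) * fea2     # element-wise fusion in the same dimension
    return fea3 / np.linalg.norm(fea3)     # g(): limit the modulus length of fea3 to 1
```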
In this embodiment, the feature stitching mentioned in step S102 may be implemented by concat (a feature merging manner). Feature stitching may be understood as an increase in the number of feature channels: the number of features used to describe the image is increased, while the amount of information under each feature is unchanged. Feature fusion may be implemented by add (another feature merging manner): the amount of information under each feature describing the image is increased, but the number of dimensions describing the image is not, so it may be called feature fusion in the same dimension. For example, if the first facial feature and the second facial feature each include feature information describing parts such as the eyes, nose, mouth, eyebrows, and chin, feature fusion of the first facial feature and the second facial feature in the same dimension may be understood as: fusing the local feature describing the eyes in the first facial feature with the local feature describing the eyes in the second facial feature to obtain a more accurate local feature of the eyes; fusing the local feature describing the nose in the first facial feature with the local feature describing the nose in the second facial feature to obtain a more accurate local feature of the nose; and so on.
And step S104, carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
Specifically, the terminal device may acquire a face image library, acquire the third facial feature corresponding to each face image in the face image library based on the recognition model, match the retrieval feature with each third facial feature in the same dimension, respectively calculate matching values of the retrieval feature and each third facial feature in each dimension, determine the matching result corresponding to the retrieval feature and each third facial feature according to the matching values corresponding to the dimensions, rank all face images in the face image library from high to low according to the matching results, and acquire a target image from the ranked face images, where the target image may refer to one face image or to multiple face images. When there are n face images in the face image library, the third facial features corresponding to the n face images can be extracted based on the recognition model and expressed as f_1, f_2, …, f_n. Each third facial feature f_i may include local features in multiple dimensions; the matching values of the retrieval feature fea3 and each third facial feature (f_1, f_2, …, f_n) in each dimension are then calculated respectively, the matching result between the retrieval feature and each third facial feature is determined according to the matching values corresponding to the dimensions, and after all matching results have been calculated, they are sorted from high to low, and the face images corresponding to the top M (M is a positive integer, which may be preset) similarities are determined as the target images. The similarity calculation method may include Euclidean distance, Manhattan distance, Minkowski distance, cosine similarity, Pearson correlation coefficient, and the like, and is not limited here.
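As an illustrative sketch of this retrieval step (function and variable names are assumptions), cosine similarity can be computed between the retrieval feature and each third facial feature, the matching results sorted from high to low, and the top M images returned:

```python
import numpy as np

def retrieve_top_m(fea3: np.ndarray, gallery: np.ndarray, m: int = 5):
    """gallery: (n, d) matrix holding the third facial features f_1..f_n of the library."""
    gallery_norm = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    query_norm = fea3 / np.linalg.norm(fea3)
    scores = gallery_norm @ query_norm      # cosine similarity as the matching result
    order = np.argsort(-scores)             # sort matching results from high to low
    return order[:m], scores[order[:m]]     # indices and scores of the top M candidate images
```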
When determining the matching result based on the matching values corresponding to the dimensions, the dimensions may include key dimensions and conventional dimensions, where a key dimension refers to local information in the face image that is insensitive to changes over time, that is, the local information corresponding to a key dimension does not change with increasing age, and a conventional dimension refers to the local information in the face image other than the key dimensions. In other words, a change in a key dimension has a greater influence on the face recognition result than a change in a conventional dimension. The weights corresponding to the key dimensions and the conventional dimensions may be set in advance; for convenience of description, the weight corresponding to the key dimensions may be referred to as a first matching weight, and the weight corresponding to the conventional dimensions may be referred to as a second matching weight, where the first matching weight is greater than the second matching weight. Taking a single third facial feature as an example, the determination process of the matching result is described specifically: assume that the eyes, mouth and chin of a face are determined as key dimensions, and the eyebrows and skin are determined as conventional dimensions. A matching value 1 between the eye local feature in the retrieval feature and the eye local feature in the third facial feature, a matching value 2 between the mouth local feature in the retrieval feature and the mouth local feature in the third facial feature, and a matching value 3 between the chin local feature in the retrieval feature and the chin local feature in the third facial feature are calculated; matching values 1, 2 and 3 may be called key matching values. A matching value 4 between the eyebrow local feature in the retrieval feature and the eyebrow local feature in the third facial feature, and a matching value 5 between the skin local feature in the retrieval feature and the skin local feature in the third facial feature, are calculated respectively; matching value 4 and matching value 5 may be called conventional matching values. The products of the first matching weight with each key matching value and of the second matching weight with each conventional matching value are then summed to obtain the matching result between the retrieval feature and the third facial feature.
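One possible reading of this key-dimension weighting, sketched below; the grouping of dimensions into key and conventional, the weight values, and the use of cosine similarity as the matching value are all assumptions for illustration:

```python
import numpy as np

def weighted_match(query_parts: dict, gallery_parts: dict,
                   key_dims=("eyes", "mouth", "chin"),
                   w_key: float = 0.7, w_regular: float = 0.3) -> float:
    """query_parts / gallery_parts map a local dimension name (e.g. 'eyes') to its
    local feature vector; key dimensions receive the larger first matching weight."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    result = 0.0
    for name, q in query_parts.items():
        weight = w_key if name in key_dims else w_regular
        result += weight * cos(q, gallery_parts[name])   # weight x matching value
    return result
```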
Optionally, the terminal device may determine, with the help of the relevant department, information on relatives of the subject included in the source image, and acquire the biological features corresponding to those relatives; it may further acquire, through the relevant department, the target biological feature corresponding to the subject included in the target image. When the relatives' biological features have an association relationship with the target biological feature, it may be determined that the source image and the target image have the same identity information. For example, the source image is the face image of the lost person Xiao A at age 2; the relatives of Xiao A (such as Xiao A's father, mother, and siblings) may be determined by the relevant department, and the biological features (such as genes) of Xiao A's relatives may be obtained. Meanwhile, the biological features of the person to whom the target image belongs (that is, the person corresponding to the target image, such as Xiao B) may be collected through the relevant department. When it is detected that Xiao B's biological features have a kinship relationship with those of Xiao A's relatives, it may be determined that Xiao A and Xiao B are the same person.
Please refer to fig. 4, which is a schematic diagram of an image processing flow according to an embodiment of the present application. Taking the case where the source image is a face photograph of a lost/abducted child as an example, the overall image processing flow is described specifically. As shown in fig. 4, the image 30a (i.e., the source image) is a photograph of the lost/abducted child taken before he or she was lost or abducted. After obtaining the image 30a, the terminal device may perform preprocessing such as face detection and facial feature point positioning on the image 30a to obtain a standardized child image (i.e., the preprocessed image), input the standardized child image into the trained cross-age recognition model 30b (i.e., the recognition model 20b in the embodiment corresponding to fig. 2 above), and extract, based on the cross-age recognition model 30b, a first facial feature that is robust to age changes from the standardized child image; input the standardized child image into the trained cross-age generation model 30c (i.e., the generation model 20d in the embodiment corresponding to fig. 2 above), introduce an age code as a condition variable, and generate, based on the cross-age generation model 30c, a standardized facial appearance prediction image of the child as grown up after being lost (e.g., at 12 years old), namely the predicted image; input the facial appearance prediction image generated by the cross-age generation model 30c into the cross-age recognition model 30b, and extract, based on the cross-age recognition model 30b, a second facial feature that is robust to age changes from the facial appearance prediction image; and perform feature fusion on the first facial feature and the second facial feature to obtain the retrieval face feature 30d that is finally used to characterize the lost/abducted child's face.
The terminal device may acquire a student status photo library 30e provided by the education department, and the student status photo library 30e may include all student status photos of students close to the current age of the lost/abducted child. Of course, the relevant department may first determine one or more potential regions where the lost/abducted child is most likely to appear, and then take all student status photos in those regions of students close to the current age of the lost/abducted child as the student status photo library 30e. Each photo in the student status photo library 30e is input into the cross-age recognition model 30b, and the facial feature corresponding to each student status photo is extracted, e.g., facial feature 1 corresponding to photo 30i, facial feature 2 corresponding to photo 30j, facial feature 3 corresponding to photo 30k, facial feature 4 corresponding to photo 30m, and so on. The retrieval face feature 30d is compared against the facial feature corresponding to each student status photo, and the photos corresponding to the top M (e.g., M=5) facial features with the highest feature comparison matching results are output as the retrieval result 30f. The student identity corresponding to each photo in the retrieval result 30f is then determined by the relevant department as a suspected target for the lost/abducted child. The relevant department further acquires other biological features (such as genes) of the suspected targets, confirms the lost/abducted child from among the suspected targets, and finally obtains the current student status photo of the lost/abducted child as image 30h.
According to the method, the first facial features corresponding to the source image are obtained through the recognition model, the sign state conditions are obtained, the prediction image of the source image under the sign state conditions is obtained through the generation model, the second facial features corresponding to the prediction image are obtained through the recognition model, feature fusion on the same dimension is conducted on the first facial features and the second facial features, search features for search comparison are obtained, and then searching is conducted in a facial image library based on the search features, and the target image matched with the source image is determined. According to the method, the prediction image of the source image under the condition of the sign state is generated through the generation model, and then the first facial feature acquired from the source image by the recognition model is fused with the second facial feature acquired from the prediction image, so that the fused features can more accurately represent the information in the source image, and the accuracy of image recognition can be improved.
Fig. 5 is a flowchart of another image data processing method according to an embodiment of the present application. As shown in fig. 5, the image data processing method may include:
step S201, a first sample data set and a second sample data set are obtained;
Before the facial feature extraction is performed on the source image by using the recognition model, the recognition model needs to be trained, and the following steps S201 to S203 specifically describe the training process of the recognition model. Similarly, before the predicted image corresponding to the source image is acquired by using the generation model, the generation model needs to be trained, and the following steps S204 to S206 specifically describe the training process of the generation model. It should be understood that the training processes of the recognition model and the generation model are independent of each other and may be performed simultaneously.
In particular, the terminal device may acquire a first sample data set and a second sample data set for training the recognition model. The first sample data set may be a universal face recognition training set, and the first sample data set may include a first sample image and a correspondence between the first sample image and first identity tag information, that is, the first sample data set is a sample face image with different identity information, and each sample face image carries a real identity tag (i.e., first identity tag information); the second sample data set may be an acquired cross-age face recognition training set, and the second sample data set may include a second sample image, and a correspondence between the second sample image, second identity tag information and sample sign states, that is, the second sample data set is a sample face image of a different age group, and each sample face image carries a real identity tag (i.e., second identity tag information) and a real age tag (i.e., sample sign state). The second sample data set has a smaller data size than the first sample data set. In the training process of the recognition model, the first sample data set and the second sample data set can be mixed to be regarded as the same data set, and the model training is performed by using a common loss function; alternatively, the first sample data set and the second sample data set may be passed through two branches to train the recognition model to improve the accuracy of cross-age face image recognition. Common loss functions may include, among others, softmax (a loss function), arcface (a loss function), and the like. In the embodiment of the present application, model training is performed by taking the first sample data set and the second sample data set as two branches as an example, and specific description will be made.
Step S202, determining a first similarity curve corresponding to a first sample data set and a second similarity curve corresponding to a second sample data set based on an initial recognition model;
specifically, the terminal device may perform pairwise combination on the first sample images included in the first sample data set to obtain a first sample pair set, and perform pairwise combination on the second sample images included in the second sample data set to obtain a second sample pair set.
The first sample pair set may include a first positive sample pair and a first negative sample pair, where the first positive sample pair may be formed by sample face images having the same identity tag information in the first sample data set, that is, two-by-two combinations of sample face images belonging to the same person in the first sample data set are performed to construct a first positive sample pair; the first negative sample pair may be formed of sample face images having different identity tag information in the first sample data set, i.e., sample face images not belonging to the same person in the first sample data set are combined two by two to construct the first negative sample pair.
The second sample pair set may include second positive sample pairs and second negative sample pairs, where a second positive sample pair may be formed by different sample face images in the second sample data set that have the same identity tag information and a large age span, that is, sample face images belonging to the same person in the second sample data set are combined in pairs to construct the second positive sample pairs; a second negative sample pair may be formed by sample face images with different identity tag information in the second sample data set, that is, sample face images of different persons in the second sample data set are combined in pairs to construct the second negative sample pairs.
The terminal device acquires an initial recognition model, and acquires a first similarity between sample images contained in each first sample pair (including the first positive sample pair and the first negative sample pair) in the first sample pair set and a second similarity between sample images contained in each second sample pair (including the second positive sample pair and the second negative sample pair) in the second sample pair set based on the initial recognition model.
The terminal equipment inputs the sample face images contained in the first sample pair set into an initial recognition model, and can acquire feature vectors corresponding to each sample face image based on the initial recognition model, so that a first vector group corresponding to each sample pair in the first sample pair set can be acquired, and first similarity between two sample face images contained in each sample pair in the first sample pair is calculated according to the first vector group. For example, when there is one sample pair (image 1, image 2) in the first sample pair set, the initial recognition model may be used to obtain the feature vector 1 corresponding to the image 1 and the feature vector 2 corresponding to the image 2, where the feature vector 1 and the feature vector 2 are the first vector group corresponding to the sample pair (image 1, image 2), and the similarity between the feature vector 1 and the feature vector 2 is the first similarity corresponding to the sample pair (image 1, image 2).
Similarly, the terminal device may input the sample face images included in the second sample pair set into the initial recognition model, and may obtain, based on the recognition model, a second vector group corresponding to each sample pair in the second sample pair set, and calculate, according to the second vector group, a second similarity between two sample face images included in each sample pair in the second sample pair.
The terminal device may determine a first similarity curve corresponding to the first sample data set based on the first similarity, and determine a second similarity curve corresponding to the second sample data set based on the second similarity. When the first sample pair set includes a first positive sample pair and a first negative sample pair, the second sample pair set includes a second positive sample pair and a second negative sample pair, the first similarity curve corresponding to the first sample data set may include a first positive sample similarity curve (i.e., a similarity curve corresponding to a sample face image having the same identity tag information in the first sample data set) and a first negative sample similarity curve (i.e., a similarity curve corresponding to a sample face image having different identity tag information in the first sample data set), and the second similarity curve corresponding to the second sample data set may also include a second positive sample similarity curve (i.e., a similarity curve corresponding to a sample face image having the same identity tag information in the second sample data set) and a second negative sample similarity curve (i.e., a similarity curve corresponding to a sample face image having different identity tag information in the second sample data set).
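An illustrative sketch of how the sample pairs and their similarity curves might be computed, with the curves approximated as normalized histograms of pairwise cosine similarities; none of the names below are prescribed by the patent:

```python
import numpy as np

def pair_similarities(features: np.ndarray, labels: np.ndarray):
    """features: (n, d) vectors from the initial recognition model; labels: identity tags.
    Returns similarities of positive pairs (same identity) and negative pairs (different)."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T
    pos, neg = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            (pos if labels[i] == labels[j] else neg).append(sims[i, j])
    return np.array(pos), np.array(neg)

def similarity_curve(sims: np.ndarray, bins: int = 50) -> np.ndarray:
    """A similarity 'curve' as a normalized histogram over the similarity axis."""
    hist, _ = np.histogram(sims, bins=bins, range=(-1.0, 1.0), density=True)
    return hist
```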
Step S203, according to the first similarity curve and the second similarity curve, correcting the network parameters of the initial recognition model, and determining the initial recognition model containing the corrected network parameters as a recognition model;
specifically, the terminal device may determine a loss function corresponding to the initial recognition model according to the first similarity curve and the second similarity curve, calculate a distance between the first similarity curve and the second similarity curve according to the loss function, and correct the network parameter of the initial recognition model by using the distance as the adjustment information.
It should be understood that when the first similarity curve includes a first positive sample similarity curve and a first negative sample similarity curve, and the second similarity curve includes a second positive sample similarity curve and a second negative sample similarity curve, a first distance between the first positive sample similarity curve and the second positive sample similarity curve, a second distance between the first negative sample similarity curve and the second negative sample similarity curve, and a degree of overlap between the second positive sample similarity curve and the second negative sample similarity curve may be calculated respectively; the network parameters of the initial recognition model may then be corrected based on the first distance, the second distance and the degree of overlap, and the initial recognition model containing the corrected network parameters (i.e., the trained initial recognition model) may be determined as the recognition model.
Please refer to fig. 6, which is a schematic diagram of training an initial recognition model according to an embodiment of the present application. As shown in fig. 6, the terminal device obtains a training set 40a for training the initial recognition model. The training set 40a may include a first sample data set P_ε and a second sample data set P_H. The first sample data set P_ε is a general face image set, such as the CelebA data set (a published large face recognition data set containing about 200k face images) or the ColorFERET database (a general face library published by the U.S. Department of Defense, containing more than 10000 photos of more than 1000 people). The second sample data set P_H is a cross-age face image set, and may include face images of different people at different ages. The amount of image data in the second sample data set P_H is small, typically several hundred or a few thousand images.
For the first sample data set P_ε, the terminal device may construct negative sample pairs x~P_ε composed of face images with different identities, and positive sample pairs (x_i, x_j)~P_ε, where x~P_ε represents face images of different identities contained in the first sample data set P_ε, (x_i, x_j)~P_ε represents an image pair in the first sample data set P_ε belonging to the same person, and x_i, x_j respectively represent different face images of the same person in the first sample data set P_ε.
For the second sample data set P_H, the terminal device may construct negative sample pairs x~P_H composed of face images with different identities, and positive sample pairs (x_m, x_n)~P_H, where x~P_H represents face images of different identities contained in the second sample data set P_H, (x_m, x_n)~P_H represents an image pair in the second sample data set P_H belonging to the same person, and x_m, x_n respectively represent different face images of the same person in the second sample data set P_H; for example, x_m may be a face image of a person at 3 years old, and x_n may be a face image of the same person at 20 years old.
Optionally, the terminal device may perform batch processing on the first sample data set P_ε and the second sample data set P_H respectively; that is, according to actual needs, the first sample data set P_ε and the second sample data set P_H are divided into a plurality of data batches (minibatches), where each data batch may include face images from the first sample data set P_ε and face images from the second sample data set P_H. Then, within each data batch, positive sample pairs and negative sample pairs are constructed respectively for the images belonging to the first sample data set P_ε and for the images belonging to the second sample data set P_H.
The terminal device may successively input the face images in each data batch into an initial recognition model, where the initial recognition model may be a convolutional neural network. The initial recognition model may output a corresponding feature vector for each input face image, so that a feature vector set 40b corresponding to the first sample data set P_ε may be obtained, where the feature vector set 40b may include the feature vector pairs corresponding to the positive sample pairs (x_i, x_j)~P_ε and the feature vectors corresponding to the negative sample pairs x~P_ε. A feature vector set 40c corresponding to the second sample data set P_H may also be obtained, where the feature vector set 40c may include the feature vector pairs corresponding to the positive sample pairs (x_m, x_n)~P_H and the feature vectors corresponding to the negative sample pairs x~P_H.
Based on the feature vector set 40b, the similarity between the image pairs in the first sample data set P_ε may be calculated (cosine distance, Euclidean distance, or the like may be used), so as to determine the first similarity curve 40d corresponding to the first sample data set P_ε, where the first similarity curve 40d may include the similarity curve 40f corresponding to face images with different identities in the first sample data set P_ε and the similarity curve 40g corresponding to face images with the same identity. Based on the feature vector set 40c, the similarity between the image pairs in the second sample data set P_H may be calculated, so as to determine the second similarity curve 40e corresponding to the second sample data set P_H, where the second similarity curve 40e may include the similarity curve 40h corresponding to face images with different identities in the second sample data set P_H and the similarity curve 40i corresponding to face images with the same identity.
A loss function corresponding to the initial recognition model is then determined according to the similarity curve 40f, the similarity curve 40g, the similarity curve 40h and the similarity curve 40i, the loss function is optimized by an optimization algorithm (such as stochastic gradient descent) to correct the network parameters of the initial recognition model, and when the loss function converges, the initial recognition model with the optimal network parameters is determined as the recognition model.
The process of correcting the network parameters of the initial recognition model may include: calculating the distance between the similarity curve 40f and the similarity curve 40h, which may be referred to as distance 1; calculating the distance between the similarity curve 40g and the similarity curve 40i, which may be referred to as distance 2; and calculating the degree of overlap between the similarity curve 40h and the similarity curve 40i. The network parameters of the initial recognition model are then corrected using distance 1, distance 2 and the degree of overlap as correction information.
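The patent names distance 1, distance 2 and the degree of overlap only abstractly; the sketch below assumes a simple L1 histogram distance and a shared-area overlap between the histogram-style similarity curves from the earlier sketch:

```python
import numpy as np

def curve_distance(curve_a: np.ndarray, curve_b: np.ndarray) -> float:
    """L1 distance between two equally binned similarity curves (distance 1 / distance 2)."""
    return float(np.abs(curve_a - curve_b).sum())

def curve_overlap(curve_a: np.ndarray, curve_b: np.ndarray) -> float:
    """Degree of overlap between two similarity curves: the shared area under both."""
    return float(np.minimum(curve_a, curve_b).sum())

# Illustrative combination of the correction information for the initial recognition model:
# loss = curve_distance(curve_40f, curve_40h) + curve_distance(curve_40g, curve_40i) \
#        + curve_overlap(curve_40h, curve_40i)
```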
Step S204, obtaining a second sample image contained in a second sample data set, and obtaining sample identification features corresponding to the second sample image based on an initial encoder in an initial generation model;
specifically, the terminal device may train the initial encoder and the initial decoder of the initial generation model separately: before the initial decoder starts training, the initial encoder needs to be trained in advance, and after the initial encoder has been trained, that is, after the encoder is acquired, the terminal device may start to train the initial decoder.
The terminal device may train the initial decoder using the second sample data set, input the second sample image in the second sample data set into the trained encoder, and obtain the sample authentication feature corresponding to the second sample image based on the trained encoder. In other words, after the second sample image is input into the encoder, the sample image may be downsampled in the encoder, and the sample authentication feature in the second sample image may be extracted.
Before the second sample image is input to the encoder, the second sample image may be further preprocessed, where the preprocessing may be referred to the description of step S101 in the embodiment corresponding to fig. 3, and no further description is given here.
Step S205, obtaining sample sign state conditions, inputting the sample sign state conditions and sample identification features into an initial decoder of an initial generation model, and generating a sample prediction image of a second sample image under the sample sign state conditions based on the initial decoder;
specifically, the terminal device may obtain a sample sign state condition, determine the sample state feature corresponding to the sample sign state condition based on the initial vector corresponding to the sign state, and splice the sample identification feature and the sample state feature to obtain a sample joint feature. The initial vector corresponding to the sign state is preset; it is a zero vector, that is, a vector whose elements are all zero, and its dimension is determined according to the sign type of the sign state. For example, if the sign state is an age state and the age range is 1-100, the dimension of the initial vector can be determined to be 100; if the age range is 10-90, the dimension of the initial vector can be determined to be 80. In the training process, the real age of the second sample image can be used as the sample sign state condition corresponding to the second sample image, and the subsequent process of generating the sample prediction image can then be understood as an image reconstruction process.
Optionally, face images in the second sample dataset with the same identity tag information and different age information may be formed into a set of training sample sets, for example, face images corresponding to the user 1 in the training sample set 1, face images corresponding to the user 2 in the training sample set 2, and so on, where each face image carries real age information. When an image is selected from the training sample set 1 and input into the initial generation model, the sample sign state condition can be age information corresponding to any face image in the training sample set 1.
The terminal equipment acquires an initial decoder, namely, the terminal equipment initializes the decoder needing training, and the decoder which completes initialization is called an initial decoder. The terminal device inputs the sample joint feature into an initial decoder, and up-samples the sample joint feature based on the initial decoder to generate a sample prediction image corresponding to the second sample image. In other words, a sample predicted image is an image obtained by an already trained encoder and an initial decoder.
Step S206, training an initial generation model based on the second sample image and the sample prediction image, and determining the initial generation model after training to be a generation model;
Specifically, the terminal device acquires a discrimination model corresponding to the initial decoder. The discrimination model can be used to identify the probability that an image belongs to the real image type, that is, to distinguish real images from the false images generated by the initial decoder; the discrimination model can also be used to estimate the sign state error of an image, that is, the error between the estimated sign state information corresponding to a false image generated by the initial decoder and the sample sign state condition. For example, when the sample sign state condition is 8 years old, the sample prediction image generated by the initial decoder is input into the discrimination model, the discrimination model estimates the age information corresponding to the sample prediction image, and the error between the estimated age information and the sample sign state condition (namely, 8 years old) can then be determined.
The following describes how to obtain the discrimination model: since the discrimination model is mainly used for distinguishing the real image from the false image (i.e. the sample prediction image) generated by the initial decoder, that is, the discrimination model can be used for classifying the problems, the terminal equipment needs to initialize the classifying model, and the classifying model after the initialization is used as the discrimination model. The data for training the discrimination model may include the acquired image obtained by the terminal device from the second sample data set and the sample prediction image generated by the initial decoder, and the final purpose of training the discrimination model is that the discrimination model may determine the acquired image as a true image class and the sample prediction image generated by the initial decoder as a false image type. The discriminant model may also be used to estimate an error between the estimated physical sign state information and the sample physical sign state condition corresponding to the sample predictive image, and the training of the discriminant model may further include minimizing the error between the estimated physical sign state information and the sample physical sign state condition of the sample predictive image.
In the training stage of the discrimination model, since the training data of the discrimination model are related to both the collected images and the sample prediction images generated by the initial decoder, the discrimination error corresponding to the discrimination model, which may also be referred to as the generative adversarial loss function, may be expressed as formula (1):
L_dis(D) = min_G max_D E_x[log(D(x))] + E_{x,a}[1 - log(D(G(F(x), a)))]    (1)
where L_dis represents the discrimination error corresponding to the discrimination model, G represents the initial decoder, D represents the discrimination model, F represents the encoder, x represents a real image collected from the sample training data set, a represents the sample state feature corresponding to the sample sign state condition, F(x) represents the sample discrimination feature extracted by the encoder, and G(F(x), a) represents the simulated image (also called a false image, such as the sample prediction image) generated by the initial decoder under the sample sign state condition. Therefore, D(G(F(x), a)) represents the probability that the sample prediction image, generated by the initial decoder and corresponding to the sample image x, belongs to the real image type, and D(x) represents the probability that the sample image x belongs to the real image type. The discrimination error is thus determined based on the probabilities that the sample image and the sample prediction image are respectively judged to belong to the real image type.
The discrimination model may also be used to calculate an estimation error of the sample sign state condition. Taking the case where the sample sign state condition is an age as an example, the discrimination model may calculate an age estimation error (i.e., the sign state error), which may be expressed as formula (2):
L_age(D) = ‖q̂ - q‖_1    (2)

where ‖·‖_1 denotes the 1-norm, q̂ denotes the estimated age of the sample prediction image, and q denotes the true age; q can be understood as the sample sign state condition, and when the sample sign state condition is the true age corresponding to the sample image, q is the age information corresponding to the sample image.
Based on the above formula (1) and formula (2), a total error corresponding to the discrimination model (which may also be referred to as a second total error, distinguished from a total error corresponding to the initial decoder) may be determined, and may be expressed as formula (3):
L_D = L_age(D) + λ_dis·L_dis(D)    (3)
where λ_dis represents a hyper-parameter used to control the relative weight of the discrimination error with respect to the age estimation error (i.e., the sign state error), and may also be referred to as an error weight parameter. Formula (3) can be understood as the sum of the age estimation error and the discrimination error in the discrimination model. By minimizing formula (3), that is, minimizing the second total error of the discrimination model, the network parameters of the discrimination model may be corrected to obtain the optimal network parameters of the discrimination model.
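A PyTorch-style sketch of the second total error L_D = L_age(D) + λ_dis·L_dis(D). The discrimination error is written here in the standard GAN discriminator form, which differs slightly in sign convention from formula (1) as printed; the discriminator interface (a real/fake probability plus an age estimate) is an assumption:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_prob, d_fake_prob, age_pred, age_true, lambda_dis=1.0):
    """d_real_prob: D(x) for collected images; d_fake_prob: D(G(F(x), a)) for sample
    prediction images; age_pred / age_true: estimated and real ages (formulas (1)-(3))."""
    l_dis = -(torch.log(d_real_prob + 1e-8).mean()
              + torch.log(1.0 - d_fake_prob + 1e-8).mean())   # discrimination (adversarial) error
    l_age = F.l1_loss(age_pred, age_true)                      # 1-norm age estimation error
    return l_age + lambda_dis * l_dis                          # second total error L_D
```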
The training process of the initial decoder and the discrimination model can be regarded as a game process, that is, the purpose of the discrimination model is opposite to that of the initial decoder. The purpose of training the discrimination model is that, for any image input into the discrimination model, it can accurately distinguish whether the input image is a real image or a simulated image. When the probability value output by the discrimination model D is larger than 0.5, the input image is judged to be a real image; when the probability value output by the discrimination model D is smaller than 0.5, the input image is judged to be a simulated image (which may be referred to as a false image). In other words, for the discrimination model, in formula (1) the larger the value of log(D(x)) the better (D(x) has a maximum value of 1), and the smaller the value of log(D(G(F(x), a))) the better (D(G(F(x), a)) has a minimum value of 0).
The complete loss function (which may also be referred to as a first total error) of the initial decoder is described below.
Taking the case where the sample sign state condition is the real age information of the sample image as an example, after the sample prediction image is input into the discrimination model in the initial decoder training stage, the adversarial error corresponding to the initial decoder can be calculated and expressed as formula (4):
L_adv(F, D) = E_{x,a}[-log(D(G(F(x), a)))]    (4)
Formula (4) can be understood as a variant of formula (1): during the training of the initial decoder, the discrimination model can be regarded as fixed, so the term E_x[log(D(x))] in formula (1) is a constant. Here the goal of the initial generation network is to maximize log(D(G(F(x), a))), that is, to minimize L_adv(F, D).
When the sample sign state condition is the true age corresponding to the sample object, the process of generating the sample prediction image by the initial decoder is equivalent to the reconstruction process of the sample image, in order to improve the capability of the initial decoder to reconstruct the image, the pixel error between the sample image and the sample prediction image may be calculated, and the pixel error may be expressed as formula (5):
L_pixel(F, G) = ‖x - G(F(x), a)‖_1    (5)
where L_pixel(F, G) denotes the pixel error, which may also be referred to here as the reconstruction error; the network parameters of the initial decoder are corrected by minimizing the error between the sample image x and the reconstructed image.
Optionally, if the sample sign state condition is the true age corresponding to another image in the training sample set in which the input image of the initial generation model is located, for example, if the input sample image x is a face image of a person at age 2, the sample sign state condition is age 8, and the training sample set contains a face image y of the same person at age 8, then formula (5) may be rewritten as: L_pixel(F, G) = ‖y - G(F(x), a)‖_1, where y denotes the real face image corresponding to the sample sign state condition a in the sample training data set.
In order to ensure that the sample predicted image generated by the initial decoder has the discrimination characteristics of the original input image (i.e., the second sample image), an encoder (which is the same model as the encoder for acquiring the sample discrimination characteristics) is also introduced after the sample predicted image is generated, and by inputting the sample predicted image generated by the initial decoder into the encoder, it is identified whether the sample predicted image and the original input image are of the same class. The terminal device may determine an error between the sample prediction image and the original input image as a classification error, i.e., determine a classification error corresponding to the sample prediction image by calculating identity similarity between the sample prediction image and the second sample image, which may be expressed as formula (6):
L_cls = -v^T·log F(G(F(x), a))    (6)
where L_cls represents the classification error of the sample prediction image, and v represents the real classification label feature corresponding to the sample prediction image. It should be noted that the real classification label feature here refers to the face class corresponding to the sample image x; face images of the same person at different ages all have the same classification label feature. The classification constraint of the encoder can thus guide the learning of the discrimination feature in the sample prediction image.
As can be seen from the above, the total error of the initial decoder (which may also be referred to as the first total error) can be expressed as formula (7):
L_gen = λ_pixel·L_pixel + λ_cls·L_cls + λ_adv·L_adv    (7)
where λ_pixel, λ_cls and λ_adv are all weight parameters: λ_pixel is the weight parameter corresponding to the pixel error and may also be referred to as the first weight parameter; λ_cls is the weight parameter corresponding to the classification error and may also be referred to as the second weight parameter; λ_adv is the weight parameter corresponding to the adversarial error and may also be referred to as the third weight parameter. Formula (7) can be understood as the sum of the pixel error, the classification error and the adversarial error after each has been multiplied by its corresponding weight parameter. The network parameters of the initial decoder are corrected by minimizing formula (7), that is, by minimizing the first total error of the initial decoder, and iteration continues until the first total error is smaller than a target threshold, or the change rate of the first total error is smaller than a change rate threshold, or the number of iterations reaches a target number; the network parameters of the initial decoder obtained at that point are the optimal parameters.
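A matching sketch of the first total error L_gen = λ_pixel·L_pixel + λ_cls·L_cls + λ_adv·L_adv from formulas (4)-(7); the weight values, tensor shapes and the cross-entropy form of the classification error are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def generator_loss(x_target, x_generated, d_fake_prob, cls_logits, cls_label,
                   lambda_pixel=10.0, lambda_cls=1.0, lambda_adv=1.0):
    """x_target: real image for the sample sign state condition; x_generated: G(F(x), a);
    d_fake_prob: D(G(F(x), a)); cls_logits: encoder classification of the generated image."""
    l_pixel = F.l1_loss(x_generated, x_target)       # pixel / reconstruction error, formula (5)
    l_adv = -torch.log(d_fake_prob + 1e-8).mean()    # adversarial error, formula (4)
    l_cls = F.cross_entropy(cls_logits, cls_label)   # classification error, formula (6)
    return lambda_pixel * l_pixel + lambda_cls * l_cls + lambda_adv * l_adv   # formula (7)
```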
The network parameters of the initial decoder and of the discrimination model are corrected based on formula (3) and formula (7); after the optimal network parameters are reached, the encoder and the initial decoder with the optimal network parameters are determined as the generation model. The generation model can be used to generate cross-age face images, that is, the generation model is used to generate the predicted image corresponding to the source image, where the predicted image and the source image have the same identity information.
It can be understood that, for the same person, as long as a face image of that person at a certain age exists, that face image can be taken as the original input image, and the trained generation model can generate a face image of the person corresponding to any other age; the generated face images under the different age conditions can explicitly show how the person's face changes with age. It should be understood that the discrimination model is no longer used once the generation model has been trained, i.e., during the generation of the predicted image.
Please refer to fig. 7, which is a schematic diagram of training the initial decoder according to an embodiment of the present application. As shown in fig. 7, the training process of the initial decoder specifically includes: inputting a sample image (namely a second sample image in the second sample data set) into the trained encoder, acquiring the sample discrimination feature corresponding to the sample image in the encoder, acquiring the sample sign state condition corresponding to the sample image, inputting the sample discrimination feature and the sample state feature corresponding to the sample sign state condition into the initial decoder, generating a sample prediction image associated with the sample state feature based on the up-sampling process in the initial decoder, and inputting the sample prediction image into the discrimination model, which judges whether the sample prediction image is a real image or a false image, estimates the age of the sample prediction image, and determines the error between the estimated age and the real age. Meanwhile, the terminal device may also input the sample prediction image into the encoder, extract the features of the sample prediction image with the encoder, classify the sample prediction image according to the extracted features, identify the degree of matching between the sample prediction image and each face class, and determine the classification error corresponding to the sample prediction image.
Based on all second sample images contained in the second sample data set, the initial generation network is trained by adopting the training process, so that an initial decoder with stronger image imitation capability can be obtained, and the initial decoder can learn the mapping relation between face images of the same person at different ages. In other words, for the initial generation network after training, as long as the original face image and age information to be converted are input, a face image at a specific age can be generated based on the encoder and the trained initial decoder, and the newly generated predicted face image and the original face image represent the same person.
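Pulling the pieces together, a highly simplified alternating training loop for the initial decoder and the discrimination model might look as follows. Every name here (second_sample_loader, one_hot_state_batch, disc, encoder_classifier, the optimizers) is a placeholder assumed for illustration, and the loop reuses the loss sketches above:

```python
# illustrative alternating training sketch for the initial decoder and discrimination model
for x, age_true, cls_label in second_sample_loader:          # second sample data set
    with torch.no_grad():
        feat = encoder(x)                                    # sample discrimination feature (encoder fixed)
    state = one_hot_state_batch(age_true)                    # sample state feature
    fake = decoder(feat, state)                              # sample prediction image

    # 1) update the discrimination model (second total error, formula (3))
    prob_fake, age_fake = disc(fake.detach())
    d_loss = discriminator_loss(disc(x)[0], prob_fake, age_fake, age_true.float())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) update the initial decoder (first total error, formula (7))
    prob, _ = disc(fake)
    g_loss = generator_loss(x, fake, prob, encoder_classifier(fake), cls_label)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```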
Step S207, acquiring a source image, and acquiring a first facial feature corresponding to the source image based on an encoder;
step S208, acquiring physical sign state conditions, acquiring a predicted image of the source image under the physical sign state conditions based on the generation model, and acquiring a second facial feature corresponding to the predicted image based on the encoder;
step S209, carrying out feature fusion on the first facial feature and the second facial feature in the same dimension to obtain retrieval features;
and step S210, carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
The specific implementation manner of step S207 to step S210 may refer to step S101 to step S104 in the embodiment corresponding to fig. 3, and will not be described herein.
According to the method, the first facial features corresponding to the source image are obtained through the recognition model, the sign state conditions are obtained, the prediction image of the source image under the sign state conditions is obtained through the generation model, the second facial features corresponding to the prediction image are obtained through the recognition model, feature fusion on the same dimension is conducted on the first facial features and the second facial features, search features for search comparison are obtained, and then searching is conducted in a facial image library based on the search features, and the target image matched with the source image is determined. According to the method, the prediction image of the source image under the condition of the sign state is generated through the generation model, and then the first facial feature acquired from the source image by the recognition model is fused with the second facial feature acquired from the prediction image, so that the fused features can more accurately represent the information in the source image, and the accuracy of image recognition can be improved.
It will be appreciated that in particular embodiments of the present application, related data such as images or videos of users (e.g., face images or videos may be specifically) may be acquired, and when the embodiments of the present application are applied to particular products or technologies, permission or consent of the related users needs to be obtained, and collection, use and processing of the related data needs to comply with related laws and regulations and standards of related countries and regions.
Fig. 8 is a schematic structural diagram of an image data processing apparatus according to an embodiment of the present application. As shown in fig. 8, the image data processing apparatus 1 may include: the device comprises a first acquisition module 11, a second acquisition module 12, a feature fusion module 13 and a retrieval module 14;
a first obtaining module 11, configured to obtain a source image, and obtain a first facial feature corresponding to the source image based on an identification model;
a second obtaining module 12, configured to obtain a sign state condition, obtain a predicted image of the source image under the sign state condition based on a generating model, and obtain a second facial feature corresponding to the predicted image based on the identifying model;
the feature fusion module 13 is configured to fuse the features of the first facial feature and the second facial feature in the same dimension to obtain a search feature;
and the retrieval module 14 is configured to perform feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to a facial image included in the facial image library, obtain a matching result, and determine a target image corresponding to the source image based on the matching result.
The specific functional implementation manners of the first obtaining module 11, the second obtaining module 12, the feature fusion module 13, and the retrieving module 14 may refer to step S101 to step S104 in the embodiment corresponding to fig. 3, which are not described herein.
Referring to fig. 8, the image data processing apparatus 1 may further include: a sample data acquisition module 15, a similarity curve determination module 16, a network parameter correction module 17, a sample feature acquisition module 18, a sample predicted image generation module 19, a training module 20, a biological feature acquisition module 21 and a biological feature comparison module 22;
a sample data acquisition module 15 for acquiring a first sample data set and a second sample data set; the first sample data set comprises a corresponding relation between a first sample image and first identity tag information, and the second sample data set comprises a corresponding relation between a second sample image, second identity tag information and sample sign states;
a similarity curve determining module 16, configured to determine a first similarity curve corresponding to the first sample data set and a second similarity curve corresponding to the second sample data set based on an initial recognition model;
a network parameter correction module 17, configured to correct a network parameter of the initial recognition model according to the first similarity curve and the second similarity curve, and determine the initial recognition model including the corrected network parameter as the recognition model;
A sample feature obtaining module 18, configured to obtain a second sample image included in the second sample data set, and obtain a sample authentication feature corresponding to the second sample image based on an initial encoder in an initial generation model;
a sample prediction image generating module 19, configured to obtain a sample sign state condition, input the sample sign state condition and the sample identification feature into an initial decoder of the initial generation model, and generate a sample prediction image of the second sample image under the sample sign state condition based on the initial decoder;
a training module 20, configured to train the initial generation model based on the second sample image and the sample prediction image, and determine the initial generation model after training as the generation model;
a biological feature acquisition module 21, configured to determine related subject information associated with the source image, and acquire related biological features corresponding to the related subject information;
the biometric comparison module 22 is configured to acquire a target biometric associated with the target image, and determine that the source image and the target image have the same identity information when the target biometric has an association relationship with a related biometric.
The specific functional implementation of the sample data obtaining module 15, the similarity curve determining module 16, the network parameter correcting module 17, the sample feature obtaining module 18, the sample predicted image generating module 19, and the training module 20 may refer to step S201-step S206 in the embodiment corresponding to fig. 5, and the specific functional implementation of the biological feature obtaining module 21 and the biological feature comparing module 22 may refer to step S104 in the embodiment corresponding to fig. 3, which are not described herein.
Referring to fig. 8, the first obtaining module 11 may include: a preprocessing unit 111, a first feature extraction unit 112;
a preprocessing unit 111, configured to acquire the source image, perform face detection on the source image, and determine, when detecting that a target face area exists in the source image, the target face area as a preprocessed image corresponding to the source image;
a first feature extraction unit 112, configured to input the preprocessed image into a recognition model, and extract the first facial feature from the preprocessed image based on the recognition model.
The specific functional implementation manner of the preprocessing unit 111 and the first feature extraction unit 112 may refer to step S101 in the embodiment corresponding to fig. 3, which is not described herein.
Referring also to fig. 8, the second acquisition module 12 may include: an encoding unit 121, a splicing unit 122, a decoding unit 123, and a second feature extraction unit 124;
an encoding unit 121, configured to input the source image into an encoder of a generation model, and acquire a facial identification feature corresponding to the source image based on the encoder;
a splicing unit 122, configured to acquire the sign state condition, splice a state feature corresponding to the sign state condition with the facial identification feature, and obtain a joint feature;
a decoding unit 123 for inputting the joint feature to a decoder of the generation model, generating a predicted image of the source image under the sign state condition based on the decoder;
a second feature extraction unit 124 for inputting the predicted image into the recognition model, and extracting the second facial feature from the predicted image based on the recognition model.
The specific functional implementation manners of the encoding unit 121, the splicing unit 122, the decoding unit 123, and the second feature extraction unit 124 may refer to step S102 in the embodiment corresponding to fig. 3, which is not described herein.
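As an illustration of the splicing unit 122 alone, the short sketch below assumes the sign state condition is encoded as a one-hot state feature and appended to the facial identification feature to form the joint feature handed to the decoder; the list of states and the feature length are hypothetical.

```python
# Sketch of the splicing unit 122 (assumed one-hot state encoding):
# the joint feature is the facial identification feature with the
# state feature appended along the feature axis.
import numpy as np

STATES = ["child", "adult", "middle_aged", "elderly"]  # hypothetical sign states

def state_feature(condition: str) -> np.ndarray:
    """Encode a sign state condition as a one-hot state feature."""
    vec = np.zeros(len(STATES), dtype=np.float32)
    vec[STATES.index(condition)] = 1.0
    return vec

facial_identification_feature = np.random.rand(128).astype(np.float32)  # from the encoder
joint_feature = np.concatenate([facial_identification_feature,
                                state_feature("adult")])
print(joint_feature.shape)  # (132,) -> fed to the decoder of the generation model
```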
Referring to fig. 8, the feature fusion module 13 may include: a parameter acquisition unit 131, a retrieval feature determination unit 132;
a parameter acquisition unit 131, configured to acquire a first weight parameter corresponding to the first facial feature and a second weight parameter corresponding to the second facial feature;
a retrieval feature determination unit 132, configured to perform feature fusion on the same dimension on the first facial feature and the second facial feature based on the first weight parameter and the second weight parameter, to obtain a retrieval feature.
The specific functional implementation manner of the parameter acquisition unit 131 and the retrieval feature determination unit 132 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein.
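The fusion performed by units 131 and 132 is described only as feature fusion on the same dimension under two weight parameters. A minimal sketch of one such fusion, assuming scalar weights, an element-wise weighted sum, and an added L2 re-normalization, is given below; the weight values are illustrative.

```python
import numpy as np

def fuse_features(first_feature, second_feature, w1=0.6, w2=0.4):
    """Feature fusion on the same dimension: an element-wise weighted sum
    of the two facial features (weights w1, w2 are illustrative values)."""
    first_feature = np.asarray(first_feature, dtype=np.float32)
    second_feature = np.asarray(second_feature, dtype=np.float32)
    retrieval = w1 * first_feature + w2 * second_feature
    # Optional re-normalization so the retrieval feature has unit length.
    return retrieval / (np.linalg.norm(retrieval) + 1e-12)

first_facial_feature = np.random.rand(128)   # from the source image
second_facial_feature = np.random.rand(128)  # from the predicted image
retrieval_feature = fuse_features(first_facial_feature, second_facial_feature)
print(retrieval_feature.shape)  # (128,)
```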
Referring also to fig. 8, the retrieval module 14 may include: a library feature acquisition unit 141, a matching result determination unit 142, a target image selection unit 143;
a library feature acquisition unit 141, configured to acquire, based on the recognition model, a third facial feature corresponding to each facial image in the facial image library;
a matching result determination unit 142, configured to perform feature matching on the same dimension on the retrieval feature and the third facial feature, obtain matching values of the retrieval feature and the third facial feature in each dimension, and determine the matching result based on the matching values corresponding to each dimension;
a target image selection unit 143, configured to sort the facial images based on the matching result, and select the target image from the sorted facial images according to the sorting order.
The dimensions include key dimensions and conventional dimensions; the matching values corresponding to the dimensions comprise key matching values and conventional matching values;
the matching result determination unit 142 is specifically configured to:
acquiring key matching values of the retrieval feature and the third facial feature in the same key dimension, and acquiring conventional matching values of the retrieval feature and the third facial feature in the same conventional dimension;
acquiring a first matching weight corresponding to the key dimension and a second matching weight corresponding to the conventional dimension;
and determining the matching result according to the first matching weight, the key matching value, the second matching weight, and the conventional matching value.
The specific functional implementation manner of the library feature acquisition unit 141, the matching result determination unit 142, and the target image selection unit 143 may refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein.
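Units 141 to 143 compute per-dimension matching values, weight the key and conventional dimensions separately, and sort the library by the resulting matching result. The sketch below assumes a per-dimension matching value of one minus the absolute difference and fixed matching weights; neither choice is specified by the patent.

```python
import numpy as np

def matching_result(retrieval_feature, third_feature, key_dims,
                    key_weight=0.7, regular_weight=0.3):
    """Weighted matching over key and conventional dimensions.
    The per-dimension matching value (1 - absolute difference) and the
    two matching weights are illustrative choices, not taken from the patent."""
    retrieval_feature = np.asarray(retrieval_feature, dtype=np.float32)
    third_feature = np.asarray(third_feature, dtype=np.float32)
    per_dim = 1.0 - np.abs(retrieval_feature - third_feature)  # matching value per dimension
    key_mask = np.zeros(per_dim.shape[0], dtype=bool)
    key_mask[list(key_dims)] = True
    key_value = per_dim[key_mask].mean()        # key matching value
    regular_value = per_dim[~key_mask].mean()   # conventional matching value
    return key_weight * key_value + regular_weight * regular_value

# Rank every face image in the library by its matching result.
library = {f"face_{i}": np.random.rand(128) for i in range(5)}   # third facial features
query = np.random.rand(128)                                      # retrieval feature
scores = {name: matching_result(query, feat, key_dims=range(0, 32))
          for name, feat in library.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # target image: the best-matching face image
```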
Referring also to fig. 8, the similarity curve determining module 16 may include: a combining unit 161, a similarity determining unit 162, a curve determining unit 163;
A combining unit 161, configured to perform pairwise combination on the first sample images included in the first sample data set to obtain a first sample pair set, and perform pairwise combination on the second sample images included in the second sample data set to obtain a second sample pair set;
a similarity determining unit 162, configured to obtain, based on the initial recognition model, a first similarity between sample images included in each first sample pair in the first sample pair set, and a second similarity between sample images included in each second sample pair in the second sample pair set;
a curve determining unit 163, configured to determine the first similarity curve corresponding to the first sample data set based on the first similarity, and determine the second similarity curve corresponding to the second sample data set based on the second similarity.
The specific functional implementation manner of the combining unit 161, the similarity determining unit 162, and the curve determining unit 163 may refer to step S202 in the embodiment corresponding to fig. 5, which is not described herein.
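The exact form of the similarity curve is not defined by units 161 to 163 beyond being derived from pairwise similarities. The sketch below assumes cosine similarity between embeddings and a normalized histogram of the pairwise similarities as the curve; the random vectors stand in for outputs of the initial recognition model, and in practice the pairs would additionally be grouped by the identity tag information.

```python
import numpy as np
from itertools import combinations

def similarity_curve(embeddings, bins=20):
    """Pairwise-combine the sample embeddings, compute the cosine similarity of
    each pair, and summarize the similarities as a normalized histogram
    (one plausible reading of the 'similarity curve')."""
    embeddings = [e / np.linalg.norm(e) for e in embeddings]
    sims = [float(a @ b) for a, b in combinations(embeddings, 2)]
    hist, _ = np.histogram(sims, bins=bins, range=(-1.0, 1.0), density=True)
    return hist / hist.sum()

# Stand-ins for initial-recognition-model features of the two sample sets.
first_set = [np.random.rand(128) for _ in range(50)]    # first sample images
second_set = [np.random.rand(128) for _ in range(50)]   # second sample images
first_curve = similarity_curve(first_set)
second_curve = similarity_curve(second_set)
```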
Referring to fig. 8, the network parameter correction module 17 may include: a loss function determination unit 171, a correction unit 172;
A loss function determining unit 171, configured to determine a loss function corresponding to the initial recognition model based on the first similarity curve and the second similarity curve;
and a correction unit 172, configured to correct the network parameters of the initial recognition model according to the loss function, and determine the initial recognition model including the corrected network parameters as the recognition model.
The specific function implementation manner of the loss function determining unit 171 and the correcting unit 172 may refer to step S203 in the embodiment corresponding to fig. 5, which is not described herein.
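The loss function built by unit 171 from the two similarity curves is not disclosed. Purely as an assumed example, the sketch below uses a symmetric KL divergence between the two curves as an alignment term; in practice such a term would be combined with an identity classification loss and minimized by gradient descent to correct the network parameters of the initial recognition model.

```python
import numpy as np

def curve_alignment_loss(first_curve, second_curve, eps=1e-8):
    """Assumed loss term: symmetric KL divergence between the two similarity
    curves, encouraging the second (cross-sign-state) similarity distribution
    to match the first. This is an illustrative choice only."""
    p = np.asarray(first_curve, dtype=np.float64) + eps
    q = np.asarray(second_curve, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)

# Toy curves: a uniform histogram versus a mildly skewed one.
first_curve = np.ones(20) / 20
second_curve = np.linspace(1.0, 2.0, 20)
second_curve = second_curve / second_curve.sum()
print(curve_alignment_loss(first_curve, second_curve))
```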
Referring to fig. 8, the preprocessing unit 111 may include: a face detection subunit 1111, a candidate region determination subunit 1112, and a target region determination subunit 1113;
a face detection subunit 1111, configured to acquire the source image, generate a candidate region corresponding to the source image, and perform face detection on the candidate region;
a candidate region determination subunit 1112, configured to determine, if a face contour is detected to exist in the candidate region, the candidate region in which the face contour exists as a candidate face region;
the target area determining subunit 1113 is configured to perform regression correction on the candidate face area to obtain the target face area, and determine the target face area as a preprocessed image corresponding to the source image.
The specific functional implementation manner of the face detection subunit 1111, the candidate region determination subunit 1112, and the target region determination subunit 1113 may refer to step S101 in the embodiment corresponding to fig. 3, which is not described herein.
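Subunits 1111 to 1113 generate candidate regions, keep those in which a face contour is detected, and refine them by regression correction. The sketch below uses sliding-window candidates, a placeholder contour score, and a standard bounding-box offset correction; the window size, threshold, and offsets are illustrative, and the contour scorer stands in for a trained detector.

```python
import numpy as np

def candidate_regions(h, w, size=96, stride=48):
    """Generate sliding-window candidate regions as (x, y, width, height)."""
    return [(x, y, size, size)
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]

def face_contour_score(image, box):
    """Placeholder for the face-contour detector: returns a confidence in [0, 1].
    A real system would run a trained face detector here."""
    x, y, bw, bh = box
    return float(image[y:y + bh, x:x + bw].mean())  # stand-in score

def regress_correction(box, offsets):
    """Apply bounding-box regression offsets (dx, dy, dw, dh) to refine the box."""
    x, y, bw, bh = box
    dx, dy, dw, dh = offsets
    return (int(x + dx * bw), int(y + dy * bh),
            int(bw * np.exp(dw)), int(bh * np.exp(dh)))

image = np.random.rand(256, 256)                  # grayscale stand-in for the source image
boxes = candidate_regions(*image.shape)
candidate_faces = [b for b in boxes if face_contour_score(image, b) > 0.5]
# The offsets would come from a regression branch; fixed values are used here.
preprocessed = [regress_correction(b, (0.05, 0.05, -0.1, -0.1)) for b in candidate_faces]
```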
In the embodiments of the present application, the first facial feature corresponding to the source image is obtained through the recognition model, a sign state condition is obtained, a predicted image of the source image under the sign state condition is obtained through the generation model, and the second facial feature corresponding to the predicted image is obtained through the recognition model. Feature fusion on the same dimension is then performed on the first facial feature and the second facial feature to obtain a retrieval feature used for retrieval comparison, and the facial image library is searched based on the retrieval feature to determine the target image matching the source image. Because the generation model produces the predicted image of the source image under the sign state condition, and the first facial feature extracted from the source image by the recognition model is fused with the second facial feature extracted from the predicted image, the fused feature represents the information in the source image more accurately, which improves the accuracy of image recognition.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is configured to implement connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (for example, a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), for example, at least one magnetic disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, which is one type of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function, the user interface 1003 is mainly configured to provide an input interface for a user, and the processor 1001 may be configured to invoke the device control application stored in the memory 1005 to implement:
Acquiring a source image, and acquiring a first facial feature corresponding to the source image based on a recognition model;
acquiring a sign state condition, acquiring a predicted image of the source image under the sign state condition based on a generation model, and acquiring a second facial feature corresponding to the predicted image based on the recognition model;
carrying out feature fusion on the first facial feature and the second facial feature in the same dimension to obtain a retrieval feature;
and carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the image data processing method described in any of the embodiments corresponding to fig. 3 and fig. 5, and may also implement the functions of the image data processing apparatus 1 described in the embodiment corresponding to fig. 8, which are not repeated here. In addition, the description of the beneficial effects of using the same method is also not repeated.
It should be further noted that the embodiments of the present application also provide a computer-readable storage medium storing the computer program executed by the image data processing apparatus 1 mentioned above. The computer program includes program instructions which, when executed by a processor, can perform the image data processing method described in any of the embodiments corresponding to fig. 3 and fig. 5; therefore, the details are not repeated here. In addition, the description of the beneficial effects of using the same method is also not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application.
Those skilled in the art will appreciate that all or part of the procedures of the methods in the above embodiments may be implemented by a computer program stored in a computer-readable storage medium, and the program, when executed, may include the procedures of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is merely a description of preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent variations made according to the claims of the present application shall still fall within the scope of the present application.

Claims (11)

1. An image data processing method, comprising:
acquiring a source image, and acquiring a first facial feature corresponding to the source image based on a recognition model;
inputting the source image into an encoder of a generation model, and acquiring a facial identification feature corresponding to the source image based on the encoder;
acquiring a physical sign state condition, and splicing a state feature corresponding to the physical sign state condition with the facial identification feature to obtain a joint feature;
inputting the joint feature to a decoder of the generation model, and generating a predicted image of the source image under the physical sign state condition based on the decoder;
inputting the predicted image into the recognition model, and extracting a second facial feature from the predicted image based on the recognition model;
acquiring a first weight parameter corresponding to the first facial feature and a second weight parameter corresponding to the second facial feature;
based on the first weight parameter and the second weight parameter, carrying out feature fusion on the first facial feature and the second facial feature in the same dimension to obtain a retrieval feature;
and carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
2. The method of claim 1, wherein the acquiring the source image, acquiring the first facial feature corresponding to the source image based on the recognition model, comprises:
acquiring the source image, performing face detection on the source image, and when it is detected that a target face area exists in the source image, determining the target face area as a preprocessed image corresponding to the source image;
inputting the preprocessed image into the recognition model, and extracting the first facial feature from the preprocessed image based on the recognition model.
3. The method of claim 2, wherein the acquiring the source image, performing face detection on the source image, and when it is detected that a target face area exists in the source image, determining the target face area as a preprocessed image corresponding to the source image comprises:
acquiring the source image, generating a candidate region corresponding to the source image, and performing face detection on the candidate region;
if the face contour exists in the candidate area, determining the candidate area with the face contour as a candidate face area;
and carrying out regression correction on the candidate face area to obtain the target face area, and determining the target face area as a preprocessed image corresponding to the source image.
4. The method according to claim 1, wherein the performing feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to a facial image included in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result comprises:
acquiring, based on the recognition model, a third facial feature corresponding to each facial image in the facial image library;
performing feature matching on the same dimension on the retrieval feature and the third facial feature to obtain matching values of the retrieval feature and the third facial feature in each dimension, and determining the matching result based on the matching values corresponding to each dimension;
and sorting each face image based on the matching result, and selecting the target image from the sorted face images according to the sorting order.
5. The method of claim 4, wherein the dimensions include key dimensions and conventional dimensions; the matching values corresponding to the dimensions comprise key matching values and conventional matching values;
and the performing feature matching on the same dimension on the retrieval feature and the third facial feature to obtain matching values of the retrieval feature and the third facial feature in each dimension, and determining the matching result based on the matching values corresponding to each dimension, comprises:
acquiring key matching values of the retrieval feature and the third facial feature in the same key dimension, and acquiring conventional matching values of the retrieval feature and the third facial feature in the same conventional dimension;
acquiring a first matching weight corresponding to the key dimension and a second matching weight corresponding to the conventional dimension;
and determining the matching result according to the first matching weight, the key matching value, the second matching weight, and the conventional matching value.
6. The method as recited in claim 1, further comprising:
determining relative object information associated with the source image, and acquiring a relative biological feature corresponding to the relative object information;
and acquiring a target biological feature associated with the target image, and determining that the source image and the target image have the same identity information when the target biological feature and the relative biological feature have an association relationship.
7. The method as recited in claim 1, further comprising:
acquiring a first sample data set and a second sample data set; the first sample data set comprises a corresponding relation between a first sample image and first identity tag information, and the second sample data set comprises a corresponding relation between a second sample image, second identity tag information and sample sign states;
determining a first similarity curve corresponding to the first sample data set and a second similarity curve corresponding to the second sample data set based on an initial recognition model;
And correcting the network parameters of the initial recognition model according to the first similarity curve and the second similarity curve, and determining the initial recognition model containing the corrected network parameters as the recognition model.
8. The method as recited in claim 7, further comprising:
acquiring a second sample image contained in the second sample data set, and acquiring sample identification features corresponding to the second sample image based on an initial encoder in an initial generation model;
acquiring a sample sign state condition, inputting the sample sign state condition and the sample identification feature into an initial decoder of the initial generation model, and generating a sample prediction image of the second sample image under the sample sign state condition based on the initial decoder;
and training the initial generation model based on the second sample image and the sample prediction image, and determining the trained initial generation model as the generation model.
9. An image data processing apparatus, comprising:
the first acquisition module is used for acquiring a source image and acquiring a first facial feature corresponding to the source image based on a recognition model;
the encoding unit is used for inputting the source image into an encoder of a generation model, and acquiring a facial identification feature corresponding to the source image based on the encoder;
the splicing unit is used for acquiring a physical sign state condition, and splicing a state feature corresponding to the physical sign state condition with the facial identification feature to obtain a joint feature;
a decoding unit for inputting the joint feature to a decoder of the generation model, and generating a predicted image of the source image under the physical sign state condition based on the decoder;
a second feature extraction unit configured to input the predicted image into the recognition model, and extract a second facial feature from the predicted image based on the recognition model;
the parameter acquisition unit is used for acquiring a first weight parameter corresponding to the first facial feature and a second weight parameter corresponding to the second facial feature;
the retrieval feature determining unit is used for carrying out feature fusion on the first facial feature and the second facial feature in the same dimension based on the first weight parameter and the second weight parameter to obtain retrieval features;
and the retrieval module is used for carrying out feature matching on the same dimension on the retrieval feature and a third facial feature corresponding to the facial image contained in the facial image library to obtain a matching result, and determining a target image corresponding to the source image based on the matching result.
10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 8.
11. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the steps of the method according to any of claims 1 to 8.
CN201910907706.9A 2019-09-24 2019-09-24 Image data processing method, apparatus, computer device, and readable storage medium Active CN110674748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910907706.9A CN110674748B (en) 2019-09-24 2019-09-24 Image data processing method, apparatus, computer device, and readable storage medium

Publications (2)

Publication Number Publication Date
CN110674748A CN110674748A (en) 2020-01-10
CN110674748B (en) 2024-02-13

Family

ID=69078914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910907706.9A Active CN110674748B (en) 2019-09-24 2019-09-24 Image data processing method, apparatus, computer device, and readable storage medium

Country Status (1)

Country Link
CN (1) CN110674748B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210872A (en) * 2020-01-13 2020-05-29 北京奇云诺德信息科技有限公司 Face portrait and face recognition method based on gene data
CN111652828B (en) * 2020-05-27 2023-08-08 北京百度网讯科技有限公司 Face image generation method, device, equipment and medium
CN111985321A (en) * 2020-07-14 2020-11-24 浙江大华技术股份有限公司 Target object detection method, electronic device, and storage medium
CN111985323B (en) * 2020-07-14 2021-10-22 珠海市卓轩科技有限公司 Face recognition method and system based on deep convolutional neural network
CN112101121B (en) * 2020-08-19 2024-04-30 深圳数联天下智能科技有限公司 Face sensitive identification method and device, storage medium and computer equipment
CN111950493B (en) * 2020-08-20 2024-03-08 华北电力大学 Image recognition method, device, terminal equipment and readable storage medium
CN111984065A (en) * 2020-08-20 2020-11-24 广州驰创科技有限公司 Big data interactive teaching practical training platform
CN111967413B (en) * 2020-08-21 2024-04-09 广州市微智联科技有限公司 Pig face recognition method
CN112102326B (en) * 2020-10-26 2023-11-07 北京航星机器制造有限公司 Extraction and segmentation method for security inspection CT image target object
CN112446357B (en) * 2020-12-15 2022-05-03 电子科技大学 SAR automatic target recognition method based on capsule network
CN112800869B (en) * 2021-01-13 2023-07-04 网易(杭州)网络有限公司 Image facial expression migration method and device, electronic equipment and readable storage medium
CN112528977B (en) * 2021-02-10 2021-07-02 北京优幕科技有限责任公司 Target detection method, target detection device, electronic equipment and storage medium
CN113012064B (en) * 2021-03-10 2023-12-12 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112949599B (en) * 2021-04-07 2022-01-14 青岛民航凯亚***集成有限公司 Candidate content pushing method based on big data
CN113240021B (en) * 2021-05-19 2021-12-10 推想医疗科技股份有限公司 Method, device and equipment for screening target sample and storage medium
CN113222122A (en) * 2021-06-01 2021-08-06 重庆大学 High-quality neural network system suitable for singlechip

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011065952A1 (en) * 2009-11-30 2011-06-03 Hewlett-Packard Development Company, L.P. Face recognition apparatus and methods
WO2015078183A1 (en) * 2013-11-29 2015-06-04 华为技术有限公司 Image identity recognition method and related device, and identity recognition system
CN108416310A (en) * 2018-03-14 2018-08-17 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
WO2019109526A1 (en) * 2017-12-06 2019-06-13 平安科技(深圳)有限公司 Method and device for age recognition of face image, storage medium
WO2019119505A1 (en) * 2017-12-18 2019-06-27 深圳云天励飞技术有限公司 Face recognition method and device, computer device and storage medium

Also Published As

Publication number Publication date
CN110674748A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN110674748B (en) Image data processing method, apparatus, computer device, and readable storage medium
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN109815826B (en) Method and device for generating face attribute model
EP3361423A1 (en) Learning system, learning device, learning method, learning program, teacher data creation device, teacher data creation method, teacher data creation program, terminal device, and threshold value changing device
CN111783902A (en) Data augmentation and service processing method and device, computer equipment and storage medium
CN111582342B (en) Image identification method, device, equipment and readable storage medium
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
Bu Human motion gesture recognition algorithm in video based on convolutional neural features of training images
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
Kezebou et al. TR-GAN: Thermal to RGB face synthesis with generative adversarial network for cross-modal face recognition
CN109558814A (en) A kind of three-dimensional correction and weighting similarity measurement study without constraint face verification method
CN107563319A (en) Face similarity measurement computational methods between a kind of parent-offspring based on image
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
Saleem et al. Real-life dynamic facial expression recognition: a review
CN110675312B (en) Image data processing method, device, computer equipment and storage medium
CN117521012A (en) False information detection method based on multi-mode context hierarchical step alignment
CN116758355A (en) Image classification method and device, electronic equipment and storage medium
CN116959098A (en) Pedestrian re-recognition method and system based on dual-granularity tri-modal measurement learning
CN114529944B (en) Human image scene identification method combining human body key point heat map features
CN115115910A (en) Training method, using method, device, equipment and medium of image processing model
Shams et al. Deep belief neural networks for eye localization based speeded up robust features and local binary pattern
Zhang et al. Person gender classification on RGB-D data with self-joint attention
Jalal et al. Suspect face retrieval system using multicriteria decision process and deep learning

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40019520

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant