CN115862054A - Image data processing method, apparatus, device and medium - Google Patents


Info

Publication number
CN115862054A
Authority
CN
China
Prior art keywords
image
recognition model
sample
image recognition
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111123361.1A
Other languages
Chinese (zh)
Inventor
徐稀侠
高英国
鄢科
黄飞跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111123361.1A
Publication of CN115862054A

Landscapes

  • Image Analysis (AREA)

Abstract

An embodiment of the application provides an image data processing method, apparatus, device and medium. The method includes: outputting a first classification result for a sample image through an initial image recognition model, and generating a first activation map according to the first classification result and the sample convolution feature of the sample image; performing data transformation on the sample image to obtain a deformed image, outputting a second classification result for the deformed image through the initial image recognition model, and generating a second activation map according to the second classification result and the deformation convolution feature of the deformed image; determining a similarity loss result according to the first activation map and the second activation map, and determining a classification loss result according to the first classification result, the second classification result and the key point category label of the sample image; and correcting the network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model. The method and device can improve image processing efficiency and the positioning accuracy of the model.

Description

Image data processing method, apparatus, device and medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a medium for processing image data.
Background
Human body posture estimation detects the positions of human joint points and bones in pictures or videos, and has wide application value in fields such as film animation, virtual reality, video surveillance, and action recognition.
In current human body posture estimation algorithms, a model is trained by outputting predicted position information for each human key point and computing a distance loss between the predicted position information and the real position information. This training process depends on annotated data: every picture or video used to train the model must be labeled with the real position of each human key point, and the closer the predicted positions output by the model are to the real positions, the better the recognition effect of the model. The pictures and videos usable for training are therefore limited to those annotated with the real position of each human key point, and such annotation is very time-consuming and labor-intensive, so the processing efficiency of pictures and videos is too low. In addition, because the human body itself is flexible, complex postures may occur in practical applications that do not appear in the pictures or videos used to train the model, so the accuracy with which the trained model predicts human key points is too low.
Disclosure of Invention
The embodiments of the application provide an image data processing method, apparatus, device and medium, which can improve image processing efficiency and improve the positioning accuracy of a model.
An embodiment of the present application provides an image data processing method, including:
obtaining a sample image, outputting a first classification result corresponding to the sample image through an initial image recognition model, and generating a first activation map according to the first classification result and the sample convolution feature of the sample image; the first classification result is determined by the target posture feature corresponding to the sample object in the sample image, and the first activation map is used for representing the position information of key points of the sample object in the sample image;
performing data transformation on the sample image to obtain a deformed image, outputting a second classification result corresponding to the deformed image through the initial image recognition model, and generating a second activation map according to the second classification result and the deformation convolution feature of the deformed image; the second classification result is determined by the deformation posture feature corresponding to the sample object in the deformed image, and the second activation map is used for representing the position information of key points of the sample object in the deformed image;
determining a similarity loss result of the initial image recognition model according to the first activation map and the second activation map, and determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and the key point category label carried by the sample image;
correcting the network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model; the target image recognition model is used for predicting the object component categories and positioning results corresponding to the key points in a source image.
An embodiment of the present application provides an image data processing apparatus, including:
the first generation module is used for obtaining a sample image, outputting a first classification result corresponding to the sample image through an initial image recognition model, and generating a first activation map according to the first classification result and the sample convolution feature of the sample image; the first classification result is determined by the target posture feature corresponding to the sample object in the sample image, and the first activation map is used for representing the position information of key points of the sample object in the sample image;
the second generation module is used for performing data transformation on the sample image to obtain a deformed image, outputting a second classification result corresponding to the deformed image through the initial image recognition model, and generating a second activation map according to the second classification result and the deformation convolution feature of the deformed image; the second classification result is determined by the deformation posture feature corresponding to the sample object in the deformed image, and the second activation map is used for representing the position information of key points of the sample object in the deformed image;
the loss result determining module is used for determining a similarity loss result of the initial image recognition model according to the first activation map and the second activation map, and determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and the key point category label carried by the sample image;
the parameter correction module is used for correcting the network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model; the target image recognition model is used for predicting the object component categories and positioning results corresponding to the key points in a source image.
Wherein, the first generation module includes:
the feature extraction unit is used for inputting the sample image into the initial image recognition model, and obtaining the target posture feature corresponding to the sample object in the sample image according to the initial image recognition model;
the feature classification unit is used for recognizing the target posture feature according to a classifier in the initial image recognition model to obtain the first classification result corresponding to the sample image;
the feature mapping unit is used for obtaining the sample convolution feature output by a target convolution layer in the initial image recognition model for the sample image, and performing a product operation on the first classification result and the sample convolution feature to obtain a candidate activation map corresponding to the sample image;
and the up-sampling processing unit is used for performing up-sampling processing on the candidate activation map to obtain a first activation map with the same image size as the sample image.
Wherein the feature extraction unit includes:
the global classification subunit is used for obtaining, in the initial image recognition model, the global posture feature corresponding to the sample object in the sample image, and outputting a global classification result corresponding to the global posture feature through a classifier in the initial image recognition model;
the block processing subunit is used for performing a product operation on the global classification result and the sample convolution feature to obtain a global map corresponding to the sample image, and performing block processing on the sample image according to the global map to obtain M local area images; M is a positive integer;
the local feature extraction subunit is used for sequentially inputting the M local area images into the initial image recognition model and obtaining the local posture features corresponding to the M local area images in the initial image recognition model;
and the feature combination subunit is used for performing feature combination on the global posture feature and the local posture features corresponding to the M local area images to obtain the target posture feature corresponding to the sample object in the sample image.
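The block processing and feature combination described by these subunits can be illustrated with a short sketch. The following PyTorch-style code is a minimal, hedged example: the backbone and classifier interfaces, the use of a uniform crop grid instead of cropping guided by the global map, and M = 4 are illustrative assumptions rather than the exact procedure of the embodiment.

```python
import torch
import torch.nn.functional as F

def build_target_posture_feature(backbone, classifier, sample_image, M=4):
    """Combine a global posture feature with M local posture features.

    backbone(image) is assumed to return the last convolution feature map
    [B, K, h, w]; classifier is assumed to be a linear layer over K channels.
    """
    conv_feat = backbone(sample_image)                        # sample convolution feature
    global_feat = conv_feat.mean(dim=(2, 3))                  # global posture feature [B, K]
    global_result = classifier(global_feat)                   # global classification result

    # Block processing: the embodiment crops local regions according to the global
    # map; a uniform vertical grid is used here purely as a simplification.
    B, _, H, W = sample_image.shape
    local_feats = []
    for m in range(M):
        top, bottom = m * H // M, (m + 1) * H // M
        crop = sample_image[:, :, top:bottom, :]
        crop = F.interpolate(crop, size=(H, W), mode='bilinear', align_corners=False)
        local_feats.append(backbone(crop).mean(dim=(2, 3)))   # local posture feature

    # Feature combination: concatenate the global feature with the M local features.
    return torch.cat([global_feat] + local_feats, dim=1), global_result
```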
The initial image recognition model comprises N residual components, each residual component comprises one or more convolution layers, and N is a positive integer;
the global classification subunit is specifically configured to:
obtain the input feature of the i-th residual component among the N residual components; when i is 1, the input feature of the i-th residual component is the sample image, and i is a positive integer smaller than N;
perform a convolution operation on the input feature of the i-th residual component according to the one or more convolution layers in the i-th residual component to obtain a candidate convolution feature;
combine the candidate convolution feature and the input feature of the i-th residual component to obtain the residual output feature of the i-th residual component, and take the residual output feature of the i-th residual component as the input feature of the (i + 1)-th residual component; the i-th residual component is connected with the (i + 1)-th residual component;
and determine the residual output feature of the N-th residual component as the global posture feature corresponding to the sample object in the sample image.
There are K global posture features, where K is a positive integer;
the global classification subunit is specifically configured to:
compute the feature average value corresponding to each of the K global posture features, and combine the feature average values corresponding to the K global posture features into a global feature vector;
convert the global feature vector into a feature vector to be classified according to an activation function in the initial image recognition model;
and input the feature vector to be classified into the classifier in the initial image recognition model, and output, through the classifier in the initial image recognition model, the global classification result corresponding to the feature vector to be classified.
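This global-average-pooling and classification step can be sketched as follows. The code is a minimal PyTorch illustration under stated assumptions: a single linear layer is used as the classifier and ReLU as the activation function, neither of which is fixed by the embodiment.

```python
import torch
import torch.nn as nn

class GlobalClassifierHead(nn.Module):
    """Illustrative head: per-channel averaging, activation, then classification."""

    def __init__(self, k_channels: int, num_keypoint_categories: int):
        super().__init__()
        self.fc = nn.Linear(k_channels, num_keypoint_categories)  # assumed classifier

    def forward(self, global_posture_features: torch.Tensor) -> torch.Tensor:
        # global_posture_features: [B, K, h, w]; the mean of each of the K feature
        # maps forms the global feature vector [B, K].
        global_vector = global_posture_features.mean(dim=(2, 3))
        # The activation function converts the vector into the feature vector to be classified.
        to_classify = torch.relu(global_vector)
        # The classifier outputs the global classification result.
        return self.fc(to_classify)
```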
Wherein the loss result determining module includes:
the data transformation unit is used for performing data transformation on the second activation map to obtain a deformation activation map;
and the similarity constraint unit is used for performing a similarity constraint on the first activation map and the deformation activation map, and determining the similarity loss result of the initial image recognition model.
Wherein the loss result determining module includes:
the sample loss determining unit is used for obtaining a first error between the first classification result and the key point category label carried by the sample image, and determining a sample loss result of the initial image recognition model according to the first error;
the deformation loss determining unit is used for obtaining a second error between the second classification result and the key point category label, and determining a deformation loss result of the initial image recognition model according to the second error;
and the classification loss determining unit is used for determining the classification loss result of the initial image recognition model according to the sample loss result and the deformation loss result.
Wherein, parameter correction module includes:
the total loss determining unit is used for determining a model total loss result corresponding to the initial image recognition model according to the similarity loss result and the classification loss result;
and the network parameter adjusting unit is used for correcting the network parameters of the initial image recognition model by performing minimum optimization processing on the total model loss result, and determining the initial image recognition model containing the corrected network parameters as a target image recognition model.
Wherein, the device still includes:
the object component classification module is used for obtaining a source image, obtaining the object posture feature corresponding to a target object in the source image through the target image recognition model, and recognizing the object component classification result corresponding to the object posture feature; the object component classification result is used for representing the object part category corresponding to each key point of the target object;
the object map generation module is used for generating an object position map according to the object component classification result and the object convolution feature of the source image;
and the positioning result determining module is used for obtaining the pixel average value corresponding to the object position map, determining the positioning result of each key point of the target object in the source image according to the pixel average value, and determining the posture estimation result corresponding to the target object in the source image according to the object part category and the positioning result.
Wherein, object part classification module includes:
the global object classification unit is used for inputting a source image into the target image recognition model, acquiring global object characteristics corresponding to a target object in the source image from the target image recognition model, and outputting a global object classification result corresponding to the global object characteristics according to a classifier in the target image recognition model;
the global map generating unit is used for acquiring object convolution characteristics, output by a target convolution layer in the target image recognition model, aiming at the source image, and performing product operation on the global object classification result and the object convolution characteristics to obtain a global object map corresponding to the source image;
the component feature acquisition unit is used for carrying out blocking processing on a source image according to the global object mapping map to obtain M object component area images, and acquiring object component features corresponding to the M object component area images according to a target image recognition model; m is a positive integer;
and the component feature combination unit is used for combining the global object features and the object component features corresponding to the M object component area images into object posture features.
Wherein, the device still includes:
and the auditing module is used for determining that the auditing result of the source image in a content auditing system is a pass result when the posture estimation result is the same as the target object posture specified in the content auditing system, and setting the access right for the content auditing system for the object corresponding to the source image.
In one aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor, where the memory is connected to the processor, the memory is used for storing a computer program, and the processor is used for invoking the computer program, so that the computer device executes a method provided in the above aspect in the embodiment of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program is adapted to be loaded and executed by a processor, so as to enable a computer device with the processor to execute the method provided by the above aspect of the embodiments of the present application.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above-mentioned aspect.
According to the embodiments of the application, the target posture feature in a sample image can be extracted through an initial image recognition model, the target posture feature can be classified and recognized to obtain a first classification result, and a first activation map can be generated based on the first classification result and the sample convolution feature of the sample image; meanwhile, data transformation can be performed on the sample image to obtain a deformed image, the deformation posture feature in the deformed image can be extracted through the initial image recognition model, and a second activation map can be generated according to a second classification result of the deformation posture feature and the deformation convolution feature of the deformed image. A similarity constraint (namely a similarity loss result) can then be applied to the first activation map and the second activation map, so that the target image recognition model obtained through training can improve the positioning accuracy of key points in images. In addition, when the initial image recognition model is trained, the position information of each key point of the sample object in the sample image does not need to be annotated; the sample images can therefore be acquired from a wider range of sources, the key point position annotation work on the sample images can be reduced, and the image processing efficiency can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a human body posture estimation scene according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an image data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of data transformation of a sample image according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating training of an initial image recognition model according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of an image data processing method according to an embodiment of the present application;
fig. 7 is a schematic diagram of a global average pooling process provided by an embodiment of the present application;
FIG. 8 is a training diagram of an initial image recognition model according to an embodiment of the present disclosure;
fig. 9 is a schematic flowchart of an image data processing method according to an embodiment of the present application;
FIG. 10 is a schematic view of a scene of human body posture estimation provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of an image data processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The present application relates to Computer Vision technology (CV). Computer vision is the science of studying how to make machines "see": using cameras and computers in place of human eyes to recognize, track and measure targets, and further performing image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of acquiring information from images or multidimensional data. The application relates to Human Pose Estimation, which belongs to computer vision technology; human body posture estimation is an important task in computer vision and an essential step for a computer to understand human actions and behaviors. It can be converted into a prediction problem over human key points: for example, the position coordinates of each human key point in an image can be predicted, and the human skeleton in the image can be predicted according to the positional relations among the key points.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a server 10d and a user terminal cluster, which may include one or more user terminals, where the number of user terminals is not limited. As shown in fig. 1, the user terminal cluster may specifically include a user terminal 10a, a user terminal 10b, a user terminal 10c, and the like. The server 10d may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform. The user terminal 10a, the user terminal 10b, the user terminal 10c, and the like may each include: smart terminals such as smart phones, tablet computers, notebook computers, palm computers, mobile Internet Devices (MID), wearable devices (e.g., smart watches, smart bracelets, etc.), and smart televisions. As shown in fig. 1, the user terminal 10a, the user terminal 10b, the user terminal 10c, etc. may be respectively connected to a network of the server 10d, so that each user terminal may interact data with the server 10d through the network connection.
The server 10d shown in fig. 1 may obtain a source image that needs to be recognized, and may output an object component category and a positioning result corresponding to each key point of a target object in the source image by using a target image recognition model, and may determine a posture estimation result corresponding to the target object in the source image according to the positioning result and the object component category. The server 10d may also train the initial image recognition model through a large number of sample images carrying the key point category labels, and the initial image recognition model after the training may be referred to as a target image recognition model. Any of the user terminals in the user terminal cluster shown in fig. 1 may be used to present the pose estimation result of the source image.
The initial image recognition model can be an image recognition model whose training has not yet been completed; the image recognition model can be used for estimating the human body posture in an image or a video, and the target image recognition model can be the initial image recognition model after training is finished. The source image and the sample image can both be human body images, and the target object in the source image and the sample object in the sample image can both be a human body, various animals, and the like; the key points can be the joint points of the human body. It should be noted that a sample image in the present application may be a human body image carrying only a key point category label. That is, when the initial image recognition model is trained, the training on key point categories can be regarded as supervised learning (each sample image has a label indicating its real key point categories), while the training on key point positions can be regarded as unsupervised learning (no sample image carries a real position label), so the training of the entire initial image recognition model can be regarded as weakly supervised learning, i.e., the sample images carry only partial label information (the key point category label). Optionally, the human body posture estimation of a picture or video may be performed by the server 10d, or by any one of the user terminals shown in fig. 1.
Referring to fig. 2, fig. 2 is a schematic view of a human body posture estimation scene according to an embodiment of the present disclosure. Taking the application of human body posture estimation to a behavior recognition scenario as an example, as shown in fig. 2, a server (e.g., the server 10d shown in fig. 1) may acquire a video 20a, where the video 20a may refer to a human body motion captured by a capturing device, or may refer to a behavior video directly downloaded from the internet, where the capturing device may refer to a different type of video camera or camera, etc. The server may perform framing processing on the video 20a to obtain multi-frame picture data, for example, extract a picture 20b, a picture 20c, and a picture 20d from the video 20a, where the picture 20b, the picture 20c, and the picture 20d may be a series of actions of human behaviors.
The server may obtain an image recognition model 20e, where the image recognition model 20e may refer to a human body pose estimation model trained in advance, and the image recognition model 20e may be used to predict the category and the positioning result of each human body key point in the video 20 a. The image recognition model 20e may be a convolutional neural network model, and the type of the image recognition model 20e is not limited in the present application; the training process of the image recognition model 20e can be seen in the embodiment corresponding to fig. 3 described below.
The server may sequentially input the picture 20b, the picture 20c, and the picture 20d into the image recognition model 20e. The image recognition model 20e may obtain the pose feature 1 corresponding to the picture 20b, the pose feature 2 corresponding to the picture 20c, and the pose feature 3 corresponding to the picture 20d, and may further input the pose features corresponding to the respective pictures into the classifier 20f associated with the image recognition model 20e; the classifier 20f may in turn output the classification result 1 corresponding to the pose feature 1, the classification result 2 corresponding to the pose feature 2, and the classification result 3 corresponding to the pose feature 3. For the classification result 1 corresponding to the picture 20b, the server may multiply the classification result 1 with the convolution feature output by the last convolution layer in the image recognition model 20e to generate a feature image 20g, which may be referred to as a Class Activation Map (CAM) corresponding to the picture 20b; a CAM is a tool for visualizing image features. In other words, the classification result 1 may be used as the weights of the convolution feature output by the last convolution layer in the image recognition model 20e, and weighting that convolution feature with the classification result 1 yields the feature image 20g; the feature image 20g is a visualization of the convolution feature output by the last convolution layer and can be used to characterize the image pixel region that the image recognition model 20e attends to (for example, the region 20p in the feature image 20g). Similarly, for the classification result 2 corresponding to the picture 20c and the classification result 3 corresponding to the picture 20d, the server may obtain the feature image 20h corresponding to the picture 20c and the feature image 20i corresponding to the picture 20d through the same operations.
The server may calculate a positioning result of each human body key point in the picture 20b according to the pixel average value of the feature image 20g, and may obtain a posture estimation result 20j corresponding to the picture 20b based on the classification result 1 and the positioning result of each human body key point in the picture 20b, where the posture estimation result 20j may be used to represent a human body skeleton in the picture 20 b. Similarly, the server may calculate a positioning result of each human body key point in the picture 20c according to the pixel average value of the feature image 20h, and may obtain a posture estimation result 20k corresponding to the picture 20c based on the classification result 2 and the positioning result of each human body key point in the picture 20c, where the posture estimation result 20k may be used to represent a human body skeleton in the picture 20 c; according to the pixel average value of the feature image 20i, the positioning result of each human body key point in the picture 20d is calculated, and based on the classification result 3 and the positioning result of each human body key point in the picture 20d, a posture estimation result 20m corresponding to the picture 20d can be obtained, and the posture estimation result 20m can be used for representing a human body skeleton in the picture 20 d. The server may determine, according to the attitude estimation result 20j, the attitude estimation result 20k, and the attitude estimation result 20m, that the behavior recognition result corresponding to the video 20a is: run 20n.
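The step of turning a feature image into key point positioning results can be sketched as follows. The embodiment only states that the pixel average value of the feature image is used; thresholding each per-key-point map at its mean and taking the weighted centroid of the remaining pixels is one plausible reading, shown here as a hedged PyTorch example.

```python
import torch

def locate_keypoints(activation_maps: torch.Tensor) -> torch.Tensor:
    # activation_maps: [C, H, W], one map per key point category (e.g. per joint).
    C, H, W = activation_maps.shape
    coords = torch.zeros(C, 2)
    for c in range(C):
        m = activation_maps[c]
        mask = (m >= m.mean()).float()        # keep pixels above the pixel average value
        ys, xs = torch.nonzero(mask, as_tuple=True)
        weights = m[ys, xs]
        coords[c, 0] = (xs.float() * weights).sum() / weights.sum()   # x coordinate
        coords[c, 1] = (ys.float() * weights).sum() / weights.sum()   # y coordinate
    return coords
```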
In the embodiment of the application, by performing human body posture estimation on each picture frame in the video 20a, a posture estimation result corresponding to each picture frame can be obtained, and then a behavior recognition result of the video 20a can be determined based on the posture estimation result corresponding to each picture frame; in other words, the human body posture estimation result is an indispensable step in a behavior recognition scene, and the effect of the human body posture estimation directly affects the accuracy of the behavior recognition.
Referring to fig. 3, fig. 3 is a schematic flowchart of an image data processing method according to an embodiment of the present disclosure. It is understood that the image data processing method may be executed by a computer device, which may be a server (e.g., the server 10d in the embodiment corresponding to fig. 1), or a user terminal (e.g., any one of the user terminals in the user terminal cluster shown in fig. 1), or a computer program (including program code); as shown in fig. 3, the image data processing method may include the following steps S101 to S104:
step S101, obtaining a sample image, outputting a first classification result corresponding to the sample image through an initial image recognition model, and generating a first activation mapping chart according to the first classification result and the sample convolution characteristics of the sample image; the first classification result is determined by target posture characteristics corresponding to the sample object in the sample image, and the first activation map is used for representing the position information of key points of the sample object in the sample image.
Specifically, the computer device may obtain sample images used for training the initial image recognition model. A sample image may be a grayscale image or an RGB image (the RGB color mode is a color standard in which R, G and B represent the colors of the three channels red, green and blue). There may be multiple sample images, each sample image may contain a sample object, and each sample image may carry a key point category label, which is used to represent the category corresponding to each key point of the sample object contained in the sample image. The sample object contained in a sample image may include, but is not limited to: humans and animals (e.g., monkeys, chimpanzees, dogs, etc.). The key points of a sample object may be the joints of a human body or an animal; different types of sample objects may correspond to different key points, and each type of sample object may correspond to a specific number of key points. For example, when the sample image is a human body image, the key points of the sample object contained in the sample image may include 18 joints such as the head, shoulders, upper limbs, and lower limbs, and connecting these key points can describe the human body posture in the sample image; the key point category labels carried by the sample image may include human head joints, human upper-limb joints, human lower-limb joints, and the like.
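A hypothetical illustration of such a weak label is shown below; the field names and file name are assumptions made for the sake of the example, not a format defined by the embodiment. The key point is that only categories are annotated, never positions.

```python
# Hypothetical weak annotation for one sample image: key point categories only,
# no (x, y) position labels.
keypoint_category_label = {
    "image": "sample_0001.jpg",   # assumed file name
    "categories": ["head_joint", "upper_limb_joint", "lower_limb_joint"],
}
```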
The computer device can obtain an initialized image recognition model, namely the initial image recognition model. In the process of training the initial image recognition model, a single sample image or a batch of sample images can be input in each training iteration. For any one of the sample images, the computer device may input the sample image into the initial image recognition model and obtain, through the initial image recognition model, the target posture feature corresponding to the sample object in the sample image, where the target posture feature can be used to describe the posture of the sample object in the sample image; the target posture feature is recognized through the classifier of the initial image recognition model to obtain the first classification result corresponding to the target posture feature; and the first activation map is generated according to the first classification result and the sample convolution feature corresponding to the sample image, where the first activation map can be used to represent the position information of each key point of the sample object in the sample image. The sample convolution feature refers to the convolution feature for the sample image output by a target convolution layer in the initial image recognition model, and the target convolution layer may be the last convolution layer in the initial image recognition model.
The initial image recognition model may be a convolutional neural network, or a combined network model of a convolutional neural network and other neural networks (e.g., a recurrent neural network). The initial image recognition model may include, but is not limited to: MobileNet V1 (a lightweight convolutional neural network), MobileNet V2 (an improvement of MobileNet V1, also a lightweight convolutional neural network), PoseNet (a visual localization model that can locate the pose information of a human body from a color image), ResNet (a residual neural network), DenseNet (a dense convolutional neural network), LSTM (Long Short-Term Memory network), RNN (recurrent neural network), GRU (Gated Recurrent Unit), or a combined model of any one or more of the above networks. The initial image recognition model may also be a network model designed based on actual requirements; for example, it may be designed to extract posture features of multiple sizes (i.e., the above target posture feature may include posture features of different sizes), so that the initial image recognition model can better characterize objects of different sizes. The type of the initial image recognition model is not limited in the present application.
Optionally, after obtaining the target posture feature corresponding to the sample object in the sample image according to the initial image recognition model and recognizing the first classification result corresponding to the target posture feature, the computer device may obtain the sample convolution feature output by the target convolution layer in the initial image recognition model for the sample image, and perform a product operation on the first classification result and the sample convolution feature to obtain a candidate activation map corresponding to the sample image. The target convolution layer may be the last convolution layer in the initial image recognition model; since the size of the sample convolution feature output by the last convolution layer is smaller than the size of the sample image, i.e., the size of the candidate activation map is smaller than the size of the sample image, the candidate activation map may be up-sampled to obtain a first activation map having the same image size as the sample image. The first classification result may include the probability that each key point of the sample object belongs to each category; these probability values can be regarded as the weights of the sample convolution feature output by the target convolution layer, and weighting the sample convolution feature with the first classification result visualizes the pixel region that the initial image recognition model attends to. The first activation map may be a Class Activation Map (CAM). Optionally, the first activation map may also be an image obtained by superimposing the result of the up-sampling processing on the sample image.
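A minimal sketch of this product operation and up-sampling is given below. It follows the standard class-activation-map formulation, in which the classifier's weight vector for each key point category serves as the channel weights; whether the embodiment uses the classifier weights or the output probabilities directly is not fixed here, so this is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def first_activation_map(sample_conv_feat: torch.Tensor,
                         classifier_weight: torch.Tensor,
                         image_size: tuple) -> torch.Tensor:
    # sample_conv_feat: [B, K, h, w], output of the target (last) convolution layer.
    # classifier_weight: [C, K], one weight vector per key point category.
    candidate = torch.einsum('ck,bkhw->bchw', classifier_weight, sample_conv_feat)
    # Up-sample the candidate activation map to the same size as the sample image.
    return F.interpolate(candidate, size=image_size, mode='bilinear',
                         align_corners=False)
```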
Step S102, performing data transformation on the sample image to obtain a deformed image, outputting a second classification result corresponding to the deformed image through the initial image recognition model, and generating a second activation map according to the second classification result and the deformation convolution feature of the deformed image; the second classification result is determined by the deformation posture feature corresponding to the sample object in the deformed image, and the second activation map is used for representing the position information of key points of the sample object in the deformed image.
Specifically, the computer device may perform data transformation on the sample image to obtain a deformed image corresponding to the sample image. There may be one or more deformed images, and different deformed images may be obtained by performing different data transformations on the sample image. The data transformations may include perspective transformation (also referred to as projection transformation) and affine transformation, and may include, but are not limited to: rotation, translation, scaling, shearing, reflection, and any combination of the above transformations in any order and number of times.
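These transformations can be produced, for example, with torchvision; the following sketch is illustrative, and the particular angles, translation offsets and scale factors are assumptions rather than values specified by the embodiment.

```python
import torchvision.transforms.functional as TF

def make_deformed_images(sample_image):
    # sample_image: tensor of shape [B, 3, H, W]
    return [
        TF.affine(sample_image, angle=30.0, translate=[0, 0], scale=1.0, shear=0.0),    # rotation
        TF.affine(sample_image, angle=0.0, translate=[20, 10], scale=0.8, shear=0.0),   # translation + scaling
        TF.affine(sample_image, angle=0.0, translate=[0, 0], scale=1.0, shear=15.0),    # shearing
        TF.hflip(sample_image),                                                         # reflection
    ]
```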
For the one or more deformed images obtained by data transformation, the computer device can input them into the initial image recognition model in turn. The deformation posture feature corresponding to the sample object in each deformed image can be obtained through the initial image recognition model; the deformation posture feature is recognized through the classifier of the initial image recognition model to obtain the second classification result corresponding to the deformed image, and the second activation map is generated based on the second classification result and the deformation convolution feature of the deformed image. In other words, the processing of a deformed image by the computer device using the initial image recognition model is the same as the processing of the sample image, and is not repeated here. When the computer device performs data transformation on the sample image in one transformation mode, one deformed image corresponding to the sample image is obtained, and one second activation map corresponding to that deformed image can be generated through the initial image recognition model; when the computer device performs data transformation on the sample image in multiple transformation modes, multiple deformed images corresponding to the sample image are obtained, and the second activation maps corresponding to the multiple deformed images can be generated through the initial image recognition model, that is, each deformed image corresponds to one second activation map.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating data transformation of a sample image according to an embodiment of the present disclosure. As shown in fig. 4, after the computer device acquires the sample image 30a, it may perform translation and reduction operations on the sample object in the sample image 30a to obtain a deformed image 30b corresponding to the sample image 30a; alternatively, by performing a reflection operation on the sample image 30a, a deformed image 30c corresponding to the sample image 30a can be obtained; alternatively, by performing a rotation operation on the sample image 30a, a deformed image 30d corresponding to the sample image 30a can be obtained; alternatively, by performing a scaling operation on the sample image 30a, a deformed image 30e corresponding to the sample image 30a can be obtained. The computer device may sequentially input the deformed image 30b, the deformed image 30c, the deformed image 30d, and the deformed image 30e into the initial image recognition model, and the initial image recognition model may generate a second activation map corresponding to each of the deformed images.
Step S103, determining a similarity loss result of the initial image recognition model according to the first activation map and the second activation map, and determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and the key point category label carried by the sample image.
Specifically, the computer device may determine the similarity loss result of the initial image recognition model according to the first activation map corresponding to the sample image and the second activation map corresponding to the deformed image obtained by data transformation. The determination of the similarity loss result may include: the computer device performs the same data transformation on the second activation map to obtain a deformation activation map, and then performs a similarity constraint (also called a consistency constraint) on the first activation map and the deformation activation map to determine the similarity loss result of the initial image recognition model. It can be understood that the sample object in the sample image and the sample object in each deformed image are the same object, so in theory the first activation map and the deformation activation map corresponding to the sample image should carry the same position information. Therefore, when the initial image recognition model is trained, the similarity constraint on the first activation map and the deformation activation map lets the model learn key point position information that is invariant to geometric changes, which improves the positioning accuracy of the model.
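A hedged sketch of this consistency constraint follows; it applies the same data transformation to the second activation map and compares the result with the first activation map, using mean squared error as an assumed choice of distance.

```python
import torch
import torch.nn.functional as F

def similarity_loss(first_map: torch.Tensor,
                    second_map: torch.Tensor,
                    apply_same_transform) -> torch.Tensor:
    # first_map, second_map: [B, C, H, W] activation maps for the sample image and
    # the deformed image; apply_same_transform reproduces the data transformation
    # that produced the deformed image.
    deformation_map = apply_same_transform(second_map)    # deformation activation map
    return F.mse_loss(first_map, deformation_map)          # assumed distance measure
```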
Since the sample image carries the key point category label corresponding to the sample object, and the sample object contained in the sample image and in the deformed image is the same, the real categories of the key points in the sample image and in the deformed image are also the same, namely the key point category label carried by the sample image. The computer device may obtain a first error between the first classification result and the key point category label, and determine the sample loss result of the initial image recognition model according to the first error, where the first error may be the distance between the first classification result and the key point category label; similarly, a second error between the second classification result and the key point category label may be obtained, and the deformation loss result of the initial image recognition model may be determined according to the second error, where the second error may be the distance between the second classification result and the key point category label; the classification loss result of the initial image recognition model is then determined according to the sample loss result and the deformation loss result. For example, the sample loss result L1 and the deformation loss result L2 may be added, and the added result (L1 + L2) may be used as the classification loss result of the initial image recognition model; or a coefficient a may be set for the sample loss result L1 and a coefficient b for the deformation loss result L2, and (a × L1 + b × L2) may be used as the classification loss result of the initial image recognition model. The present application does not limit the form in which the sample loss result and the deformation loss result are combined.
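The following sketch illustrates this combination. Treating the key point category label as a multi-hot vector and using binary cross-entropy as the distance are assumptions; the embodiment only requires some distance between each classification result and the label, weighted by the optional coefficients a and b.

```python
import torch
import torch.nn.functional as F

def classification_loss(first_result: torch.Tensor,
                        second_result: torch.Tensor,
                        keypoint_category_label: torch.Tensor,
                        a: float = 1.0, b: float = 1.0) -> torch.Tensor:
    # keypoint_category_label: multi-hot vector of the categories present in the image.
    sample_loss = F.binary_cross_entropy_with_logits(first_result, keypoint_category_label)        # L1
    deformation_loss = F.binary_cross_entropy_with_logits(second_result, keypoint_category_label)  # L2
    return a * sample_loss + b * deformation_loss
```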
Step S104, correcting the network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model; the target image recognition model is used for predicting the object component category and the positioning result corresponding to the key points in the source image.
Specifically, the computer device may modify a network parameter of the initial image recognition model based on the similarity loss result and the classification loss result, and determine the trained initial image recognition model as the target image recognition model.
Optionally, the computer device may determine a model total loss result corresponding to the initial image recognition model according to the similarity loss result and the classification loss result. The model total loss result may be the sum of the similarity loss result and the classification loss result, or the result of multiplying each of them by a corresponding coefficient before adding. The network parameters of the initial image recognition model are corrected by performing minimization on the model total loss result, and the initial image recognition model containing the corrected network parameters is determined as the target image recognition model, i.e., the initial image recognition model after training. In other words, by minimizing the model total loss result, the network parameters of the initial image recognition model are adjusted continuously; when the number of training iterations of the initial image recognition model reaches a preset maximum number of iterations, or the training of the initial image recognition model converges, the network parameters at that time can be saved, and the initial image recognition model containing those network parameters is determined as the target image recognition model. The target image recognition model may be used to predict the object component categories and positioning results corresponding to the key points in a source image, and the object component categories and positioning results may be used to determine the posture estimation result (e.g., the posture estimation result 20j in the embodiment corresponding to fig. 2) corresponding to the target object in the source image (e.g., the picture 20b in the embodiment corresponding to fig. 2).
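One training iteration under this scheme can be sketched as follows, reusing the classification_loss and similarity_loss sketches above. The model interface (returning both a classification result and an activation map), the equal weighting of the two losses, and the choice of optimizer are assumptions made for illustration.

```python
def train_step(model, optimizer, sample_image, deformed_image,
               keypoint_category_label, apply_same_transform):
    optimizer.zero_grad()
    first_result, first_map = model(sample_image)      # assumed model interface
    second_result, second_map = model(deformed_image)
    # Model total loss result: classification loss plus similarity loss.
    total_loss = (classification_loss(first_result, second_result, keypoint_category_label)
                  + similarity_loss(first_map, second_map, apply_same_transform))
    total_loss.backward()     # minimization of the model total loss result
    optimizer.step()          # correct the network parameters
    return total_loss.item()
```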
Referring to fig. 5, fig. 5 is a schematic diagram illustrating training of an initial image recognition model according to an embodiment of the present disclosure. As shown in fig. 5, after obtaining the sample image 40a, the computer device may input the sample image 40a into a residual network 40b (the residual network 40b may be a ResNet network and can be regarded as the above-mentioned initial image recognition model), obtain a target posture feature 40c through the residual network 40b, and recognize the target posture feature 40c to obtain a first classification result 40d corresponding to the sample image 40a; the first classification result 40d and the sample convolution feature output by the last convolution layer in the residual network 40b are multiplied to obtain a feature image 40e (i.e., the first activation map). There may be one or more feature images 40e, the same number as the key point categories of the sample object contained in the sample image 40a; that is, the feature image 40e may include the activation maps corresponding to the respective key point categories in the sample image 40a. The computer device may determine a sample loss result 40g for the residual network 40b from a first difference between the first classification result 40d and the key point category label 40f carried by the sample image 40a.
Further, the computer device may rotate the sample object in the sample image 40a to obtain a deformed image 40h, and may then input the deformed image 40h into the residual network 40b, obtain a deformed posture feature 40i through the residual network 40b, and recognize the deformed posture feature 40i to obtain a second classification result 40j corresponding to the deformed image 40h; the second classification result 40j and the deformation convolution feature output by the last convolution layer in the residual network 40b are multiplied to obtain a feature image 40k (i.e., the second activation map), where the feature image 40k may include the activation maps corresponding to the key point categories in the deformed image 40h. The computer device may determine a deformation loss result 40m for the residual network 40b based on a second difference between the second classification result 40j and the key point category label 40f.
The computer device may also apply the similarity constraint to the feature image 40e and the feature image 40k and determine a similarity loss result 40n of the residual network 40b, determine the model total loss result of the residual network 40b from the sample loss result 40g, the deformation loss result 40m and the similarity loss result 40n, and continuously train the network parameters of the residual network 40b by minimizing the model total loss result until the number of training iterations reaches the preset maximum number of iterations (or the training converges), thereby obtaining the trained target image recognition model.
In the embodiment of the application, the target posture feature in a sample image is extracted through an initial image recognition model, the target posture feature is classified and recognized to obtain a first classification result, and a first activation map is generated based on the first classification result and the sample convolution feature of the sample image; meanwhile, data transformation can be performed on the sample image to obtain a deformed image, the deformation posture feature in the deformed image can be extracted through the initial image recognition model, and a second activation map can be generated according to a second classification result of the deformation posture feature and the deformation convolution feature of the deformed image. A similarity constraint (namely a similarity loss result) can then be applied to the first activation map and the second activation map, so that the target image recognition model obtained through training can improve the positioning accuracy of key points in images. In addition, when the initial image recognition model is trained, the position information of each key point of the sample object in the sample image does not need to be annotated; the sample images can therefore be acquired from a wider range of sources, the key point position annotation work on the sample images can be reduced, and the image processing efficiency can be improved.
Referring to fig. 6, fig. 6 is a schematic flowchart of an image data processing method according to an embodiment of the present disclosure. It is understood that the image data processing method may be executed by a computer device, which may be a server, or a user terminal, or a computer program (including program code); as shown in fig. 6, the image data processing method may include the following steps S201 to S208:
Step S201, obtaining a sample image, obtaining, in an initial image recognition model, a global posture feature corresponding to a sample object in the sample image, and outputting, through a classifier in the initial image recognition model, a global classification result corresponding to the global posture feature.
Specifically, after the computer device acquires the sample image, the sample image can be input into the initial image recognition model, and the global posture feature corresponding to the sample object in the sample image can be obtained through the initial image recognition model; the global posture feature can be used for describing the overall posture of the sample object in the sample image. The global posture feature is then recognized through the classifier of the initial image recognition model to obtain the global classification result corresponding to the global posture feature.
Taking the initial image recognition model as a residual network (ResNet) as an example, the initial image recognition model may include N residual components, each of which may include one or more convolutional layers, where N is a positive integer, e.g., N may take a value of 1, 2, and so on. The above global posture feature extraction process may include: the computer device may obtain an input feature of an ith residual component of the N residual components, where, when i is 1, the input feature of the ith residual component may be the sample image, and i may be a positive integer smaller than N. Optionally, the initial image recognition model may further include one or more independent convolutional layers before the N residual components, in which case the input feature of the 1st residual component (i is 1) may be the convolutional feature output after the sample image passes through the one or two independent convolutional layers of the initial image recognition model.
The computer device may perform a convolution operation on the input feature of the ith residual component through the one or more convolution layers in the ith residual component to obtain a candidate convolution feature, combine the candidate convolution feature with the input feature of the ith residual component (for example, by feature addition) to obtain the residual output feature of the ith residual component, take the residual output feature of the ith residual component as the input feature of the (i+1)th residual component, and determine the residual output feature of the Nth residual component as the global posture feature corresponding to the sample object in the sample image; the ith residual component is connected with the (i+1)th residual component. Optionally, when the size of the candidate convolution feature is not consistent with the size of the input feature of the ith residual component, a linear transformation may be performed on the input feature of the ith residual component so that the size of the transformed feature is the same as that of the candidate convolution feature, and the transformed feature and the candidate convolution feature may then be added to obtain the residual output feature of the ith residual component. In other words, the N residual components in the initial image recognition model are connected in sequence, the residual output feature of the previous residual component (e.g., the ith residual component) can be used as the input feature of the next residual component (the (i+1)th residual component), and the residual output feature of the last residual component (the Nth residual component) can finally be used as the global posture feature corresponding to the sample object in the sample image.
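As an illustration only, the following PyTorch sketch shows one possible form of a single residual component as described above; the specific layer sizes, the use of batch normalization, and the ReLU placement are assumptions not taken from the patent.

import torch.nn as nn

class ResidualComponent(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # One or more convolution layers producing the candidate convolution feature.
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Linear transformation of the input feature when its size does not match
        # the candidate convolution feature.
        self.project = None
        if stride != 1 or in_channels != out_channels:
            self.project = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        candidate = self.convs(x)                              # candidate convolution feature
        identity = self.project(x) if self.project is not None else x
        return self.relu(candidate + identity)                 # residual output feature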
Optionally, the number of global posture features is K, where K is a positive integer and may be regarded as the number of channels of the global posture feature; for example, K may take a value of 1, 2, and so on. The computer device may compute the feature average value corresponding to each of the K global posture features and combine the K feature average values into a global feature vector; the global feature vector is then converted into a feature vector to be classified according to an activation function in the initial image recognition model, the feature vector to be classified is input into the classifier of the initial image recognition model, and the global classification result corresponding to the feature vector to be classified is output through the classifier of the initial image recognition model. In other words, the computer device may perform global average pooling on the K global posture features, converting each global posture feature into a single numerical value, that is, converting the K global posture features into a K-dimensional global feature vector; the global feature vector is then passed through the sigmoid activation function and input into the classifier to obtain the global classification result. Spatial information in the sample image can be retained through global average pooling, which improves the positioning accuracy of the model.
Referring to fig. 7, fig. 7 is a schematic diagram of a global average pooling process according to an embodiment of the present application. As shown in fig. 7, the size of the global pose feature 50a can be expressed as 4 × 4 × 3, i.e., the number K of global pose features 50a may be 3, the width W may be 4, and the height H may be 4. By performing global average pooling on the global pose feature 50a, a global feature vector 50b can be obtained, and the size of the global feature vector 50b can be expressed as 1 × 1 × 3; that is, after the feature average is computed, each 4 × 4 feature map in the global pose feature 50a is converted into a single numerical value (a 1 × 1 value).
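The sketch below reproduces this pooling-and-classification step with the Fig. 7 dimensions (K = 3 feature maps of size 4 × 4). Treating the classifier as a single linear layer with one output per key point category is an assumption made only for illustration.

import torch
import torch.nn as nn

K, num_keypoint_classes = 3, 3
global_pose_feat = torch.randn(1, K, 4, 4)                   # [batch, K, H, W]

gap = nn.AdaptiveAvgPool2d(1)                                # global average pooling
classifier = nn.Linear(K, num_keypoint_classes)              # assumed classifier head

global_feature_vector = gap(global_pose_feat).flatten(1)     # [1, K], one average per feature map
to_classify = torch.sigmoid(global_feature_vector)           # activation function
global_classification_result = classifier(to_classify)       # [1, num_keypoint_classes]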
Step S202, performing product operation on the global classification result and the sample convolution characteristics to obtain a global mapping image corresponding to the sample image, and performing block processing on the sample image according to the global mapping image to obtain M local area images; m is a positive integer.
Specifically, the computer device may perform a product operation on the global classification result and the sample convolution feature of the sample image to obtain a global map corresponding to the sample image, where the global map may include a Class Activation Map (CAM) corresponding to each key point of the sample object in the sample image; the generation process of the global map may refer to the generation process of the first activation map in step S101 in the embodiment corresponding to fig. 3, and details are not repeated here.
Further, the computer device may use the class activation maps of the respective key points as prior information of the region positions, perform blocking processing (cropping) on the sample image, and obtain M local area images, where M may be a positive integer, e.g., M may take a value of 1, 2, and so on. In other words, the sample image can be cropped according to the global map to obtain the local area images corresponding to the respective components.
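A sketch of one plausible way to use the class activation maps as region-position priors is given below: a fixed-size window is cropped around each map's activation peak. The crop rule and the window size are assumptions; the patent does not fix how the blocking is performed.

import torch

def crop_local_regions(sample_img, global_map, crop_size=64):
    # sample_img: [3, H, W]; global_map: [M, H, W], one activation map per key point category.
    # Assumes the image is at least crop_size pixels in each dimension.
    _, H, W = sample_img.shape
    regions = []
    for cam in global_map:
        idx = torch.argmax(cam)                      # peak activation position (flattened index)
        cy, cx = int(idx // W), int(idx % W)
        half = crop_size // 2
        top = min(max(cy - half, 0), H - crop_size)
        left = min(max(cx - half, 0), W - crop_size)
        regions.append(sample_img[:, top:top + crop_size, left:left + crop_size])
    return regions                                   # M local area images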
Step S203, sequentially inputting the M local area images to the initial image recognition model, and obtaining local pose features respectively corresponding to the M local area images in the initial image recognition model.
Specifically, the computer device may input the cropped M local area images to the initial image recognition model again, and may obtain a finer-grained feature through the initial image recognition model, that is, a local pose feature corresponding to each local area image, where the local pose feature may be used to express a pose of each component of the sample object included in the sample image. The process of processing a single local area image by using the initial image recognition model may refer to the process of processing the sample image in step S201, which is not described herein again.
Step S204, performing feature combination on the global attitude feature and the local attitude features corresponding to the M local area images to obtain a target attitude feature corresponding to the sample object in the sample image.
Specifically, the computer device may perform feature combination on the global pose feature learned by the initial image recognition model and the M local pose features, for example, the global pose feature is spliced with the M local pose features to obtain a target pose feature corresponding to the sample object in the sample image; the target pose feature herein may include both the local pose features of the respective components of the sample object and the global pose features of the sample object. By introducing the block learning based on the component perception into the initial image recognition model, the fine granularity of the target posture characteristic can be enhanced, and the positioning accuracy of the model can be further improved.
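A minimal sketch of the feature combination (splicing) in step S204 follows, assuming the global and local pose features have already been pooled into vectors; the 512-dimensional size, the value M = 4, and the use of simple channel-wise concatenation are assumptions made for illustration.

import torch

global_pose_vec = torch.randn(1, 512)                          # pooled global pose feature (assumed size)
local_pose_vecs = [torch.randn(1, 512) for _ in range(4)]      # M = 4 pooled local pose features

# Splice the global feature with the M local features along the channel dimension.
target_pose_feature = torch.cat([global_pose_vec, *local_pose_vecs], dim=1)   # [1, 512 * 5]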
Step S205, identifying the target posture characteristic according to a classifier in the initial image identification model to obtain a first classification result corresponding to the sample image; and generating a first activation mapping map according to the first classification result and the sample convolution characteristics of the sample image.
Step S206, carrying out data transformation on the sample image to obtain a deformed image, outputting a second classification result corresponding to the deformed image through the initial image identification model, and generating a second activation mapping image according to the second classification result and the deformed convolution characteristic of the deformed image; the second classification result is determined by deformation posture characteristics corresponding to the sample object in the deformation image, and the second activation map is used for representing the position information of the key points of the sample object in the deformation image.
Step S207, determining a similarity loss result of the initial image recognition model according to the first activation mapping chart and the second activation mapping chart, and determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and the key point class label carried by the sample image.
Step S208, correcting the network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model; the target image recognition model is used for predicting the object component category and the positioning result corresponding to the key points in the source image.
The specific implementation manner of step S205 to step S208 may refer to step S101 to step S104 in the embodiment corresponding to fig. 3, which is not described herein again. Optionally, in this embodiment of the application, the classification loss result of the initial image recognition model may include a global loss result in addition to the sample loss result and the deformation loss result; wherein the global loss result may be determined by a third difference between the global classification result and the keypoint class label.
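As a small extension of the earlier training sketch, and again only as an assumed loss form, the classification loss with the optional global loss term could be written as follows, using a multi-label binary cross-entropy for each of the three differences.

import torch.nn.functional as F

def classification_loss(first_result, second_result, global_result, keypoint_labels):
    # Sample loss: first difference, between the first classification result and the label.
    sample_loss = F.binary_cross_entropy_with_logits(first_result, keypoint_labels)
    # Deformation loss: second difference, for the deformed image's classification result.
    deform_loss = F.binary_cross_entropy_with_logits(second_result, keypoint_labels)
    # Global loss: third difference, for the global classification result.
    global_loss = F.binary_cross_entropy_with_logits(global_result, keypoint_labels)
    return sample_loss + deform_loss + global_loss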
Referring to fig. 8, fig. 8 is a schematic diagram illustrating training of an initial image recognition model according to an embodiment of the present disclosure. As shown in fig. 8, block learning based on component perception is added on the basis of the network structure of the initial image recognition model shown in fig. 5; the embodiment of the present application only describes the newly added block learning based on component perception, and details of the network structure that are the same as those shown in fig. 5 are not repeated. After the computer device obtains the global pose feature 60c corresponding to the sample image 60a through the residual network 60b, the global pose feature 60c can be further processed by global average pooling and an activation function, the processed result is classified to obtain a global classification result, and the global classification result is multiplied by the sample convolution feature output by the last convolution layer to obtain a global map. The sample image 60a may be further subjected to blocking processing according to the global map to obtain M local area images 60p, which are sequentially input into the residual network 60b, and local pose features 60q corresponding to the M local area images may be obtained through the residual network 60b. The M local pose features 60q and the global pose feature 60c of the sample image 60a are feature-combined to obtain a target pose feature, a first classification result 60d may be obtained by recognizing the target pose feature, and a feature image 60e (i.e., the first activation map) may be obtained by performing a product operation on the first classification result 60d and the sample convolution feature output by the last convolution layer in the residual network 60b. This first activation map is more sensitive to the component positions in the sample image 60a; the subsequent processing is as described in the embodiment corresponding to fig. 5.
In the embodiment of the application, target posture characteristics in a sample image are extracted through an initial image recognition model, a first classification result of the target posture characteristics is obtained through classification and recognition of the target posture characteristics, and a first activation mapping chart is generated based on the first classification result and sample convolution characteristics of the sample image; meanwhile, data transformation can be carried out on the sample image to obtain a deformed image, deformation posture characteristics in the deformed image are extracted through the initial image recognition model, and a second activation mapping chart is generated according to a second classification result of the deformation posture characteristics and deformation convolution characteristics of the deformed image; furthermore, similarity constraint (namely a similarity loss result) can be applied to the first activation mapping chart and the second activation mapping chart, and the positioning accuracy of key points in the images can be improved by the target image recognition model obtained through training; in addition, when the initial image recognition model is trained, the position information of each key point of the sample object in the sample image does not need to be marked, namely the sample image has wider acquisition source, the marking operation of the key point position of the sample image can be reduced, and the image processing efficiency can be further improved; the part sensing-based block learning is introduced into the initial image recognition model, so that the characteristics with finer granularity can be learned, the first activation mapping chart which is more sensitive to the position of the part can be further obtained, and the positioning accuracy of the model can be further improved.
Referring to fig. 9, fig. 9 is a schematic flowchart of an image data processing method according to an embodiment of the present disclosure. It is understood that the image data processing method may be executed by a computer device, which may be a server, or a user terminal, or a computer program (including program code); as shown in fig. 9, the image data processing method may include the following steps S301 to S303:
Step S301, acquiring a source image, acquiring an object posture feature corresponding to a target object in the source image through a target image recognition model, and recognizing an object component classification result corresponding to the object posture feature; the object component classification result is used for representing the object part category corresponding to the key points of the target object.
Specifically, after the initial image recognition model is trained, the trained initial image recognition model may be referred to as a target image recognition model. The computer device may acquire a source image which may contain a target object to be pose-estimated, which may include, but is not limited to: human, animal, etc. Inputting a source image into a trained target image recognition model, obtaining object posture characteristics corresponding to a target object in the source image through the target image recognition model, and outputting an object component classification result corresponding to the object posture characteristics through a classifier of the target image recognition model, wherein the object component classification result can be used for representing object part categories corresponding to key points (such as human body joints) of the target object. The object posture feature may be a global object feature extracted by the target image recognition model and directed to the target object, or may be a fusion feature between a global object feature and a local object feature corresponding to the target object. When the object posture characteristics are global object characteristics corresponding to target objects in the source images, the fact that the block learning based on component perception is not introduced in the process of extracting the characteristics of the source images by using the target image recognition model is shown; when the object posture characteristics are fusion characteristics between global object characteristics and local object characteristics corresponding to target objects in the source images, the method indicates that in the process of extracting the characteristics of the source images by using a target image recognition model, block learning based on component perception is introduced.
Optionally, if block learning based on component sensing is introduced in the process of extracting features of the source image by using the target image recognition model, the computer device inputs the source image into the target image recognition model, obtains global object features corresponding to target objects in the source image in the target image recognition model, and outputs a global object classification result corresponding to the global object features according to a classifier in the target image recognition model; obtaining object convolution characteristics, output by a target convolution layer in a target image recognition model, for a source image, and performing product operation on the global object classification result and the object convolution characteristics to obtain a global object mapping map (for example, a characteristic image 20g in the embodiment corresponding to fig. 2) corresponding to the source image; according to the global object mapping image, a source image is subjected to blocking processing to obtain M object component area images, and object component characteristics corresponding to the M object component area images are obtained according to a target image recognition model; m is a positive integer; and combining the global object features and the object part features corresponding to the M object part area images into object posture features. The process of extracting the object posture feature here may refer to the process of extracting the target posture feature in steps S201 to S204 in the embodiment corresponding to fig. 6, and is not described here again.
It should be noted that, since the target image recognition model already has the capability of learning geometrically invariant position information, it is not necessary to process a transformed image after data transformation in the application process of the target image recognition model, that is, when the target image recognition model is used, it is not necessary to introduce a series of operations such as data transformation.
Step S302, generating an object part map according to the object component classification result and the object convolution feature of the source image.

Specifically, after obtaining the object component classification result, the computer device may multiply the object component classification result by the object convolution feature of the source image to generate the object part map; this is similar to the generation of the first activation map in the embodiment corresponding to fig. 6 and is not repeated here.

Step S303, obtaining a pixel average value corresponding to the object part map, determining a positioning result of the key points of the target object in the source image according to the pixel average value, and determining an attitude estimation result corresponding to the target object in the source image according to the object part category and the positioning result.

Specifically, the computer device may take the pixel average value of the object part map and determine the pixel average value as the positioning result of the key points of the target object in the source image, and may determine an object skeleton of the target object in the source image according to the object part category and the positioning result, where the object skeleton may be used as the attitude estimation result corresponding to the target object in the source image.
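One plausible reading of this "pixel average value", used in the sketch below purely as an assumption, is an activation-weighted average of the pixel coordinates of each category's map (a soft-argmax); the patent does not spell out the exact reduction.

import torch

def locate_keypoints(part_maps):
    # part_maps: [C, H, W], one object part map per category, assumed non-negative.
    C, H, W = part_maps.shape
    ys = torch.arange(H, dtype=torch.float32).view(1, H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, W)
    weights = part_maps / (part_maps.sum(dim=(1, 2), keepdim=True) + 1e-8)
    y = (weights * ys).sum(dim=(1, 2))               # expected row per category
    x = (weights * xs).sum(dim=(1, 2))               # expected column per category
    return torch.stack([x, y], dim=1)                # [C, 2] key point positions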
In one or more embodiments, the source image may include one or more target objects. After the source image is obtained, the computer device may first detect, through the target image recognition model, the target objects included in the source image and determine the region where each single target object is located in the source image, and may then perform pose estimation on the region where a single target object is located. That is, the computer device performs feature extraction on the region where the single target object is located, detects all key points (e.g., all human body joint points) included in the single target object as well as the object part category and positioning result corresponding to each key point, and connects the detected key points according to the object part category and positioning result of each key point to obtain an object skeleton corresponding to the single target object. The object skeleton may be used to represent the pose estimation result of the single target object.
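For illustration, the sketch below connects located key points into an object skeleton according to their object part categories; the part names and the bone list are hypothetical and are not taken from the patent.

SKELETON_EDGES = [
    ("head", "neck"), ("neck", "left_shoulder"), ("neck", "right_shoulder"),
    ("left_shoulder", "left_elbow"), ("right_shoulder", "right_elbow"),
    ("neck", "hip"), ("hip", "left_knee"), ("hip", "right_knee"),
]

def build_skeleton(keypoints):
    # keypoints: dict mapping an object part category name to an (x, y) position.
    bones = []
    for a, b in SKELETON_EDGES:
        if a in keypoints and b in keypoints:
            bones.append((keypoints[a], keypoints[b]))
    return bones  # list of line segments representing the object skeleton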
Optionally, the human body posture estimation method (the target image recognition model) provided by the application may be applied to different application scenes, such as a security monitoring scene, a human-computer interaction scene (e.g., virtual reality, human-computer animation, etc.), a content auditing scene, an auxiliary motion training scene, an automatic driving scene, a game or movie character action design scene, and the like. In a security monitoring scene, according to time sequence information corresponding to a video, a video frame in the video can be subjected to human body posture estimation by using a target image recognition model, the behavior of people in the video is determined according to a human body posture estimation result (namely the posture estimation result), and abnormal behaviors such as fighting are further detected, so that long-time uninterrupted intelligent monitoring is realized, and manpower and material resources spent on manual monitoring can be saved, namely the security monitoring cost is saved. In a human-computer interaction scene, a source image (or video) of a user can be collected, a target image recognition model is utilized to estimate the human body posture of the collected source image (or video), the machine is controlled according to the human body posture estimation result (which can also be understood as human body action information), and a specific instruction is executed according to a specific human body action; in an auxiliary exercise training scene, a source image (or video) of a user can be collected, a target image recognition model is used for carrying out human body posture estimation on the collected source image (or video), whether the action of a sporter is standard or not is determined according to the human body posture estimation result, and which exercise postures need to be improved, so that an intelligent and professional exercise guidance coach can be provided for the user who wants to exercise. In a game character action design scene, human body posture estimation can be carried out through the target image recognition model, human body actions can be obtained, expensive action capturing equipment is replaced, and the cost and the difficulty of game character action design can be reduced.
Taking a content auditing scene as an example, when the attitude estimation result is the same as the attitude of a target object in the content auditing system, determining the auditing result of the source image in the content auditing system as an auditing passing result, and setting access authority aiming at the content auditing system for the object corresponding to the source image; after the attitude estimation result passes the audit in the content audit system, the object corresponding to the source image can have the authority of accessing the content audit system. Optionally, the posture estimation in the content auditing system may refer to posture estimation of the whole human body, or may refer to posture estimation of a human body component, which is not limited in this application.
Please refer to fig. 10, which is a scene schematic diagram of human body posture estimation according to an embodiment of the present application. As shown in fig. 10, user A may send an authentication request to the server 70d through the user terminal 70a; after receiving the authentication request sent by the user terminal 70a, the server 70d may obtain an identity verification manner for user A and return it to the user terminal 70a, and a verification box 70b may be displayed on the terminal screen of the user terminal 70a. User A may face the verification box 70b on the user terminal 70a and perform a specific action (for example, raising the hands, kicking a leg, placing the hands on the hips, etc.); the user terminal 70a may collect the image to be verified 70c (which may be regarded as the source image) within the verification box 70b in real time and send the collected image to be verified 70c to the server 70d.

The server 70d may obtain the image to be verified 70c sent by the user terminal 70a and obtain a target object posture 70e set in advance by user A in the content auditing system, where the target object posture 70e may be used as the verification information of user A in the content auditing system. The server 70d may perform posture estimation on the image to be verified 70c by using the target image recognition model to obtain a posture estimation result corresponding to the image to be verified 70c, and then compare the similarity between this posture estimation result and the target object posture 70e. When the similarity between the posture estimation result of the image to be verified 70c and the target object posture 70e is greater than or equal to a similarity threshold (for example, the similarity threshold may be set to 90%), it may be determined that the posture estimation result of the image to be verified 70c is the same as the target object posture 70e, and user A passes the audit in the content auditing system. When the similarity between the posture estimation result of the image to be verified 70c and the target object posture 70e is smaller than the similarity threshold, it may be determined that the posture estimation result of the image to be verified 70c is different from the target object posture 70e, user A fails the audit in the content auditing system, and an action error prompt message is returned to the user terminal 70a, where the action error prompt message is used to prompt user A to perform the action again for identity verification.
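A sketch of this audit decision is given below: the estimated pose is compared with the preset target object pose and accepted when the similarity reaches the threshold (90% in the example above). Representing each pose as centered key point coordinates and using cosine similarity is an assumed similarity measure; the patent does not specify one.

import torch
import torch.nn.functional as F

def audit_pose(estimated_kpts, target_kpts, threshold=0.90):
    # estimated_kpts, target_kpts: [C, 2] key point positions for the same C categories.
    a = (estimated_kpts - estimated_kpts.mean(dim=0)).flatten()
    b = (target_kpts - target_kpts.mean(dim=0)).flatten()
    similarity = F.cosine_similarity(a, b, dim=0).item()
    # Audit passes when the two poses are deemed the same; otherwise an action error
    # prompt asks the user to perform the action again.
    return similarity >= threshold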
In the embodiment of the application, posture estimation is carried out on the source image through the trained target image recognition model, so that the positioning accuracy of key points in the image can be improved; in addition, block learning based on component perception is introduced into the target image recognition model, so that finer-grained features can be learned, an object part map that is more sensitive to the component positions can be obtained, and the positioning accuracy of the model can be further improved.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an image data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 11, the image data processing apparatus 1 may include: a first generation module 11, a second generation module 12, a loss result determination module 13 and a parameter correction module 14;
the first generation module 11 is configured to obtain a sample image, output a first classification result corresponding to the sample image through an initial image recognition model, and generate a first activation map according to the first classification result and a sample convolution feature of the sample image; the first classification result is determined by target posture characteristics corresponding to the sample object in the sample image, and the first activation mapping graph is used for representing the position information of key points of the sample object in the sample image;
the second generation module 12 is configured to perform data transformation on the sample image to obtain a deformed image, output a second classification result corresponding to the deformed image through the initial image recognition model, and generate a second activation map according to the second classification result and a deformation convolution feature of the deformed image; the second classification result is determined by deformation posture characteristics corresponding to the sample object in the deformation image, and the second activation mapping graph is used for representing the position information of key points of the sample object in the deformation image;
a loss result determining module 13, configured to determine a similarity loss result of the initial image recognition model according to the first activation map and the second activation map, and determine a classification loss result of the initial image recognition model according to the first classification result, the second classification result, and the keypoint category label carried by the sample image;
a parameter modification module 14, configured to modify a network parameter of the initial image recognition model based on the similarity loss result and the classification loss result, so as to generate a target image recognition model; the target image recognition model is used for predicting the object component category and the positioning result corresponding to the key points in the source image.
The specific functional implementation manners of the first generating module 11, the second generating module 12, the loss result determining module 13, and the parameter correcting module 14 may refer to steps S101 to S104 in the embodiment corresponding to fig. 3, which is not described herein again.
In one or more embodiments, the first generation module 11 may include: a feature extraction unit 111, a feature classification unit 112, a feature mapping unit 113, and an upsampling processing unit 114;
the feature extraction unit 111 is configured to input the sample image into the initial image recognition model, and obtain a target posture feature corresponding to the sample object in the sample image according to the initial image recognition model;
the feature classification unit 112 is configured to identify the target posture feature according to a classifier in the initial image identification model, so as to obtain a first classification result corresponding to the sample image;
the feature mapping unit 113 is configured to obtain a sample convolution feature, output by the target convolution layer in the initial image recognition model, for the sample image, and perform product operation on the first classification result and the sample convolution feature to obtain a candidate activation map corresponding to the sample image;
and an upsampling processing unit 114, configured to perform upsampling processing on the candidate activation maps to obtain a first activation map having the same image size as the sample image.
For specific functional implementation manners of the feature extraction unit 111, the feature classification unit 112, the feature mapping unit 113, and the upsampling processing unit 114, reference may be made to step S101 in the embodiment corresponding to fig. 3, which is not described herein again.
In one or more embodiments, the feature extraction unit 111 may include: a global classification subunit 1111, a block processing subunit 1112, a local feature extraction subunit 1113, and a feature combination subunit 1114;
the global classification subunit 1111 is configured to obtain, in the initial image recognition model, a global pose feature corresponding to the sample object in the sample image, and output, by using a classifier in the initial image recognition model, a global classification result corresponding to the global pose feature;
a block processing subunit 1112, configured to perform product operation on the global classification result and the sample convolution feature to obtain a global map corresponding to the sample image, and perform block processing on the sample image according to the global map to obtain M local area images; m is a positive integer;
a local feature extraction subunit 1113, configured to sequentially input the M local area images to the initial image recognition model, and obtain local pose features corresponding to the M local area images in the initial image recognition model;
the feature combination subunit 1114 is configured to perform feature combination on the global pose features and the local pose features corresponding to the M local area images to obtain target pose features corresponding to the sample object in the sample image.
Optionally, the initial image recognition model includes N residual error components, each residual error component includes one or more convolution layers, and N is a positive integer;
the global classification subunit 1111 may specifically be configured to:
acquiring input characteristics of an ith residual error component in the N residual error components; when i is 1, the input characteristic of the ith residual error component is a sample image, and i is a positive integer smaller than N;
performing convolution operation on the input characteristic of the ith residual error component according to one or more convolution layers in the ith residual error component to obtain a candidate convolution characteristic;
combining the candidate convolution characteristic and the input characteristic of the ith residual error component to obtain the residual error output characteristic of the ith residual error component, and taking the residual error output characteristic of the ith residual error component as the input characteristic of the (i + 1) th residual error component; the ith residual error component is connected with the (i + 1) th residual error component;
and determining the residual output characteristic of the Nth residual component as the global attitude characteristic corresponding to the sample object in the sample image.
Optionally, the number of the global attitude features is K, and K is a positive integer;
the global classification subunit 1111 is specifically configured to:
counting feature average values corresponding to the K global attitude features respectively, and combining the feature average values corresponding to the K global attitude features into a global feature vector;
converting the global feature vector into a feature vector to be classified according to an activation function in the initial image recognition model;
and inputting the feature vector to be classified into a classifier in the initial image recognition model, and outputting a global classification result corresponding to the feature vector to be classified through the classifier in the initial image recognition model.
The specific functional implementation manners of the global classification subunit 1111, the block processing subunit 1112, the local feature extraction subunit 1113, and the feature combination subunit 1114 may refer to step S201 in the embodiment corresponding to fig. 6, which is not described herein again.
In one or more embodiments, the loss result determining module 13 may include: a data transformation unit 131, a similarity constraint unit 132, a sample loss determination unit 133, a deformation loss determination unit 134, a classification loss determination unit 135;
a data transformation unit 131, configured to perform data transformation on the second activation map to obtain a deformed activation map;
and a similarity constraint unit 132, configured to perform similarity constraint on the first activation map and the deformation activation map, and determine a similarity loss result of the initial image recognition model.
A sample loss determining unit 133, configured to obtain a first error between the first classification result and a key point category label carried in the sample image, and determine a sample loss result of the initial image recognition model according to the first error;
a deformation loss determining unit 134, configured to obtain a second error between the second classification result and the keypoint classification label, and determine a deformation loss result of the initial image recognition model according to the second error;
and a classification loss determining unit 135, configured to determine a classification loss result of the initial image recognition model according to the sample loss result and the deformation loss result.
The specific functional implementation manners of the data transformation unit 131, the similarity constraint unit 132, the sample loss determination unit 133, the deformation loss determination unit 134, and the classification loss determination unit 135 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
In one or more embodiments, the parameter modification module 14 may include: a total loss determining unit 141, a network parameter adjusting unit 142;
a total loss determining unit 141, configured to determine a model total loss result corresponding to the initial image recognition model according to the similarity loss result and the classification loss result;
and a network parameter adjusting unit 142, configured to modify a network parameter of the initial image recognition model by performing minimum optimization on the total model loss result, and determine the initial image recognition model including the modified network parameter as the target image recognition model.
The specific functional implementation manners of the total loss determining unit 141 and the network parameter adjusting unit 142 may refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
In one or more embodiments, the image data processing apparatus 1 may further include: an object component classification module 15, an object map generation module 16, and a positioning result determination module 17;
the object component classification module 15 is used for acquiring a source image, acquiring object posture characteristics corresponding to a target object in the source image through a target image recognition model, and recognizing object component classification results corresponding to the object posture characteristics; the object component classification result is used for representing the object part category corresponding to the key point of the target object;
an object map generation module 16, configured to generate an object position map according to the object component classification result and the object convolution feature of the source image;
and the positioning result determining module 17 is configured to obtain a pixel average value corresponding to the object location mapping image, determine a positioning result of a key point in the target object in the source image according to the pixel average value, and determine an attitude estimation result corresponding to the target object in the source image according to the object location type and the positioning result.
For specific functional implementation manners of the object component classification module 15, the object map generation module 16, and the positioning result determination module 17, reference may be made to steps S301 to S303 in the embodiment corresponding to fig. 9, which are not described herein again.
In one or more embodiments, the object component classification module 15 may include: a global object classification unit 151, a global map generation unit 152, a component feature acquisition unit 153, a component feature combination unit 154;
the global object classification unit 151 is configured to input a source image into a target image recognition model, obtain, in the target image recognition model, a global object feature corresponding to a target object in the source image, and output a global object classification result corresponding to the global object feature according to a classifier in the target image recognition model;
the global map generating unit 152 is configured to obtain object convolution characteristics, output by a target convolution layer in the target image recognition model, for the source image, and perform product operation on the global object classification result and the object convolution characteristics to obtain a global object map corresponding to the source image;
a component feature obtaining unit 153, configured to perform blocking processing on the source image according to the global object map to obtain M object component region images, and obtain object component features corresponding to the M object component region images according to the target image recognition model; m is a positive integer;
a part feature combining unit 154, configured to combine the global object feature and the object part features corresponding to the M object part region images into an object pose feature.
For specific functional implementation manners of the global object classifying unit 151, the global map generating unit 152, the component feature acquiring unit 153, and the component feature combining unit 154, reference may be made to step S301 in the embodiment corresponding to fig. 9, which is not described herein again.
In one or more embodiments, the image data processing apparatus may further include: an audit module 18.
And the auditing module 18 is used for determining that the auditing result of the source image in the content auditing system is an auditing passing result when the attitude estimation result is the same as the attitude of the target object in the content auditing system, and setting the access right aiming at the content auditing system for the object corresponding to the source image.
The specific function implementation manner of the auditing module 18 may refer to step S303 in the embodiment corresponding to fig. 9, which is not described herein again.
In the embodiment of the application, target posture characteristics in a sample image are extracted through an initial image recognition model, a first classification result of the target posture characteristics is obtained through classification and recognition of the target posture characteristics, and a first activation mapping chart is generated based on the first classification result and sample convolution characteristics of the sample image; meanwhile, data transformation can be carried out on the sample image to obtain a deformed image, deformation posture characteristics in the deformed image are extracted through the initial image recognition model, and a second activation mapping chart is generated according to a second classification result of the deformation posture characteristics and deformation convolution characteristics of the deformed image; furthermore, similarity constraint (namely a similarity loss result) can be applied to the first activation mapping chart and the second activation mapping chart, and therefore the positioning accuracy of key points in the images can be improved by the trained target image recognition model; in addition, when the initial image recognition model is trained, the position information of each key point of the sample object in the sample image does not need to be marked, namely the sample image is wider in acquisition source, the marking operation of the key point position of the sample image can be reduced, and the image processing efficiency can be improved; the part sensing-based block learning is introduced into the initial image recognition model, so that finer-grained features can be learned, a first activation mapping chart more sensitive to the position of the part can be further obtained, and the positioning accuracy of the model can be further improved.
Further, please refer to fig. 12, where fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 12, the computer device 1000 may be a user terminal, for example, the user terminal 10a in the embodiment corresponding to fig. 1, or may also be a server, for example, the server 10d in the embodiment corresponding to fig. 1, which is not limited herein. For convenience of understanding, in this application, taking the computer device as a user terminal as an example, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005, and the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection communication between these components. The user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 12, the memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.
The network interface 1004 in the computer device 1000 may also provide a network communication function, and the optional user interface 1003 may also include a Display screen (Display) and a Keyboard (Keyboard). In the computer device 1000 shown in fig. 12, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing input to a user; and the processor 1001 may be configured to invoke the device control application stored in the memory 1005 to implement:
acquiring a sample image, outputting a first classification result corresponding to the sample image through an initial image recognition model, and generating a first activation mapping according to the first classification result and the sample convolution characteristic of the sample image; the first classification result is determined by target posture characteristics corresponding to the sample object in the sample image, and the first activation map is used for representing the position information of key points of the sample object in the sample image;
carrying out data transformation on the sample image to obtain a deformed image, outputting a second classification result corresponding to the deformed image through the initial image recognition model, and generating a second activation mapping chart according to the second classification result and the deformed convolution characteristic of the deformed image; the second classification result is determined by deformation posture characteristics corresponding to the sample object in the deformation image, and the second activation mapping graph is used for representing the position information of key points of the sample object in the deformation image;
determining a similarity loss result of the initial image recognition model according to the first activation mapping chart and the second activation mapping chart, and determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and the key point class label carried by the sample image;
based on the similarity loss result and the classification loss result, correcting the network parameters of the initial image recognition model to generate a target image recognition model; the target image recognition model is used for predicting the object component category and the positioning result corresponding to the key points in the source image.
It should be understood that the computer device 1000 described in this embodiment of the present application may execute the image data processing method described in the embodiment corresponding to any one of fig. 3, fig. 6, and fig. 9, and may also implement the functions of the image data processing apparatus 1 described in the embodiment corresponding to fig. 11, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer program executed by the aforementioned image data processing apparatus 1 is stored in the computer-readable storage medium, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the image data processing method in any one of the embodiments corresponding to fig. 3, fig. 6, and fig. 9 can be executed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application. As an example, the program instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network, which may constitute a block chain system.
Further, it should be noted that: embodiments of the present application also provide a computer program product or computer program, which may include computer instructions, which may be stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor can execute the computer instruction, so that the computer device executes the description of the image data processing method in the embodiment corresponding to any one of fig. 3, fig. 6, and fig. 9, which will not be described again here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the computer program product or computer program embodiments referred to in the present application, reference is made to the description of the method embodiments of the present application.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer, and the processes of the embodiments of the methods described above may be included in the programs. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (15)

1. An image data processing method, characterized by comprising:
obtaining a sample image, outputting a first classification result corresponding to the sample image through an initial image recognition model, and generating a first activation mapping map according to the first classification result and the sample convolution characteristics of the sample image; the first classification result is determined by target posture characteristics corresponding to a sample object in the sample image, and the first activation map is used for representing the position information of key points of the sample object in the sample image;
carrying out data transformation on the sample image to obtain a deformed image, outputting a second classification result corresponding to the deformed image through the initial image identification model, and generating a second activation mapping map according to the second classification result and the deformed convolution characteristic of the deformed image; the second classification result is determined by deformation posture characteristics corresponding to a sample object in the deformation image, and the second activation map is used for representing the position information of key points of the sample object in the deformation image;
determining a similarity loss result of the initial image recognition model according to the first activation mapping map and the second activation mapping map, and determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and a key point class label carried by the sample image;
correcting network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model; the target image recognition model is used for predicting the object component category and the positioning result corresponding to the key points in the source image.
2. The method of claim 1, wherein the outputting a first classification result corresponding to the sample image through an initial image recognition model, and generating a first activation map according to the first classification result and sample convolution features of the sample image comprises:
inputting the sample image into the initial image recognition model, and acquiring the target pose features corresponding to the sample object in the sample image according to the initial image recognition model;
recognizing the target pose features according to a classifier in the initial image recognition model to obtain the first classification result corresponding to the sample image;
obtaining sample convolution features output by a target convolution layer in the initial image recognition model for the sample image, and performing a product operation on the first classification result and the sample convolution features to obtain a candidate activation map corresponding to the sample image;
and performing upsampling processing on the candidate activation map to obtain the first activation map with the same image size as the sample image.
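As an illustration of the activation-map construction in claim 2, the sketch below weights the convolution feature maps of a target convolution layer by the classification output and upsamples the result to the input size. The weighted-sum (CAM-style) reading of the "product operation", the alignment of class scores with feature channels, and bilinear upsampling are assumptions.

```python
import torch
import torch.nn.functional as F

def build_activation_map(class_scores, conv_features, image_size):
    """Hedged sketch of claim 2: product of the classification result and the
    convolution features, followed by upsampling to the sample-image size.

    class_scores:  (B, K)        classification result
    conv_features: (B, K, h, w)  features from the target convolution layer
    image_size:    (H, W)        spatial size of the sample image
    """
    # Product operation: weight each feature channel by its class score and
    # sum over channels (assumes the score vector aligns with the channels).
    candidate_map = torch.einsum('bk,bkhw->bhw', class_scores, conv_features)

    # Upsample the candidate activation map to the sample-image size.
    first_map = F.interpolate(candidate_map.unsqueeze(1), size=image_size,
                              mode='bilinear', align_corners=False)
    return first_map.squeeze(1)
```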
3. The method of claim 2, wherein the acquiring the target pose features corresponding to the sample object in the sample image according to the initial image recognition model comprises:
acquiring, in the initial image recognition model, global pose features corresponding to the sample object in the sample image, and outputting a global classification result corresponding to the global pose features through a classifier in the initial image recognition model;
performing a product operation on the global classification result and the sample convolution features to obtain a global map corresponding to the sample image, and partitioning the sample image into blocks according to the global map to obtain M local region images; M is a positive integer;
sequentially inputting the M local region images into the initial image recognition model, and acquiring local pose features corresponding to the M local region images in the initial image recognition model;
and performing feature combination on the global pose features and the local pose features corresponding to the M local region images to obtain the target pose features corresponding to the sample object in the sample image.
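One way the partitioning and feature combination of claim 3 could look in code is sketched below: regions where the global map activates strongly are cropped, re-encoded with the same backbone, and their features concatenated with the global pose features. The fixed crop size, the top-k selection rule, and concatenation as the "feature combination" are all assumptions rather than details stated in the claim.

```python
import torch

def combine_global_and_local_features(backbone, sample_image, global_map,
                                      global_features, m=4, crop=64):
    """Hedged sketch of claim 3: partition the sample image according to the
    global map into M local region images, encode each, and combine features.

    sample_image:    (C, H, W)
    global_map:      (H, W) activation map aligned with the image
    global_features: (D,)   pooled global pose features
    """
    h, w = global_map.shape
    # Pick the m strongest activation positions as local-region centres
    # (one plausible reading of "partitioning according to the global map").
    centres = torch.topk(global_map.flatten(), m).indices
    local_features = []
    for idx in centres:
        cy, cx = int(idx // w), int(idx % w)
        top = max(0, min(h - crop, cy - crop // 2))
        left = max(0, min(w - crop, cx - crop // 2))
        region = sample_image[:, top:top + crop, left:left + crop]
        # Encode each local region image with the same recognition backbone.
        local_features.append(backbone(region.unsqueeze(0)).flatten())
    # Feature combination: concatenate global and local pose features.
    return torch.cat([global_features] + local_features)
```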
4. The method of claim 3, wherein the initial image recognition model comprises N residual components, each residual component comprising one or more convolution layers, and N is a positive integer;
the acquiring, in the initial image recognition model, global pose features corresponding to the sample object in the sample image comprises:
acquiring an input feature of an ith residual component among the N residual components; when i is 1, the input feature of the ith residual component is the sample image, and i is a positive integer smaller than N;
performing a convolution operation on the input feature of the ith residual component according to the one or more convolution layers in the ith residual component to obtain a candidate convolution feature;
combining the candidate convolution feature with the input feature of the ith residual component to obtain a residual output feature of the ith residual component, and using the residual output feature of the ith residual component as the input feature of an (i+1)th residual component; the ith residual component is connected to the (i+1)th residual component;
and determining the residual output feature of the Nth residual component as the global pose features corresponding to the sample object in the sample image.
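Claim 4 describes a standard residual connection. A minimal PyTorch-style residual component consistent with that description might look like the sketch below; the particular layer widths, normalization, and identity shortcut are assumptions.

```python
import torch.nn as nn

class ResidualComponent(nn.Module):
    """Hedged sketch of one residual component from claim 4: one or more
    convolution layers whose output is combined with the component input."""

    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Candidate convolution feature from the convolution layers.
        candidate = self.convs(x)
        # Combine with the input feature to form the residual output feature,
        # which becomes the input feature of the next residual component.
        return self.relu(candidate + x)
```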
5. The method of claim 3, wherein the number of global pose features is K, K being a positive integer;
the outputting a global classification result corresponding to the global pose features through a classifier in the initial image recognition model comprises:
computing feature average values corresponding to the K global pose features respectively, and combining the feature average values corresponding to the K global pose features into a global feature vector;
converting the global feature vector into a feature vector to be classified according to an activation function in the initial image recognition model;
and inputting the feature vector to be classified into the classifier in the initial image recognition model, and outputting the global classification result corresponding to the feature vector to be classified through the classifier in the initial image recognition model.
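The per-feature averaging in claim 5 corresponds to global average pooling over the K global pose feature maps; a sketch follows. Treating the unspecified activation function as ReLU and the classifier as a single linear layer are assumptions.

```python
import torch.nn as nn

class GlobalClassifierHead(nn.Module):
    """Hedged sketch of claim 5: average each of the K global pose feature
    maps, activate the resulting global feature vector, and classify it."""

    def __init__(self, k_features, num_keypoint_categories):
        super().__init__()
        self.activation = nn.ReLU()  # assumed activation function
        self.classifier = nn.Linear(k_features, num_keypoint_categories)

    def forward(self, global_pose_features):
        # global_pose_features: (B, K, h, w); the mean over (h, w) is the
        # feature average value, giving a (B, K) global feature vector.
        global_vector = global_pose_features.mean(dim=(2, 3))
        to_classify = self.activation(global_vector)
        return self.classifier(to_classify)  # global classification result
```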
6. The method of claim 1, wherein the determining a similarity loss result of the initial image recognition model according to the first activation map and the second activation map comprises:
performing the data transformation on the second activation map to obtain a deformed activation map;
and performing a similarity constraint on the first activation map and the deformed activation map to determine the similarity loss result of the initial image recognition model.
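Claim 6 transforms the second activation map with the same data transformation and then constrains it to resemble the first. A sketch follows, with mean-squared error standing in for the unspecified similarity constraint.

```python
import torch.nn.functional as F

def similarity_loss(first_map, second_map, data_transform):
    """Hedged sketch of claim 6: apply the data transformation to the second
    activation map, then constrain its similarity to the first activation map.
    Mean-squared error as the similarity constraint is an assumption; any
    differentiable similarity measure could be substituted."""
    deformed_map = data_transform(second_map)
    return F.mse_loss(first_map, deformed_map)
```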
7. The method of claim 1, wherein the determining a classification loss result of the initial image recognition model according to the first classification result, the second classification result and the keypoint category label carried by the sample image comprises:
obtaining a first error between the first classification result and the keypoint category label carried by the sample image, and determining a sample loss result of the initial image recognition model according to the first error;
acquiring a second error between the second classification result and the keypoint category label, and determining a deformation loss result of the initial image recognition model according to the second error;
and determining the classification loss result of the initial image recognition model according to the sample loss result and the deformation loss result.
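Claim 7 computes one error for the sample branch and one for the deformed branch against the same keypoint category label. A sketch follows; cross-entropy as the error measure and an unweighted sum as the combination are assumptions, since the claim only fixes the structure.

```python
import torch.nn.functional as F

def classification_loss(first_logits, second_logits, keypoint_labels):
    """Hedged sketch of claim 7: a sample loss and a deformation loss
    combined into the classification loss result."""
    sample_loss = F.cross_entropy(first_logits, keypoint_labels)        # first error
    deformation_loss = F.cross_entropy(second_logits, keypoint_labels)  # second error
    return sample_loss + deformation_loss
```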
8. The method of claim 1, wherein the correcting network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model comprises:
determining a model total loss result corresponding to the initial image recognition model according to the similarity loss result and the classification loss result;
and correcting the network parameters of the initial image recognition model by performing minimization optimization on the model total loss result, and determining the initial image recognition model containing the corrected network parameters as the target image recognition model.
9. The method of claim 1, further comprising:
acquiring a source image, acquiring object pose features corresponding to a target object in the source image through the target image recognition model, and recognizing an object part classification result corresponding to the object pose features; the object part classification result is used for representing object part categories corresponding to keypoints of the target object;
generating an object position map according to the object part classification result and object convolution features of the source image;
and acquiring a pixel average value corresponding to the object position map, determining a positioning result of the keypoints of the target object in the source image according to the pixel average value, and determining a pose estimation result corresponding to the target object in the source image according to the object part categories and the positioning result.
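At inference, a positioning result can be read off an object position map by averaging pixel coordinates weighted by the map values, which is one plausible reading of the "pixel average value" in claim 9; the softmax normalization in the sketch below is an assumption.

```python
import torch

def locate_keypoint(position_map):
    """Hedged sketch of claim 9's positioning step: turn one object position
    map of shape (H, W) into an (x, y) coordinate by averaging pixel
    positions weighted by the normalized map values."""
    h, w = position_map.shape
    weights = torch.softmax(position_map.flatten(), dim=0).reshape(h, w)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing='ij')
    y = (weights * ys).sum()
    x = (weights * xs).sum()
    return x.item(), y.item()
```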
10. The method of claim 9, wherein the acquiring object pose features corresponding to the target object in the source image through the target image recognition model comprises:
inputting the source image into the target image recognition model, acquiring, in the target image recognition model, global object features corresponding to the target object in the source image, and outputting a global object classification result corresponding to the global object features according to a classifier in the target image recognition model;
obtaining object convolution features output by a target convolution layer in the target image recognition model for the source image, and performing a product operation on the global object classification result and the object convolution features to obtain a global object map corresponding to the source image;
partitioning the source image into blocks according to the global object map to obtain M object part region images, and acquiring object part features corresponding to the M object part region images according to the target image recognition model; M is a positive integer;
and combining the global object features and the object part features corresponding to the M object part region images into the object pose features.
11. The method of claim 9, further comprising:
and when the pose estimation result is the same as the pose of the target object in the content auditing system, determining that the audit result of the source image in the content auditing system is a pass result, and granting the object corresponding to the source image access permission to the content auditing system.
12. An image data processing apparatus characterized by comprising:
a first generation module, configured to acquire a sample image, output a first classification result corresponding to the sample image through an initial image recognition model, and generate a first activation map according to the first classification result and sample convolution features of the sample image; the first classification result is determined by target pose features corresponding to a sample object in the sample image, and the first activation map is used for representing position information of keypoints of the sample object in the sample image;
a second generation module, configured to perform data transformation on the sample image to obtain a deformed image, output a second classification result corresponding to the deformed image through the initial image recognition model, and generate a second activation map according to the second classification result and deformation convolution features of the deformed image; the second classification result is determined by deformation pose features corresponding to the sample object in the deformed image, and the second activation map is used for representing position information of keypoints of the sample object in the deformed image;
a loss result determining module, configured to determine a similarity loss result of the initial image recognition model according to the first activation map and the second activation map, and determine a classification loss result of the initial image recognition model according to the first classification result, the second classification result and a keypoint category label carried by the sample image;
and a parameter correction module, configured to correct network parameters of the initial image recognition model based on the similarity loss result and the classification loss result to generate a target image recognition model; the target image recognition model is used for predicting object part categories and positioning results corresponding to keypoints in a source image.
13. A computer device comprising a memory and a processor;
the memory is coupled to the processor, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any one of claims 1 to 11.
14. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded and executed by a processor to cause a computer device having said processor to carry out the method of any one of claims 1 to 11.
15. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any one of claims 1-11.
CN202111123361.1A 2021-09-24 2021-09-24 Image data processing method, apparatus, device and medium Pending CN115862054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111123361.1A CN115862054A (en) 2021-09-24 2021-09-24 Image data processing method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111123361.1A CN115862054A (en) 2021-09-24 2021-09-24 Image data processing method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN115862054A true CN115862054A (en) 2023-03-28

Family

ID=85653187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111123361.1A Pending CN115862054A (en) 2021-09-24 2021-09-24 Image data processing method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN115862054A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385829A (en) * 2023-04-07 2023-07-04 北京百度网讯科技有限公司 Gesture description information generation method, model training method and device
CN116385829B (en) * 2023-04-07 2024-02-06 北京百度网讯科技有限公司 Gesture description information generation method, model training method and device

Similar Documents

Publication Publication Date Title
US11455495B2 (en) System and method for visual recognition using synthetic training data
US10475207B2 (en) Forecasting multiple poses based on a graphical image
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
WO2021052375A1 (en) Target image generation method, apparatus, server and storage medium
US20200272806A1 (en) Real-Time Tracking of Facial Features in Unconstrained Video
CN109978754A (en) Image processing method, device, storage medium and electronic equipment
CN111476097A (en) Human body posture assessment method and device, computer equipment and storage medium
CN111739027B (en) Image processing method, device, equipment and readable storage medium
CN113196289A (en) Human body action recognition method, human body action recognition system and device
CN111191599A (en) Gesture recognition method, device, equipment and storage medium
CN110490959B (en) Three-dimensional image processing method and device, virtual image generating method and electronic equipment
CN111553284A (en) Face image processing method and device, computer equipment and storage medium
CN111914676A (en) Human body tumbling detection method and device, electronic equipment and storage medium
CN114549369B (en) Data restoration method and device, computer and readable storage medium
CN111508033A (en) Camera parameter determination method, image processing method, storage medium, and electronic apparatus
CN112287730A (en) Gesture recognition method, device, system, storage medium and equipment
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
CN112990154B (en) Data processing method, computer equipment and readable storage medium
CN115862054A (en) Image data processing method, apparatus, device and medium
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN112307799A (en) Gesture recognition method, device, system, storage medium and equipment
CN116580054A (en) Video data processing method, device, equipment and medium
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN118119971A (en) Electronic device and method for determining height of person using neural network
CN113573009A (en) Video processing method, video processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40083068; Country of ref document: HK)
SE01 Entry into force of request for substantive examination