CN110929616A

CN110929616A - Human hand recognition method and device, electronic equipment and storage medium

Info

Publication number: CN110929616A
Application number: CN201911114483.7A
Authority: CN
Inventors: 张�雄
Original assignee: Reach Best Technology Co Ltd
Current assignee: Reach Best Technology Co Ltd
Priority date: 2019-11-14
Filing date: 2019-11-14
Publication date: 2020-03-27
Anticipated expiration: 2039-11-14
Also published as: CN110929616B

Abstract

The present disclosure relates to a human hand recognition method and apparatus, an electronic device, and a storage medium. The method comprises: performing feature extraction on an image to be detected through a feature extractor of a human hand recognition network model to obtain image features; processing the image features through a multi-task branch network layer to obtain a first edge feature map, a first region feature map and a first key point feature map; regressing the addition result of the first edge feature map, the first region feature map and the first key point feature map through a regression layer to obtain a first posture parameter representing the posture of the human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected; and generating a three-dimensional model of the human hand in the image to be detected through a MANO network based on the first posture parameter and the first shape parameter. With this method and apparatus, the human hand edge, the human hand region and the two-dimensional human hand key points are recognized, and the three-dimensional model of the human hand is obtained, through a single network model, so that human hand recognition efficiency can be improved.

Description

Human hand recognition method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of image processing technologies, and in particular, to a human hand recognition method, an apparatus, an electronic device, and a storage medium.

Background

With the development of internet technology, recognizing human hands in images is applied more and more widely, for example in virtual/augmented reality, human-computer interaction, action recognition, and driving assistance.

Recognizing a human hand in an image involves several recognition tasks: recognizing the human hand edge in the image, recognizing the human hand region in the image, recognizing the two-dimensional human hand key points in the image, and three-dimensional modeling of the human hand in the image. In the prior art, completing these tasks generally requires a separate network model to be established for each recognition task. For example, the human hand edge in the image is recognized through a human hand edge recognition network model, the human hand region in the image is recognized through a human hand region recognition network model, the two-dimensional human hand key points in the image are recognized through a two-dimensional human hand key point recognition network model, and the three-dimensional model of the human hand in the image is generated through a three-dimensional reconstruction network model.

Therefore, in the related art, a plurality of network models need to be constructed when the human hand recognition is performed, so that the human hand recognition efficiency is low.

Disclosure of Invention

The present disclosure provides a human hand recognition method, a human hand recognition device, an electronic device and a storage medium, in which the human hand edge, the human hand region and the two-dimensional human hand key points in an image are recognized, and a three-dimensional model of the human hand in the image is generated, through a single network model, so that human hand recognition efficiency can be improved. The technical scheme of the disclosure is as follows:

according to a first aspect of embodiments of the present disclosure, there is provided a human hand recognition method, including:

inputting an image to be detected containing a human hand into a pre-trained human hand recognition network model, wherein the human hand recognition network model comprises a feature extractor, a multi-task branch network layer, a regression layer and a MANO network;

performing feature extraction on the image to be detected through the feature extractor to obtain image features of the image to be detected;

processing the image features through the multitask branch network layer to obtain a first edge feature graph representing the edge of a human hand in the image to be detected, a first region feature graph representing a human hand region in the image to be detected and a first key point feature graph representing two-dimensional human hand key points in the image to be detected;

regressing the addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer to obtain a first posture parameter representing the posture of the human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected;

and generating a three-dimensional model of the human hand in the image to be detected through the MANO network based on the first posture parameter and the first shape parameter.

Optionally, the multitasking branch network layer includes an encoder, an edge decoder, a region decoder, and a heatmap decoder;

the processing the image features through the multi-task branch network layer to obtain a first edge feature map representing the edge of a human hand in the image to be detected, a first region feature map representing the region of the human hand in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected, includes:

encoding the image characteristics through the encoder to obtain high-level image semantic information of the image to be detected;

decoding the semantic information of the high-level image through the edge decoder to obtain a first edge feature map representing the edge of a human hand in the image to be detected; decoding the semantic information of the high-level image through the region decoder to obtain a first region feature map representing a hand region in the image to be detected; and decoding the semantic information of the high-level image through the heat map decoder to obtain a first key point feature map representing key points of a two-dimensional human hand in the image to be detected.
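The encode-once, decode-three-times structure described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: `multitask_branch`, `toy_encoder` and `scale_by` are hypothetical names, the sub-networks are replaced by trivial arithmetic stand-ins, and each decoder is simplified to return a single map (the heat map decoder of the disclosure may return several).

```python
from typing import Callable, Dict, List

FeatureMap = List[List[float]]  # a single-channel 2D map, stored as rows of floats


def multitask_branch(
    image_features: FeatureMap,
    encoder: Callable[[FeatureMap], FeatureMap],
    decoders: Dict[str, Callable[[FeatureMap], FeatureMap]],
) -> Dict[str, FeatureMap]:
    """Encode once, then run every decoder on the same encoded features."""
    semantic = encoder(image_features)  # shared high-level semantic information
    return {name: decode(semantic) for name, decode in decoders.items()}


# Hypothetical stand-ins for trained sub-networks, for illustration only.
def toy_encoder(fm: FeatureMap) -> FeatureMap:
    return [[2.0 * v for v in row] for row in fm]


def scale_by(factor: float) -> Callable[[FeatureMap], FeatureMap]:
    return lambda fm: [[factor * v for v in row] for row in fm]


outputs = multitask_branch(
    [[1.0, 2.0], [3.0, 4.0]],
    toy_encoder,
    {"edge": scale_by(1.0), "region": scale_by(0.5), "keypoint": scale_by(0.1)},
)
```

Because the encoder runs only once per image, the three decoders share its cost, which is the source of the efficiency gain over three separate models.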

Optionally, the human hand recognition network model further includes a differential rendering layer;

after the image features are processed through the multitask branch network layer to obtain a first edge feature map representing the edge of a human hand in the image to be detected, a first region feature map representing a human hand region in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected, the method further includes:

regressing the addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer to obtain a first camera parameter;

based on the first camera parameter, projecting a three-dimensional model of a human hand in the image to be detected through the differential rendering layer to obtain first human hand projection information, wherein the first human hand projection information comprises at least one of the following information: the human hand region projected by the image to be detected, the two-dimensional human hand key points projected by the image to be detected and the three-dimensional human hand key points projected by the image to be detected.

Optionally, the training of the human hand recognition network model comprises the following steps:

inputting a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature map representing the edge of the human hand in the sample image, a second area feature map representing the area of the human hand in the sample image and a second key point feature map representing two-dimensional human hand key points in the sample image; the sample image is provided with an annotated human hand area, two-dimensional human hand key points and three-dimensional human hand key points;

regressing the addition result of the second edge feature map, the second region feature map and the second key point feature map through the regression layer to obtain a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image and a second shape parameter representing the shape of the human hand in the sample image;

generating a three-dimensional model of the human hand in the sample image as a sample three-dimensional model through the MANO network based on the second pose parameter and the second shape parameter;

based on the second camera parameter, projecting the sample three-dimensional model through the differential rendering layer to obtain second human hand projection information, wherein the second human hand projection information includes at least one of the following information: the human hand area projected by the sample image, the two-dimensional human hand key point projected by the sample image and the three-dimensional human hand key point projected by the sample image;

training model parameters of the human hand recognition network model according to the difference between the second human hand projection information and human hand information corresponding to the labeled sample image;

and when the hand recognition network model converges, obtaining the trained hand recognition network model.

Optionally, after the sample image including the human hand is input to the initial human hand recognition network model, and a second edge feature map representing edges of the human hand in the sample image, a second region feature map representing regions of the human hand in the sample image, and a second keypoint feature map representing keypoints of the two-dimensional human hand in the sample image are obtained, the method further includes:

predicting a human hand region in the sample image based on the second region feature map;

predicting two-dimensional human hand key points in the sample image based on the second key point feature map;

the training of the model parameters of the human hand recognition network model according to the difference between the second human hand projection information and the human hand information corresponding to the labeled sample image comprises the following steps:

and training model parameters of the human hand recognition network model by combining the difference between the second human hand projection information and the human hand information corresponding to the labeled sample image and the difference between the predicted human hand information and the human hand information corresponding to the labeled sample image, wherein the predicted human hand information comprises a predicted human hand area in the sample image and/or a predicted two-dimensional human hand key point in the sample image.
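One way to combine the two differences is a weighted sum of a projection term and a prediction term. The sketch below is an assumption, since the disclosure does not fix the form of either term: it treats the region feature map as per-pixel logits, thresholds them to predict the region, scores them with binary cross-entropy, and adds the result to a projection loss that is assumed to be computed elsewhere. The names `predict_mask`, `region_bce`, `combined_loss` and `prediction_weight` are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def predict_mask(region_logits: Grid) -> List[List[int]]:
    """Binary hand mask: a pixel is 'hand' where sigmoid(logit) > 0.5, i.e. logit > 0."""
    return [[1 if v > 0.0 else 0 for v in row] for row in region_logits]


def region_bce(region_logits: Grid, annotated_mask: List[List[int]]) -> float:
    """Mean binary cross-entropy between logits and a 0/1 annotated mask."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(region_logits, annotated_mask):
        for x, y in zip(logit_row, label_row):
            # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)], p = sigmoid(x).
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


def combined_loss(projection_loss: float, prediction_loss: float,
                  prediction_weight: float = 1.0) -> float:
    """Weighted sum of the projection term and the prediction term."""
    return projection_loss + prediction_weight * prediction_loss


logits = [[2.0, -1.0], [0.5, -3.0]]   # toy region feature map
labels = [[1, 0], [1, 0]]             # toy annotated human hand region
mask = predict_mask(logits)
loss = combined_loss(projection_loss=0.8,
                     prediction_loss=region_bce(logits, labels),
                     prediction_weight=0.5)
```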

According to a second aspect of the embodiments of the present disclosure, there is provided a human hand recognition device including:

a first processing module configured to input an image to be detected containing a human hand into a pre-trained human hand recognition network model, wherein the human hand recognition network model comprises a feature extractor, a multi-task branch network layer, a regression layer and a MANO network;

the extraction module is configured to perform feature extraction on the image to be detected through the feature extractor to obtain image features of the image to be detected;

a second processing module configured to process the image features through the multi-task branch network layer to obtain a first edge feature map representing the edge of the human hand in the image to be detected, a first region feature map representing the human hand region in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected;

the regression module is configured to perform regression on an addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer to obtain a first posture parameter representing a posture of a human hand in the image to be detected and a first shape parameter representing a shape of the human hand in the image to be detected;

a generating module configured to generate a three-dimensional model of the human hand in the image to be detected through the MANO network based on the first posture parameter and the first shape parameter.

the second processing module is specifically configured to perform encoding on the image features through the encoder to obtain high-level image semantic information of the image to be detected;

the device further comprises:

a third processing module configured to perform regression on an addition result of the first edge feature map, the first region feature map, and the first keypoint feature map through the regression layer to obtain a first camera parameter;

the device further comprises:

the training module is configured to input a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature map representing the edge of the human hand in the sample image, a second region feature map representing a human hand region in the sample image and a second key point feature map representing two-dimensional human hand key points in the sample image; the sample image is provided with an annotated human hand area, two-dimensional human hand key points and three-dimensional human hand key points;

Optionally, the apparatus further comprises:

a prediction module configured to perform prediction of a human hand region in the sample image based on the second region feature map;

the training module is specifically configured to train model parameters of the human hand recognition network model by combining a difference between the second human hand projection information and human hand information corresponding to the labeled sample image and a difference between predicted human hand information and human hand information corresponding to the labeled sample image, wherein the predicted human hand information includes a predicted human hand region in the sample image and/or predicted two-dimensional human hand key points in the sample image.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the human hand recognition method as described above in the first aspect.

According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the human hand recognition method as described in the first aspect above.

According to a fifth aspect of embodiments of the present application, there is provided a computer program product, wherein the instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the human hand recognition method as described in the first aspect above.

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects: the method comprises the steps of extracting features of an image to be detected through a feature extractor of a human hand recognition network model to obtain image features, processing the image features through a multi-task branch network layer to obtain a first edge feature map, a first region feature map and a first key point feature map, regressing an addition result of the first edge feature map, the first region feature map and the first key point feature map through a regression layer to obtain a first posture parameter representing the posture of a human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected, and then generating a three-dimensional model of the human hand in the image to be detected through an MANO network based on the first posture parameter and the first shape parameter.

Based on the above processing, the human hand edge, the human hand region, and the two-dimensional human hand key points in the image can be recognized, and the three-dimensional model of the human hand in the image can be generated, through a single network model (i.e., the human hand recognition network model in the embodiment of the present disclosure), so that human hand recognition efficiency can be improved. In addition, the multi-task branch network layer makes full use of the annotation information of the image, which improves the generalization performance of the human hand recognition network model and the accuracy of the human hand recognition result; and because the human hand recognition network model includes the MANO network, the ambiguity problem can be avoided.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a flow chart illustrating a method of human hand recognition according to an exemplary embodiment.

FIG. 2 is a schematic diagram illustrating a structure of a human hand recognition network model according to an exemplary embodiment.

FIG. 3 is a flow diagram illustrating a method of training a human hand recognition network model in accordance with an exemplary embodiment.

FIG. 4 is a block diagram illustrating a human hand recognition device in accordance with one exemplary embodiment.

FIG. 5 is a block diagram illustrating an electronic device for recognizing a human hand in accordance with an exemplary embodiment.

FIG. 6 is a block diagram illustrating an electronic device for recognizing a human hand in accordance with an exemplary embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

FIG. 1 is a flowchart illustrating a human hand recognition method according to an exemplary embodiment. As shown in FIG. 1, the human hand recognition method may be applied to an electronic device, which may be a terminal (e.g., a mobile phone, a computer, or a tablet computer) or a server. The method may comprise the following steps:

In step S101, an image to be detected including a human hand is input to a human hand recognition network model trained in advance.

The human hand recognition network model may comprise a feature extractor, a multi-task branch network layer, a regression layer and a MANO network. The image to be detected may be an RGB (Red, Green, Blue) image.

In step S102, feature extraction is performed on the image to be detected by the feature extractor, so as to obtain image features of the image to be detected.

The feature extractor may be composed of convolutional layers.

In one embodiment, the electronic device may perform a convolution operation on the image to be detected by using a feature extractor composed of convolutional layers to extract the image features of the image to be detected. The image features may be a feature map of smaller size than the image to be detected, which reduces the amount of computation of the network model.
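To illustrate how a convolution can yield a smaller feature map, the following sketch applies a single strided convolution to a toy single-channel image. It is an assumption for illustration only: the disclosure does not state kernel sizes, strides, or channel counts, and the 2 x 2 averaging kernel and the name `conv2d` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Valid (no padding) 2D cross-correlation of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]  # 8x8 toy input
kernel = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical 2x2 averaging kernel
features = conv2d(image, kernel, stride=2)
print(len(features), len(features[0]))  # 4 4
```

With stride 2, each spatial dimension is halved, so every later layer processes a quarter as many positions.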

In step S103, the image features are processed through the multitask branch network layer, so as to obtain a first edge feature map representing edges of a human hand in the image to be detected, a first region feature map representing a human hand region in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected.

In one embodiment, the multi-task branch network layer may include a plurality of network layers, and the electronic device may process the image features of the image to be detected through these network layers, respectively, to obtain a feature map representing the edge of the human hand in the image to be detected (i.e., the first edge feature map in the embodiment of the present disclosure), a feature map representing the human hand region in the image to be detected (i.e., the first region feature map in the embodiment of the present disclosure), and a feature map representing two-dimensional human hand key points in the image to be detected (i.e., the first key point feature map in the embodiment of the present disclosure).

Optionally, the multi-task branch network layer includes an encoder, an edge decoder (Edge Decoder), a region decoder (Mask Decoder), and a heat map decoder (Heat-map Decoder), and S103 may include the following steps:

Step one: the image features are encoded by the encoder to obtain high-level image semantic information of the image to be detected.

The encoder-decoder structure is standard practice in the field of deep learning; the encoder is used to extract high-level image semantic information from an image.

Step two: the high-level image semantic information is decoded by the edge decoder to obtain a first edge feature map representing the edge of the human hand in the image to be detected; the high-level image semantic information is decoded by the region decoder to obtain a first region feature map representing the human hand region in the image to be detected; and the high-level image semantic information is decoded by the heat map decoder to obtain a first key point feature map representing two-dimensional human hand key points in the image to be detected.

In one embodiment, the edge decoder may decode the high-level image semantic information of the image to be detected to obtain an edge feature map with a size of 256 × 256 for predicting the edge of the human hand in the image to be detected. The region decoder may decode the high-level image semantic information of the image to be detected to obtain a region feature map with a size of 256 × 256 for predicting the human hand region in the image to be detected. The heat map decoder may decode the high-level image semantic information of the image to be detected to obtain a plurality of key point feature maps, each with a size of 256 × 256, for predicting two-dimensional human hand key points in the image to be detected.
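A common way to read two-dimensional key points out of such key point feature maps is to take the highest-scoring pixel of each map. The disclosure does not say how key points are read out, so the argmax rule below, the one-map-per-key-point layout, and the names `keypoint_from_heatmap` and `keypoints_from_heatmaps` are assumptions for illustration.

```python
from typing import List, Tuple

HeatMap = List[List[float]]


def keypoint_from_heatmap(heatmap: HeatMap) -> Tuple[int, int]:
    """Return (x, y) of the highest-scoring pixel in one heat map."""
    best_x, best_y, best = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best:
                best_x, best_y, best = x, y, score
    return best_x, best_y


def keypoints_from_heatmaps(heatmaps: List[HeatMap]) -> List[Tuple[int, int]]:
    """One key point per heat map, in the order the maps are given."""
    return [keypoint_from_heatmap(h) for h in heatmaps]


size = 256
maps = []
for peak_x, peak_y in [(10, 20), (200, 37)]:  # toy peaks, one per key point
    hm = [[0.0] * size for _ in range(size)]
    hm[peak_y][peak_x] = 1.0
    maps.append(hm)
print(keypoints_from_heatmaps(maps))  # [(10, 20), (200, 37)]
```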

In step S104, the addition result of the first edge feature map, the first region feature map, and the first keypoint feature map is regressed by the regression layer to obtain a first posture parameter representing the posture of the human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected.

The regression layer may be composed of a convolutional layer and a fully connected layer.

In one embodiment, the electronic device may superimpose the first edge feature map, the first region feature map, and the first key point feature map, and then regress the superimposed result through a regression layer composed of a convolutional layer and a fully connected layer, to obtain a parameter representing the posture of the human hand in the image to be detected (i.e., the first posture parameter in the embodiment of the present disclosure) and a parameter representing the shape of the human hand in the image to be detected (i.e., the first shape parameter in the embodiment of the present disclosure).
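The superimpose-then-regress step can be sketched as an element-wise sum followed by one fully connected layer. This is a simplified reading under stated assumptions: the maps are single-channel and equally sized, the convolutional part of the regression layer is omitted, and the weights, bias, and the names `add_maps` and `regress` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def add_maps(maps: List[Matrix]) -> Matrix:
    """Element-wise sum of equally sized single-channel maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) for x in range(width)] for y in range(height)]


def regress(fused: Matrix, weights: Matrix, bias: List[float]) -> List[float]:
    """Flatten the fused map and apply one fully connected layer."""
    flat = [v for row in fused for v in row]
    return [sum(w * v for w, v in zip(w_row, flat)) + b
            for w_row, b in zip(weights, bias)]


edge = [[1.0, 0.0], [0.0, 1.0]]
region = [[1.0, 1.0], [0.0, 1.0]]
keypoint = [[0.0, 0.0], [0.0, 1.0]]
fused = add_maps([edge, region, keypoint])

# Hypothetical 3-output layer; a real model would emit many more parameters.
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
bias = [0.0, 0.0, 1.0]
params = regress(fused, weights, bias)
print(fused, params)  # [[2.0, 1.0], [0.0, 3.0]] [2.0, 3.0, 4.0]
```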

In step S105, a three-dimensional model of the human hand in the image to be detected is generated through the MANO network based on the first posture parameter and the first shape parameter.

The MANO network is a parametric model of the human hand provided by the Perceiving Systems department of the Max Planck Institute for Intelligent Systems, and can generate a three-dimensional model of the human hand from human hand posture parameters and human hand shape parameters.
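The idea of a parametric hand model can be illustrated with a toy stand-in: a template mesh is deformed by shape coefficients and then posed by a rotation. This is not MANO itself, which additionally applies pose-dependent corrections and skinning over the hand joints; the sketch keeps only a linear shape term and one global axis-angle rotation, and the names `rodrigues` and `toy_hand_model` and all data values are hypothetical.

```python
import math
from typing import List, Sequence

Vec3 = List[float]


def rodrigues(axis_angle: Sequence[float]) -> List[List[float]]:
    """Rotation matrix for an axis-angle vector (Rodrigues' formula)."""
    angle = math.sqrt(sum(a * a for a in axis_angle))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (a / angle for a in axis_angle)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def toy_hand_model(template: List[Vec3], shape_dirs: List[List[Vec3]],
                   shape: Sequence[float], global_rot: Sequence[float]) -> List[Vec3]:
    """vertices = R(global_rot) applied to (template + sum_i shape[i] * shape_dirs[i])."""
    rot = rodrigues(global_rot)
    out = []
    for v_idx, vertex in enumerate(template):
        shaped = [
            vertex[k] + sum(b * d[v_idx][k] for b, d in zip(shape, shape_dirs))
            for k in range(3)
        ]
        out.append([sum(rot[r][k] * shaped[k] for k in range(3)) for r in range(3)])
    return out


template = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # two toy vertices
shape_dirs = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]     # one shape direction
vertices = toy_hand_model(template, shape_dirs, shape=[0.5],
                          global_rot=[0.0, 0.0, math.pi / 2])
# vertices is approximately [[0, 1.5, 0], [-1, 0, 0]]
```

Because the output is always a deformation of a complete hand template, every vertex has a position even when part of the hand is not visible in the image.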

With the human hand recognition method provided by the embodiment of the present disclosure, the human hand edge, the human hand region and the two-dimensional human hand key points in an image can be recognized, and a three-dimensional model of the human hand in the image can be generated, through a single network model (namely, the human hand recognition network model), so that human hand recognition efficiency can be improved. In addition, the multi-task branch network layer makes full use of the annotation information of the image, which improves the generalization performance of the human hand recognition network model and, correspondingly, the accuracy of the human hand recognition result. Furthermore, because the three-dimensional model of the human hand is generated by the MANO network, the ambiguity problem that traditional methods cannot avoid, namely that occluded parts of the human hand in the image cannot be accurately mapped into three-dimensional space, can be resolved.
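Steps S101 to S105 amount to a fixed composition of stages inside one model. The sketch below shows only that ordering; the class name `HandRecognitionPipeline`, its field names, and the scalar stand-ins for each stage are hypothetical and carry no information about the real sub-networks.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class HandRecognitionPipeline:
    feature_extractor: Callable[[Any], Any]            # S102
    multitask_branch: Callable[[Any], Sequence[Any]]   # S103: (edge, region, key point)
    fuse: Callable[[Sequence[Any]], Any]               # addition of the three maps
    regression: Callable[[Any], Tuple[Any, Any]]       # S104: (posture, shape)
    mano: Callable[[Any, Any], Any]                    # S105

    def __call__(self, image: Any) -> Any:
        features = self.feature_extractor(image)
        maps = self.multitask_branch(features)
        posture, shape = self.regression(self.fuse(maps))
        return self.mano(posture, shape)


# Scalar stand-ins so the data flow can be followed by hand.
pipeline = HandRecognitionPipeline(
    feature_extractor=lambda img: img * 2,
    multitask_branch=lambda f: (f, f + 1, f + 2),
    fuse=sum,
    regression=lambda x: (x, x / 2),
    mano=lambda p, s: {"posture": p, "shape": s},
)
result = pipeline(1)
print(result)  # {'posture': 9, 'shape': 4.5}
```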

Optionally, the human hand recognition network model further includes a differential rendering layer, and after S103, the method may further include the following steps:

Step one: the addition result of the first edge feature map, the first region feature map and the first key point feature map is regressed through the regression layer to obtain a first camera parameter.

In one embodiment, when the electronic device regresses the superposition result of the first edge feature map, the first region feature map, and the first keypoint feature map, corresponding camera parameters (i.e., the first camera parameters in the embodiment of the present disclosure) may also be obtained.

Step two: based on the first camera parameter, the three-dimensional model of the human hand in the image to be detected is projected through the differential rendering layer to obtain first human hand projection information.

The first human hand projection information includes at least one of the following: the projected human hand region of the image to be detected, the projected two-dimensional human hand key points of the image to be detected, and the projected three-dimensional human hand key points of the image to be detected.

In one embodiment, after obtaining the first camera parameter, the electronic device may project, through the differential rendering layer, the three-dimensional model of the human hand in the image to be detected based on the first camera parameter.

According to actual needs, the electronic device may obtain, through projection, any one or any combination of the following: the human hand region in the image to be detected, the two-dimensional human hand key points in the image to be detected, and the three-dimensional human hand key points in the image to be detected.
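Projecting three-dimensional key points to two dimensions with regressed camera parameters can be sketched with a weak-perspective camera. The disclosure does not define the form of the camera parameter, so the (scale, tx, ty) parameterization and the name `project_weak_perspective` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def project_weak_perspective(
    points_3d: Sequence[Tuple[float, float, float]],
    scale: float,
    tx: float,
    ty: float,
) -> List[Tuple[float, float]]:
    """Drop depth, then scale and translate: (x, y, z) -> (scale*x + tx, scale*y + ty)."""
    return [(scale * x + tx, scale * y + ty) for x, y, _z in points_3d]


keypoints_3d = [(0.0, 0.0, 1.0), (1.0, 2.0, 5.0)]  # toy 3D key points
projected = project_weak_perspective(keypoints_3d, scale=100.0, tx=128.0, ty=128.0)
print(projected)  # [(128.0, 128.0), (228.0, 328.0)]
```

Every operation here is a multiplication or an addition, so the projected coordinates are differentiable with respect to both the camera parameters and the 3D points.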

Referring to fig. 2, fig. 2 is a schematic diagram illustrating a structure of a human hand recognition network model according to an exemplary embodiment.

In fig. 2, the human hand recognition network model includes a feature extractor, a multi-task branching network layer, a regression layer, a MANO network, and a differential rendering layer.

The multitasking branch network layer may include an encoder, an edge decoder, a region decoder, and a heatmap decoder.

θ_mesh output by the regression layer represents the posture parameter and the shape parameter, and θ_cam represents the camera parameter.
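If the regression layer emits one flat vector, the mesh parameters (posture and shape) and the camera parameters can be recovered by slicing it. The dimensions below are assumptions not given in the disclosure: 48 posture values and 10 shape values follow the commonly used MANO parameterization, and 3 camera values assume a scale plus a 2D translation. `HandParams` and `split_regression_output` are hypothetical names.

```python
from typing import List, NamedTuple, Sequence


class HandParams(NamedTuple):
    posture: List[float]  # first part of the mesh parameters
    shape: List[float]    # second part of the mesh parameters
    cam: List[float]      # camera parameters


# Assumed dimensions, not taken from the disclosure.
POSTURE_DIM, SHAPE_DIM, CAM_DIM = 48, 10, 3


def split_regression_output(vector: Sequence[float]) -> HandParams:
    """Slice a flat regression output into posture, shape and camera parts."""
    expected = POSTURE_DIM + SHAPE_DIM + CAM_DIM
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    posture = list(vector[:POSTURE_DIM])
    shape = list(vector[POSTURE_DIM:POSTURE_DIM + SHAPE_DIM])
    cam = list(vector[POSTURE_DIM + SHAPE_DIM:])
    return HandParams(posture, shape, cam)


hand = split_regression_output([float(i) for i in range(61)])
print(len(hand.posture), len(hand.shape), hand.cam)  # 48 10 [58.0, 59.0, 60.0]
```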

Optionally, referring to fig. 3, fig. 3 is a flowchart illustrating a method for training a human hand recognition network model according to an exemplary embodiment, where the method may include the following steps:

In step S301, a sample image including a human hand is input to the initial human hand recognition network model, and a second edge feature map representing the edge of the human hand in the sample image, a second region feature map representing the human hand region in the sample image, and a second key point feature map representing two-dimensional human hand key points in the sample image are obtained.

The sample image has an annotated human hand region, two-dimensional human hand key points, and three-dimensional human hand key points.

In one embodiment, the electronic device may obtain an initial human hand recognition network model shown in fig. 2, input a sample image labeled with a human hand region, a two-dimensional human hand key point, and a three-dimensional human hand key point into the human hand recognition network model, and obtain, through the feature extractor and the multitask branch network layer, a feature map representing edges of human hands in the sample image (i.e., the second edge feature map in the present disclosure), a feature map representing the human hand region in the sample image (i.e., the second region feature map in the present disclosure), and a feature map representing the two-dimensional human hand key point in the sample image (i.e., the second key point feature map in the present disclosure).

In step S302, the addition result of the second edge feature map, the second region feature map, and the second keypoint feature map is regressed by the regression layer to obtain a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image, and a second shape parameter representing the shape of the human hand in the sample image.

In one embodiment, the electronic device may add the second edge feature map, the second region feature map, and the second key point feature map, and then regress the sum through a regression layer composed of a convolution layer and a fully connected layer, obtaining a parameter representing the pose of the human hand in the sample image (i.e., the second posture parameter in the embodiments of the present disclosure), a parameter representing the shape of the human hand in the sample image (i.e., the second shape parameter in the embodiments of the present disclosure), and a camera parameter (i.e., the second camera parameter in the embodiments of the present disclosure).
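The addition and regression in step S302 can be illustrated with a minimal sketch. It assumes the three feature maps share one shape, replaces the convolution and fully connected layers with a single linear map for brevity, and uses placeholder sizes (48 pose values, 10 shape values, 3 camera values) that are not stated in the source. All function names are hypothetical.

```python
# Sketch only: sizes and weights below are illustrative assumptions.
N_POSE, N_SHAPE, N_CAM = 48, 10, 3


def add_feature_maps(*maps):
    """Element-wise sum of equally sized 2-D feature maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("feature maps must share one shape")
    return [[sum(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


def regress(fused, weights, bias):
    """Stand-in for the regression layer: flatten, then one linear map."""
    flat = [v for row in fused for v in row]
    return [sum(w * x for w, x in zip(w_row, flat)) + b
            for w_row, b in zip(weights, bias)]


def split_theta(theta):
    """Split the regressed vector into pose, shape and camera parameters."""
    if len(theta) != N_POSE + N_SHAPE + N_CAM:
        raise ValueError("unexpected parameter vector length")
    pose = theta[:N_POSE]
    shape = theta[N_POSE:N_POSE + N_SHAPE]
    cam = theta[N_POSE + N_SHAPE:]
    return pose, shape, cam


# Toy 2x2 feature maps for the edge, region and key point branches.
edge = [[0.0, 1.0], [1.0, 0.0]]
region = [[1.0, 1.0], [1.0, 0.0]]
keypoint = [[0.0, 0.5], [0.0, 0.0]]
fused = add_feature_maps(edge, region, keypoint)

n_out = N_POSE + N_SHAPE + N_CAM
weights = [[0.01 * (i + 1)] * 4 for i in range(n_out)]  # arbitrary weights
bias = [0.0] * n_out
pose, shape, cam = split_theta(regress(fused, weights, bias))
```

In a real model the weights would be learned, and the regression layer would operate on multi-channel tensors rather than nested lists.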

In step S303, a three-dimensional model of the human hand in the sample image is generated as a sample three-dimensional model through the MANO network based on the second posture parameter and the second shape parameter.

The method for generating the sample three-dimensional model is similar to the method for generating the three-dimensional model of the human hand in the image to be detected in the foregoing embodiment, and is not repeated here.

In step S304, based on the second camera parameter, the sample three-dimensional model is projected through the differential rendering layer, so as to obtain second hand projection information.

The second human hand projection information comprises at least one of: the human hand region, the two-dimensional human hand key points, and the three-dimensional human hand key points obtained by projecting the sample three-dimensional model.

The method for generating the second human hand projection information is similar to the method for generating the first human hand projection information in the foregoing embodiment, and is not described again.
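As a rough illustration of how camera parameters can map three-dimensional key points to the image plane, the sketch below uses a weak-perspective camera with a scale and a two-dimensional translation. The source does not specify the camera model, so this choice, the function name, and the sample values are all assumptions.

```python
def project_weak_perspective(points_3d, cam):
    """Project 3-D points to 2-D with an assumed camera (scale, tx, ty).

    Depth is dropped; x and y are shifted and then scaled.
    """
    scale, tx, ty = cam
    return [(scale * (x + tx), scale * (y + ty)) for x, y, _z in points_3d]


# Two hypothetical 3-D key points of the sample three-dimensional model.
joints_3d = [(0.0, 0.0, 0.5), (0.25, -0.25, 0.4)]
joints_2d = project_weak_perspective(joints_3d, cam=(100.0, 0.5, 0.5))
```

A differentiable renderer would additionally rasterize the mesh to obtain the projected human hand region; that part is omitted here.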

In step S305, the model parameters of the human hand recognition network model are trained according to the difference between the second human hand projection information and the corresponding human hand information annotated in the sample image.

The difference between the human hand region obtained by projecting the sample three-dimensional model and the human hand region annotated in the sample image may be referred to as a first difference. The difference between the two-dimensional human hand key points obtained by projecting the sample three-dimensional model and the two-dimensional human hand key points annotated in the sample image may be referred to as a second difference. The difference between the three-dimensional human hand key points obtained by projecting the sample three-dimensional model and the three-dimensional human hand key points annotated in the sample image may be referred to as a third difference.

In one embodiment, the electronic device may train the model parameters of the human hand recognition network model according to any one of the first difference, the second difference, and the third difference, or according to any combination of them.
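The source does not state how each difference is measured. The sketch below shows one plausible choice under stated assumptions: the region difference is taken as one minus the intersection-over-union of two binary masks, and the key point difference as the mean squared distance between matched points (usable for both the second and the third difference). The function names are hypothetical.

```python
def region_difference(projected_mask, labeled_mask):
    """1 - IoU between two binary masks of the same shape (assumed measure)."""
    inter = union = 0
    for p_row, l_row in zip(projected_mask, labeled_mask):
        for p, l in zip(p_row, l_row):
            inter += 1 if (p and l) else 0
            union += 1 if (p or l) else 0
    return 0.0 if union == 0 else 1.0 - inter / union


def keypoint_difference(projected, labeled):
    """Mean squared distance between matched key points (2-D or 3-D)."""
    if len(projected) != len(labeled) or not projected:
        raise ValueError("key point lists must be non-empty and equally long")
    total = 0.0
    for p, q in zip(projected, labeled):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(projected)


first = region_difference([[1, 1], [0, 0]], [[1, 0], [0, 0]])
second = keypoint_difference([(0.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)])
```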

In step S306, when the human hand recognition network model converges, a trained human hand recognition network model is obtained.
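The source does not say how convergence is judged. One common heuristic, shown here purely as an assumption, is to stop when the training loss has changed by less than a small tolerance over a recent window of iterations. The function name, window size, and tolerance are hypothetical.

```python
def has_converged(loss_history, window=5, tolerance=1e-4):
    """Return True when the last window+1 losses vary by less than tolerance."""
    if len(loss_history) < window + 1:
        return False
    recent = loss_history[-(window + 1):]
    return max(recent) - min(recent) < tolerance


still_training = has_converged([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
plateaued = has_converged([0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
```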

Optionally, to further improve the accuracy of the human hand recognition network model, the method may further include the following step: predicting the human hand region in the sample image based on the second region feature map, and predicting the two-dimensional human hand key points in the sample image based on the second key point feature map.

In one embodiment, after obtaining the second region feature map and the second key point feature map of the sample image through the multi-task branch network layer, the electronic device may predict a human hand region in the sample image based on the second region feature map, and predict a two-dimensional human hand key point in the sample image based on the second key point feature map.

The second region feature map and the second keypoint feature map may include a plurality of feature values, each feature value representing a feature of a respective position in the sample image.

In one embodiment, the electronic device may determine, in the second region feature map, first feature values capable of characterizing the human hand region, and then take the region formed by the positions in the sample image corresponding to the first feature values as the human hand region in the sample image.

For example, the electronic device may determine a feature value greater than a first preset threshold in the second region feature map as the first feature value.
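As an illustration of this thresholding, the following minimal sketch marks every position whose feature value exceeds the first preset threshold and collects those positions as the human hand region. The function name, the nested-list representation, and the threshold value 0.5 are assumptions made for illustration; the sketch also assumes the feature map has the same resolution as the sample image.

```python
def hand_region_from_map(region_map, threshold):
    """Return (mask, positions) for feature values above the threshold.

    positions are (x, y) pairs, i.e. (column, row) indices.
    """
    mask = [[1 if v > threshold else 0 for v in row] for row in region_map]
    positions = [(c, r) for r, row in enumerate(mask)
                 for c, v in enumerate(row) if v]
    return mask, positions


region_map = [[0.1, 0.8], [0.6, 0.3]]
mask, positions = hand_region_from_map(region_map, threshold=0.5)
```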

In addition, the electronic device may determine, in the second key point feature map, a second feature value capable of characterizing a two-dimensional human hand key point, and then take the position in the sample image corresponding to the second feature value as the two-dimensional human hand key point in the sample image.

For example, the electronic device may determine, as the second feature value, a feature value in the second keypoint feature map that is greater than a second preset threshold.

The number of the second key point feature maps can be multiple, the number of the two-dimensional human hand key points can be 21, and gestures can be determined through the 21 two-dimensional human hand key points.

Therefore, the number of the second key point feature maps can be 21, and the electronic device determines a two-dimensional human hand key point according to one second key point feature map, and further can determine 21 two-dimensional human hand key points.
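The one-map-per-key-point scheme can be sketched as follows. The source only requires the feature value to exceed the second preset threshold; picking the single largest value above that threshold in each map is an added assumption, as are the function names, the map size, and the threshold.

```python
NUM_KEYPOINTS = 21


def keypoint_from_heatmap(heatmap, threshold):
    """Return the (x, y) of the largest value above threshold, or None."""
    best, best_value = None, threshold
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best_value:
                best, best_value = (c, r), v
    return best


def keypoints_from_heatmaps(heatmaps, threshold):
    """Decode one two-dimensional key point from each key point feature map."""
    return [keypoint_from_heatmap(h, threshold) for h in heatmaps]


def make_heatmap(size, peak):
    """Toy heatmap with one strong response at the given (x, y)."""
    return [[0.9 if (c, r) == peak else 0.05 for c in range(size)]
            for r in range(size)]


heatmaps = [make_heatmap(8, (k % 8, k // 8)) for k in range(NUM_KEYPOINTS)]
keypoints = keypoints_from_heatmaps(heatmaps, threshold=0.5)
```

A map with no value above the threshold yields `None`, which a caller could treat as an undetected key point.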

Correspondingly, the electronic device may train the model parameters of the human hand recognition network model by combining the difference between the second human hand projection information and the corresponding human hand information annotated in the sample image with the difference between the predicted human hand information and the corresponding human hand information annotated in the sample image.

Wherein the predicted human hand information comprises a predicted human hand region in the sample image and/or a predicted two-dimensional human hand keypoint in the sample image.

The difference between the predicted human hand region in the sample image and the human hand region annotated in the sample image may be referred to as a fourth difference. The difference between the predicted two-dimensional human hand key points in the sample image and the two-dimensional human hand key points annotated in the sample image may be referred to as a fifth difference.

In one embodiment, the electronic device may obtain the fourth difference and/or the fifth difference, together with any one of the first difference, the second difference, and the third difference or any combination of them, and train the model parameters of the human hand recognition network model according to the obtained differences.
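One simple way to train on whichever differences are available is to form a weighted sum and skip the terms that were not computed. The sketch below assumes this scheme; the source does not specify weights or how the terms are combined, and the function name and dictionary keys are hypothetical.

```python
def total_loss(differences, weights=None):
    """Weighted sum of the available difference terms; None terms are skipped."""
    weights = weights or {}
    used = {name: value for name, value in differences.items()
            if value is not None}
    if not used:
        raise ValueError("at least one difference is required")
    # Terms without an explicit weight default to a weight of 1.0.
    return sum(weights.get(name, 1.0) * value for name, value in used.items())


differences = {
    "first": 0.5,    # projected region vs. annotated region
    "second": 2.0,   # projected 2-D key points vs. annotated 2-D key points
    "third": None,   # not used in this example
    "fourth": 0.25,  # predicted region vs. annotated region
    "fifth": None,   # not used in this example
}
loss = total_loss(differences, weights={"second": 0.5})
```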

In addition, the second edge feature map may include a plurality of feature values, each feature value representing a feature at a respective location in the sample image.

The electronic device may determine, in the second edge feature map, third feature values capable of characterizing the edge of the human hand, and then take the positions in the sample image corresponding to the third feature values as the edge of the human hand in the sample image.

For example, the electronic device may determine, as the third feature value, a feature value in the second edge feature map that is greater than a third preset threshold.

FIG. 4 is a block diagram illustrating a human hand recognition device in accordance with one exemplary embodiment. Referring to fig. 4, the apparatus includes a first processing module 401, an extraction module 402, a second processing module 403, a regression module 404, and a generation module 405.

A first processing module 401 configured to input an image to be detected including a human hand to a pre-trained human hand recognition network model, wherein the human hand recognition network model includes a feature extractor, a multi-task branch network layer, a regression layer, and a MANO network;

an extraction module 402, configured to perform feature extraction on the image to be detected through the feature extractor, so as to obtain image features of the image to be detected;

a second processing module 403, configured to perform processing on the image features through the multi-task branch network layer, so as to obtain a first edge feature map representing edges of a human hand in the image to be detected, a first region feature map representing a human hand region in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected;

a regression module 404 configured to perform regression on an addition result of the first edge feature map, the first region feature map, and the first keypoint feature map through the regression layer to obtain a first posture parameter representing a posture of a human hand in the image to be detected and a first shape parameter representing a shape of the human hand in the image to be detected;

a generating module 405 configured to generate a three-dimensional model of the human hand in the image to be detected through the MANO network based on the first posture parameter and the first shape parameter.

The second processing module 403 is specifically configured to encode the image features through the encoder to obtain high-level image semantic information of the image to be detected;

Optionally, the apparatus further comprises:

The training module is specifically configured to train the model parameters of the human hand recognition network model by combining the difference between the second human hand projection information and the corresponding human hand information annotated in the sample image with the difference between predicted human hand information and the corresponding human hand information annotated in the sample image, wherein the predicted human hand information includes a predicted human hand region in the sample image and/or predicted two-dimensional human hand key points in the sample image.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
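The data flow between modules 402 to 405 can be summarized in a short sketch. The class name, field names, and the stub callables are hypothetical stand-ins; real modules would wrap the feature extractor, the multi-task branch network layer, the regression layer, and the MANO network.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class HandRecognitionPipeline:
    extract: Callable[[Any], Any]                  # extraction module 402
    branch: Callable[[Any], Tuple[Any, Any, Any]]  # second processing module 403
    regress: Callable[[Any, Any, Any],
                      Tuple[Sequence[float], Sequence[float]]]  # regression module 404
    generate: Callable[[Sequence[float], Sequence[float]], Any]  # generation module 405

    def run(self, image: Any) -> Any:
        """Pass an image through the four modules in order."""
        features = self.extract(image)
        edge_map, region_map, keypoint_map = self.branch(features)
        pose, shape = self.regress(edge_map, region_map, keypoint_map)
        return self.generate(pose, shape)


# Stub callables that only demonstrate the wiring, not real inference.
pipeline = HandRecognitionPipeline(
    extract=lambda image: image,
    branch=lambda feats: (feats, feats, feats),
    regress=lambda e, r, k: ([0.0] * 3, [0.0] * 2),
    generate=lambda pose, shape: {"pose": pose, "shape": shape},
)
model = pipeline.run([[0.0]])
```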

FIG. 5 is a block diagram illustrating an electronic device 500 for recognizing a human hand in accordance with an exemplary embodiment. For example, the electronic device 500 may be provided as a server. Referring to fig. 5, electronic device 500 includes a processing component 522 that further includes one or more processors and memory resources, represented by memory 532, for storing instructions, such as applications, that are executable by processing component 522. The application programs stored in memory 532 may include one or more modules that each correspond to a set of instructions. Further, the processing component 522 is configured to execute instructions to perform the human hand recognition method described above.

The electronic device 500 may also include a power component 526 configured to perform power management of the electronic device 500, a wired or wireless network interface 550 configured to connect the electronic device 500 to a network, and an input/output (I/O) interface 558. The electronic device 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or a similar operating system.

FIG. 6 is a block diagram illustrating an electronic device for recognizing a human hand in accordance with an exemplary embodiment. For example, the electronic device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 6, the electronic device may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.

The processing component 602 generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or part of the steps of the human hand recognition method described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.

The memory 604 is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 606 provides power to the various components of the electronic device. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.

The multimedia component 608 includes a screen that provides an output interface between the electronic device and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.

The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the electronic device. For example, the sensor component 614 may detect an open/closed state of the electronic device and the relative positioning of components, such as a display and a keypad of the electronic device. The sensor component 614 may also detect a change in the position of the electronic device or one of its components, the presence or absence of user contact with the electronic device, the orientation or acceleration/deceleration of the electronic device, and a change in the temperature of the electronic device. The sensor component 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 616 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a storage medium comprising instructions is also provided, such as the memory 604 comprising instructions, where the instructions are executable by the processor 620 of the electronic device to perform the human hand recognition method described above. Optionally, the storage medium may be a non-transitory computer-readable storage medium, for example a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A human hand recognition method, comprising:

2. The human hand recognition method of claim 1, wherein the multi-task branch network layer comprises an encoder, an edge decoder, a region decoder and a heatmap decoder;

3. The human hand recognition method of claim 1, wherein the human hand recognition network model further comprises a differential rendering layer;

4. The human hand recognition method of claim 1, wherein the human hand recognition network model further comprises a differential rendering layer;

5. The human hand recognition method according to claim 4, wherein after inputting the sample image containing the human hand into the initial human hand recognition network model to obtain the second edge feature map representing the edges of the human hand in the sample image, the second region feature map representing the human hand region in the sample image, and the second key point feature map representing the two-dimensional human hand key points in the sample image, the method further comprises:

6. A human hand recognition device, comprising:

7. The human hand recognition device of claim 6, wherein the multi-task branch network layer comprises an encoder, an edge decoder, a region decoder and a heatmap decoder;

8. A human hand recognition device according to claim 6, wherein the human hand recognition network model further comprises a differential rendering layer;

the device further comprises:

9. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the human hand recognition method of any one of claims 1 to 5.

10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the human hand recognition method of any one of claims 1 to 5.