CN111523387A - Method and device for detecting hand key points and computer device - Google Patents

Method and device for detecting hand key points and computer device

Info

Publication number
CN111523387A
CN111523387A
Authority
CN
China
Prior art keywords
hand
point
key
key point
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010211811.1A
Other languages
Chinese (zh)
Other versions
CN111523387B (en)
Inventor
Lin Jian (林健)
Zhou Zhimin (周志敏)
Liu Haiwei (刘海伟)
Cong Lin (丛林)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co ltd
Original Assignee
Hangzhou Yixian Advanced Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yixian Advanced Technology Co ltd filed Critical Hangzhou Yixian Advanced Technology Co ltd
Priority to CN202010211811.1A priority Critical patent/CN111523387B/en
Publication of CN111523387A publication Critical patent/CN111523387A/en
Application granted granted Critical
Publication of CN111523387B publication Critical patent/CN111523387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, a device, and a computer device for detecting hand key points. The detection method comprises: acquiring an image set and generating a mask of the hand key points according to the real position labels of the hand key points in the image set, the hand key points being classified as fingertip points and palm center points; analyzing key point identification information of the hand key points according to a feature map of the image set; obtaining predicted position label values of the hand key points according to the mask and the key point identification information; performing loss function regression on the hand key points according to the predicted position label values and the real position labels to obtain a hand key point detection model; and performing hand key point detection through the hand key point detection model.

Description

Method and device for detecting hand key points and computer device
Technical Field
The present application relates to the field of computer vision, and in particular, to a method, device, and computer device for detecting key points of a hand.
Background
With the development of human-computer interaction technology, interaction modes such as the keyboard, mouse, and touch screen can hardly meet user needs in many emerging fields; in remote-control scenarios such as Augmented Reality (AR), users increasingly prefer wireless, contactless interaction. Interaction based on computer vision frees the user from cumbersome interaction devices and lets specific body actions issue instructions to a machine, which is convenient and fast; the development of computer vision thus meets the need for wireless, contactless human-computer interaction.
In the related art, key point pose estimation is achieved by screening key points and connecting them sequentially in a preset order. This prediction-and-aggregation approach depends on an ordered arrangement of the key points, so that each finger must be identified as the thumb, index finger, middle finger, and so on.
At present, no effective solution has been proposed for the high cost of pose estimation methods that depend on ordered key points in the related art.
Disclosure of Invention
The embodiments of the application provide a method, a device, a computer device, and a computer-readable storage medium for detecting hand key points, aiming at least to solve the high cost, in the related art, of pose estimation methods that rely on ordered key points.
In a first aspect, an embodiment of the present application provides a method for detecting a hand keypoint, where the method includes:
acquiring an image set, and generating a mask of the key points of the hand according to the real position labels of the key points of the hand in the image set, wherein the key points of the hand are classified into a fingertip point and a palm center point;
analyzing key point identification information of the hand key points according to the feature map of the image set, and obtaining predicted position label values of the hand key points according to the mask and the key point identification information;
and performing loss function regression on the hand key points according to the predicted position label value and the real position label to obtain a hand key point detection model, and performing hand key point detection through the hand key point detection model.
In some embodiments, the obtaining the predicted position tag value of the hand keypoint comprises:
and adjusting the predicted position label value according to the absolute position information of the key point of the hand.
In some of these embodiments, obtaining absolute position information for the hand keypoints comprises:
the absolute position information is derived from a grid map, wherein the grid map provides grid point coordinates.
In some of these embodiments, obtaining absolute position information for the hand keypoints comprises:
and obtaining the absolute position information from an offset map, wherein the predicted position tag value is adjusted according to the product of the offset map and the predicted position tag value.
In some embodiments, the performing of hand keypoint detection by the hand keypoint detection model comprises:
acquiring the palm center points and fingertip points of the image to be detected through the hand key point detection model;
and, taking each palm center point as a reference, aggregating the palm center point and the fingertip points according to their label values, wherein a palm center point's label value corresponds to the label values of its fingertip points.
In some embodiments, the generating a mask of the hand keypoints according to the true position labels of the hand keypoints in the image set comprises:
determining mask parameters according to the number of hands in the image set, and distinguishing the foreground and the background of each hand according to the real position label of the key point of the hand to obtain a hand mask;
and obtaining the mask of the key points of the hand of the image set according to the mask parameter and the hand mask.
In some of these embodiments, said performing a regression of the loss function on said hand keypoints comprises:
screening the predicted position tag values according to the palm center point information in the mask, and retaining the image data that includes hands;
and performing loss function regression on the image data according to the mean of the hand's predicted position label values and the mean of the key point label values.
In a second aspect, an embodiment of the present application provides a device for detecting a hand keypoint, where the device includes a generation module, a prediction module, and a regression module:
the generating module is used for acquiring an image set, and generating a mask of the key points of the hand according to the real position labels of the key points of the hand in the image set, wherein the key points of the hand are classified into a fingertip point and a palm center point;
the prediction module is used for analyzing the key point identification information of the hand key point according to the feature map of the image set and obtaining a predicted position label value of the hand key point according to the mask and the key point identification information;
and the regression module is used for performing loss function regression on the hand key points according to the predicted position label value and the real position label to obtain a hand key point detection model, and performing hand key point detection through the hand key point detection model.
In some of these embodiments, the prediction module comprises an adjustment unit:
the adjusting unit is used for adjusting the predicted position label value according to the absolute position information of the hand key point.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method described in any one of the above.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method as described in any one of the above.
Compared with the related art, the method for detecting hand key points provided by the embodiments of the application acquires an image set and generates a mask of the hand key points from the real position labels of the hand key points in the image set, the hand key points being classified only as fingertip points and palm center points; analyzes the key point identification information of the hand key points from the feature map of the image set; obtains the predicted position label values of the hand key points from the mask and the key point identification information; performs loss function regression on the hand key points according to the predicted position label values and the real position labels to obtain a hand key point detection model; and performs hand key point detection through that model. This solves the high cost of related-art pose estimation methods that depend on ordered key points, improves the scene adaptability of hand key point detection, and reduces its cost.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an application environment of a method for hand keypoint detection according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of hand keypoint detection according to an embodiment of the present application;
FIG. 3 is a schematic diagram of grid point coordinates in the x-direction according to an embodiment of the present application;
FIG. 4 is a grid point coordinate diagram in the y-direction according to an embodiment of the present application;
FIG. 5 is an offset graph according to an embodiment of the present application;
FIG. 6 is a flow diagram of another method of hand keypoint detection according to an embodiment of the present application;
FIG. 7 is a flow diagram of a method of generating a hand keypoint mask according to an embodiment of the present application;
FIG. 8 is a flow diagram of a method of regression of a loss function according to an embodiment of the present application;
FIG. 9 is a block diagram of a device for hand keypoint detection according to an embodiment of the present application;
FIG. 10 is a block diagram of another apparatus for hand keypoint detection according to an embodiment of the present application;
fig. 11 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method for detecting hand key points provided by the present application can be applied to the application environment shown in FIG. 1, a schematic diagram of the application environment of the method according to an embodiment of the present application. The camera terminal 102 acquires an image set of the hand 104; the position tracker 108 tracks the hand 104 in the image set and generates the real position labels of its key points; the server 106 generates a mask of the hand key points according to the real position labels, the hand key points being classified as fingertip points and palm center points. The server 106 then analyzes the key point identification information of the hand key points from the feature map of the image set, obtains the predicted position label values of the hand key points from the mask and the key point identification information, performs loss function regression on the hand key points according to the predicted position label values and the real position labels to obtain a hand key point detection model, and performs hand key point detection through that model. The feature map is generated by a neural network, whose framework may be PyTorch, TensorFlow, Keras, or the like; the camera terminal 102 may be a depth camera, and the server 106 may be implemented as an independent server or as a cluster of multiple servers.
The embodiment provides a method for detecting key points of a hand. Fig. 2 is a flowchart of a method of hand keypoint detection according to an embodiment of the application, which, as shown in fig. 2, comprises the following steps:
step S201, acquiring an image set and generating a mask of the hand key points according to the real position labels of the hand key points in the image set. A hand key point is a visible key point of a hand that is not occluded, classified as either a fingertip point or a palm center point. The image set serves as the input and may contain multiple training images; the input scale is batchsize × H × W × C, where batchsize is the number of training images in the image set (the more training images, the larger the batchsize), and H, W, and C describe the size of an input image: H is its height, W its width, and C its number of channels. For example, an input scale of 32 × 480 × 640 × 1 means a batch of 32 training images of size 480 × 640 with one channel, indicating that the image set consists of single-channel depth images. In other embodiments, the number of channels may also be 3, representing RGB three-channel images, where R represents red, G green, and B blue. During model training, the method classifies the hand key points only as fingertip points and palm center points, and extracts the fingertip and palm center positions of interest through masks generated from the real position labels of the hand key points; the mask dimensions are h × w × 1, denoted tagmask.
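The shapes above can be made concrete in a short sketch (all array values are dummies, and the 60 × 80 mask resolution is only an illustrative feature-map scale; only the batchsize × H × W × C layout and the h × w × 1 mask shape come from the text):

```python
import numpy as np

# Input scale batchsize x H x W x C: a batch of 32 single-channel depth images.
depth_batch = np.zeros((32, 480, 640, 1), dtype=np.float32)
batchsize, H, W, C = depth_batch.shape

# tagmask extracting the fingertip / palm-center positions of interest:
# dimensions h x w x 1, foreground 1 at real key point positions, else 0.
tagmask = np.zeros((60, 80, 1), dtype=np.float32)
tagmask[2, 3, 0] = 1.0  # one real fingertip position set to foreground
```

For a three-channel RGB image set, the same layout would simply use C = 3.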
Step S202 is to analyze the key point identification information of the hand key point according to the feature map of the image set, and obtain the predicted position tag value of the hand key point according to the mask and the key point identification information.
Based on the image set, the output feature maps are obtained through the neural network; in this embodiment the output may be multiple or single. Taking multiple outputs as an example, with the target detection network YOLO, the scale of an output feature map is batchsize × h × w × c, where h and w are the spatial size of the output feature map and c is its number of channels. For example, with two output feature maps, one predicts the fingertip-point information and the other the palm-center-point information: the fingertip-point feature map has scale 32 × 60 × 80 × 7 and the palm-center-point feature map 32 × 30 × 40 × 7. A channel count of 7 means the feature map has 7 channels, numbered 1 to 7: channel 1 carries the confidence of the output feature map; channels 2 to 5 carry position information of the image, such as the height, width, and horizontal and vertical coordinates; channel 6 carries the classification (in this embodiment, a hand image); and channel 7 carries the tag label, used for tag prediction, which yields the key point identification information, denoted tagmap. The key point identification information of the fingertip points has dimension 32 × 60 × 80 × 1, and that of the palm center points 32 × 30 × 40 × 1.
In other embodiments, when the output is a single output and the positions of the palm center points and fingertip points must be predicted simultaneously in one feature map, the number of channels is increased: channel 6 classifies fingertip points, channel 7 classifies palm center points, and the added channel 8 is the tag label.
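The single-output channel layout just described can be sketched as follows (the 1-based channel numbers from the text map to 0-based array indices; the field names are illustrative assumptions):

```python
import numpy as np

def split_single_output(feature_map):
    """Split an 8-channel single-output map into its described fields.

    Channels (1-based, per the text): 1 confidence, 2-5 position
    information, 6 fingertip classification, 7 palm-center
    classification, 8 tag label.
    """
    return {
        "confidence": feature_map[..., 0],
        "position":   feature_map[..., 1:5],
        "fingertip":  feature_map[..., 5],
        "palm":       feature_map[..., 6],
        "tag":        feature_map[..., 7],
    }

fields = split_single_output(np.zeros((32, 60, 80, 8), dtype=np.float32))
```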
The detection method used in this embodiment is implemented in the YOLO framework; in other embodiments it may also be implemented in other detection frameworks such as SSD or Faster R-CNN.
Specifically, when the tagmask of the fingertip points is represented as 32 × 4 × 60 × 80 × 1 and the tagmask of the palm center points as 32 × 4 × 30 × 40 × 1, the tagmap obtained from the network forward pass is multiplied element-wise with the corresponding tagmask to obtain the predicted position tag values of the hand key points, denoted tags: the fingertip tags have dimension 32 × 4 × 60 × 80 × 1 and the palm center tags 32 × 4 × 30 × 40 × 1. Only the tag values at the real fingertip and palm center positions are retained; all other positions are set to 0. Here 4 is a preset hyperparameter.
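The masking step itself is a plain element-wise product; a minimal sketch (broadcasting the per-image tagmap across the per-hand axis of the tagmask is an assumption about how the 32 × 60 × 80 × 1 and 32 × 4 × 60 × 80 × 1 shapes are matched):

```python
import numpy as np

batch, max_hand_num, h, w = 32, 4, 60, 80
tagmap = np.random.rand(batch, h, w, 1).astype(np.float32)      # network forward pass
tagmask = np.zeros((batch, max_hand_num, h, w, 1), np.float32)  # real position mask
tagmask[0, 0, 10, 20, 0] = 1.0                                  # one real fingertip

# Element-wise product: only real fingertip / palm-center positions keep
# their predicted tag value; every other position becomes 0.
tags = tagmap[:, None] * tagmask
```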
Step S203, according to the predicted position label value and the real position label, performing loss function regression on the hand key point to obtain a hand key point detection model, and performing hand key point detection through the hand key point detection model. Wherein the loss function helps to optimize parameters of the neural network and to evaluate the neural network model.
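The text does not spell out the regression term; one hedged sketch, consistent with the summary's use of per-hand mean tag values (an associative-embedding-style "pull" term, not the patent's definitive loss), is:

```python
import numpy as np

def pull_loss(tags, tagmask):
    """Regress each hand's retained tag values toward their own mean.

    tags, tagmask: (max_hand_num, h, w, 1) arrays for one image; slices
    with an all-zero mask (no hand) are screened out.
    """
    loss, n_hands = 0.0, 0
    for hand_tags, hand_mask in zip(tags, tagmask):
        n = hand_mask.sum()
        if n == 0:
            continue  # no hand in this mask slice
        mean = (hand_tags * hand_mask).sum() / n
        loss += (((hand_tags - mean) * hand_mask) ** 2).sum() / n
        n_hands += 1
    return loss / max(n_hands, 1)
```

With identical tag values at a hand's key points the term vanishes, which is the behavior the regression drives toward.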
Through steps S201 to S203, the embodiment of the present application classifies the hand key points only as fingertip points and palm center points and does not need to sort or order the key points; compared with the related art, key point detection can be completed without identifying whether a finger is the thumb, the index finger, the middle finger, and so on.
In other embodiments, the key point detection algorithm provided by the present application has good extensibility and applicability: it supports pose estimation for any type of key points, including human-body key points as well as hand key points, so human-body key point estimation can also be performed with the provided method.
In some embodiments, obtaining the predicted position tag value of the hand key point comprises adjusting the predicted position label value with the absolute position information of the hand key point. In detecting human-body or hand key points, roughly similar shapes among the objects to be recognized make network learning very difficult; for example, in a depth-foreground hand data set the hand shape is often close to the background color, and sufficient information for tag discrimination cannot be extracted directly from the depth-foreground image. Explicitly introducing absolute position information effectively solves this problem. The absolute position information can be realized through a grid: grid information essentially provides the absolute position of each object to be recognized in the image and supplies auxiliary information that helps the network learn tag labels. Adding grid auxiliary information to the network input guides the network to learn clustering labels related to absolute position, reduces the difficulty of network learning, aids network convergence, and improves the stability of neural network training. Adding absolute position information also helps the neural network model obtain more distinctive tag values associated with the specific positions of the objects in the image.
The standard Associative Embedding (AE) algorithm assumes by default that the training data contains a large amount of multi-object data, that is, each image contains several objects to be recognized, so that the tag loss can be computed. In this embodiment, by computing the loss function over the whole batch and introducing grid information, tag training is completed on a single-target data set — every image in the data set contains only one object to be recognized — yet the trained model supports multi-target detection and classification at actual deployment, without specially acquiring or generating a corresponding multi-target data set.
In some embodiments, obtaining the absolute position information of the hand key points comprises deriving it from grid maps that provide grid point coordinates. FIG. 3 is a schematic diagram of the grid point coordinates in the x direction according to an embodiment of the present application: the value varies along each row, giving the x coordinate. FIG. 4 is the corresponding schematic diagram in the y direction: the value varies along each column, giving the y coordinate. Both maps have size H × W, consistent with the input scale of the image set. During hand key point detection, FIGS. 3 and 4 may be stacked with the input image and fed to the neural network as a single input for training, or fed to the network through additional input branches before the loss function is computed. In this embodiment, the absolute position information generated from the grid point coordinates is introduced to adjust the predicted position label values of the hand key points, which helps cluster the fingertip points and palm center points and improves the detection accuracy of the hand key points.
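The two coordinate maps of FIGS. 3 and 4 can be generated directly; a minimal sketch (normalizing the coordinates to [0, 1) is an added assumption — the text only requires that the values encode the x and y positions):

```python
import numpy as np

def make_grid_maps(H, W):
    # x-direction map (FIG. 3): the value varies along each row,
    # giving the x coordinate of every grid point.
    xs = np.tile(np.arange(W, dtype=np.float32) / W, (H, 1))
    # y-direction map (FIG. 4): the value varies along each column,
    # giving the y coordinate of every grid point.
    ys = np.tile(np.arange(H, dtype=np.float32)[:, None] / H, (1, W))
    return xs, ys

xs, ys = make_grid_maps(480, 640)  # same H x W as the input scale
```

The two maps can then be stacked with the input image as extra channels, or fed through a separate input branch.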
In some embodiments, obtaining the absolute position information of the hand key points further comprises deriving it from an offset map, with the predicted position tag value adjusted according to the product of the offset map and the predicted position tag value. FIG. 5 is an offset map according to an embodiment of the present application; its size equals that of the tagmap, h × w. In FIG. 5, Δ denotes the offset step, with Δ = 1/w, so the offsets are equidistantly distributed row by row. For example, for a tagmap of size 60 × 80, Δ = 1/80 = 0.0125, and the supervision information for the absolute positions of the key points is obtained by multiplying the tagmap element by element with the offset map. For instance, for two points to be identified A and B with label values a and b before the offset map, the label values become a × 0.0125 and b × 0.025 after the offset map, and the hand key point detection model drives a × 0.0125 and b × 0.025 to be as similar as possible when A and B belong to the same hand. In this embodiment, adjusting the predicted position tag values through the offset map optimizes the tags directly and realizes a position-aware loss function. Because the loss function is computed over the whole batch, and especially because the grid auxiliary information of the offset map is added, tag learning and training are supported even if every training image contains only one hand, and corresponding labels can still be provided for multiple hands at inference time.
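A sketch of the offset map and its application (the row-wise equidistant layout with step Δ = 1/w matches the 0.0125 example above, but the exact arrangement of offsets is an interpretation, not a quoted formula):

```python
import numpy as np

def make_offset_map(h, w):
    # Step delta = 1/w (0.0125 for w = 80); within each row the values
    # are equidistant (delta, 2*delta, ...), repeated row by row.
    delta = np.float32(1.0) / np.float32(w)
    row = np.arange(1, w + 1, dtype=np.float32) * delta
    return np.tile(row, (h, 1))

offset = make_offset_map(60, 80)          # same h x w as the tagmap
tagmap = np.full((60, 80), 2.0, dtype=np.float32)
position_aware_tags = tagmap * offset     # element-wise product
```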
In some embodiments, fig. 6 is a flowchart of another method for detecting hand keypoints according to an embodiment of the present application, where as shown in fig. 6, the flowchart includes the following further steps:
step S601, obtaining the palm center points and fingertip points of the image to be detected through the hand keypoint model. In the training process of the hand keypoint model in this embodiment, the hand keypoints in the training data are only divided into palm center points and fingertip points; the palm center points and fingertip points of the image to be detected are then obtained by a forward pass through the hand keypoint model.
Step S602, aggregating the palm center points and the fingertip points according to their label values, using the palm center point as a reference, wherein each fingertip point corresponds to the label value of a palm center point. The hand keypoint model returns label values (tags) for the palm center points and the fingertip points; taking a palm center point as a reference, the fingertip points whose tag values are similar to that palm center point's tag value are gathered to it, thereby binding fingertip points to palm center points. A fingertip point that has no corresponding palm center point is removed as a false detection.
Through the above steps S601 and S602, the fingertip points and the palm center points are aggregated according to their label values, completing the detection of the hand keypoints while improving both the detection rate and the accuracy of the detection result.
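Steps S601 and S602 can be sketched as a simple grouping in tag space; the function name `group_fingertips` and the tolerance `tag_tol` are illustrative assumptions (the patent only specifies grouping by tag-value similarity with the palm center point as reference):

```python
import numpy as np

def group_fingertips(palms, tips, tag_tol=0.05):
    """Assign each fingertip to the palm center point with the closest tag value.

    palms, tips: lists of (x, y, tag_value) predicted by the model.
    Fingertips with no palm tag within tag_tol are dropped as false detections.
    Returns a dict: palm index -> list of its fingertip points.
    """
    hands = {i: [] for i in range(len(palms))}
    for tip in tips:
        # distance in tag (embedding) space to every palm center point
        dists = [abs(tip[2] - p[2]) for p in palms]
        if not dists:
            continue
        best = int(np.argmin(dists))
        if dists[best] <= tag_tol:       # close enough -> same hand
            hands[best].append(tip)
        # otherwise: fingertip has no matching palm -> discarded
    return hands

palms = [(30, 40, 0.10), (70, 20, 0.90)]
tips = [(28, 10, 0.11), (75, 5, 0.88), (50, 50, 0.50)]
print(group_fingertips(palms, tips))   # the tip with tag 0.50 matches no palm
```

Taking the palm center point as the anchor makes the grouping robust: a spurious fingertip detection simply fails to find a palm within tolerance and is removed.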
In some embodiments, fig. 7 is a flowchart of a method of generating a hand keypoint mask according to an embodiment of the application, the method comprising the steps of:
step S701, determining mask parameters according to the number of hands in the image set, and distinguishing the foreground and the background of each hand according to the real position labels of the hand keypoints to obtain a hand mask. The number of hands in each training sample of the image set is counted in advance; the maximum number of hands is the mask parameter, a preset hyperparameter denoted max_hand_num. A hand mask is generated according to the real palm center position and fingertip positions of each hand in the training sample; the hand mask is denoted tagmask, the foreground is set to 1, the background is set to 0, and the dimension of the hand mask is h × w × 1.
Step S702 obtains a mask of the hand key points of the image set according to the mask parameter and the hand mask.
After the hand masks are obtained, all the hand masks in each training image are combined, giving a data dimension of max_hand_num × h × w × 1; when the number of hands in a training image is less than max_hand_num, the values in that image's surplus mask slots are all 0.
Combining all the hand masks in the image set yields the masks of the hand keypoints of the image set, with data dimension batchsize × max_hand_num × h × w × 1. For example, when the batchsize is 32 and max_hand_num is 4, the output dimension for the fingertip points is 32 × 4 × 60 × 80 × 1, and the output dimension for the palm center points is 32 × 4 × 30 × 40 × 1.
Through steps S701 and S702, the hand keypoints are classified into fingertip points and palm center points to form the mask of the hand keypoints, which solves the problems of detecting an indefinite number of unordered hand keypoints and improves the adaptability of the hand keypoint detection model.
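The mask construction of steps S701 and S702 might look like the following sketch; `build_keypoint_masks` and `build_batch_masks` are hypothetical names, and the zero-filling of surplus hand slots follows the description above:

```python
import numpy as np

def build_keypoint_masks(hand_masks, max_hand_num):
    """Stack per-hand foreground masks (each h x w, foreground=1, background=0)
    into a fixed-size tensor of shape (max_hand_num, h, w, 1); slots beyond
    the actual hand count stay all-zero."""
    h, w = hand_masks[0].shape if hand_masks else (60, 80)
    out = np.zeros((max_hand_num, h, w, 1), dtype=np.float32)
    for i, m in enumerate(hand_masks[:max_hand_num]):
        out[i, :, :, 0] = m
    return out

def build_batch_masks(batch_hand_masks, max_hand_num):
    """Combine per-image masks over the batch: (batchsize, max_hand_num, h, w, 1)."""
    return np.stack([build_keypoint_masks(ms, max_hand_num)
                     for ms in batch_hand_masks])

# toy example: batch of 2 images, at most 4 hands each
h, w = 60, 80
img0 = [np.ones((h, w)), np.ones((h, w))]   # two hands
img1 = [np.ones((h, w))]                    # one hand -> 3 empty slots
batch = build_batch_masks([img0, img1], max_hand_num=4)
print(batch.shape)                          # (2, 4, 60, 80, 1)
```

Fixing the hand axis to max_hand_num is what lets a batch mix images with different hand counts into one tensor.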
In some embodiments, fig. 8 is a flow chart of a method of regression of a loss function according to an embodiment of the present application, as shown in fig. 8, the method comprising the steps of:
step S801, screening the predicted position tag values according to the palm center information in the mask, retaining only image data that includes a hand.
Before the screening, dimension transformation needs to be performed on the predicted position tag values and the mask. The dimension transformation proceeds as follows: the dimensions of the predicted position tags and the mask tagmask are both transformed into (batchsize × max_hand_num, −1), where −1 compresses the h and w axes into one. For example, the dimension of the fingertip points after the transformation is (128, 4800), where 128 is the result of the transformation when the batchsize is 32 and max_hand_num is 4, and 4800 is the product of h and w when h is 60 and w is 80.
Correspondingly, the dimension of the palm center points after the transformation is (128, 1200); tags corresponds to the tag data of the palm center points, tagmask to the mask of the palm center points, and the 128 rows represent all hand slots of the current batch, including null data whose values are all 0. Because each hand has only one palm center point, only one of the 1200 values of tagmask is 1 and the others are all 0, and correspondingly only one of the 1200 values of tags carries a tag value while the others are all 0.
Through the mask of the palm center points, the real and valid hand data are screened out of the 128 hand slots according to whether a palm center point exists, and the null data are removed.
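The dimension transformation and palm-based screening of step S801 can be illustrated as below; the toy data and variable names are assumptions, while the shape (128, 4800) for the fingertip tags follows the example in the text:

```python
import numpy as np

batchsize, max_hand_num, h, w = 32, 4, 60, 80

tags = np.zeros((batchsize, max_hand_num, h, w))     # predicted tag values
tagmask = np.zeros((batchsize, max_hand_num, h, w))  # keypoint masks

# mark one keypoint for the first hand of the first image (toy data)
tagmask[0, 0, 10, 20] = 1
tags[0, 0, 10, 20] = 0.37

# flatten to (batchsize * max_hand_num, -1): h and w compress into one axis
tags_flat = tags.reshape(batchsize * max_hand_num, -1)   # (128, 4800)
mask_flat = tagmask.reshape(batchsize * max_hand_num, -1)

# keep only rows (hand slots) that actually contain a keypoint
valid = mask_flat.sum(axis=1) > 0
real_tags = tags_flat[valid]                             # null slots removed
print(tags_flat.shape, int(valid.sum()))
```

The boolean row filter is the screening step: slots whose mask rows sum to zero are the all-zero null data and drop out before the loss is computed.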
Step S802, performing loss function regression on the image data according to the average value of the predicted position tag values of the hand and the average value of the keypoint tag values. Wherein the loss function can be obtained from the following equations 1 and 2:

$$\bar{h}_n = \frac{1}{K}\sum_{k=1}^{K} h_k(x_{nk}) \qquad (1)$$

$$L_g(h,T) = \frac{1}{N}\sum_{n}\sum_{k}\left(\bar{h}_n - h_k(x_{nk})\right)^2 + \frac{1}{N^2}\sum_{n}\sum_{n'}\exp\left\{-\frac{1}{2\sigma^2}\left(\bar{h}_n - \bar{h}_{n'}\right)^2\right\} \qquad (2)$$

In equations 1 and 2, $n$ denotes the nth person (hand), $k$ denotes the kth joint, $x_{nk}$ denotes the pixel position of the keypoint, $h_k(x_{nk})$ denotes its label value (tag), and $\bar{h}_n$ denotes the mean of the predicted position tag values over all K keypoints of the nth person. $L_g(h,T)$ is the loss function of the tag part: the first term drives the tag values of all joint points within each person as close together as possible, and the second term pushes the average tag values of different persons apart.
Through the above steps S801 and S802, the loss function is regressed to improve the robustness of the hand key point detection model.
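Equations 1 and 2 correspond to an associative-embedding-style grouping loss (cf. the Newell et al. reference cited by this patent). The sketch below is one plausible NumPy rendering under the assumption σ = 1 and including the diagonal n = n′ terms of the push sum; it is not the patent's exact implementation:

```python
import numpy as np

def grouping_loss(tag_values, sigma=1.0):
    """Grouping loss over tag embeddings (sketch of equations 1 and 2).

    tag_values: list over hands; entry n is an array of the K predicted tag
    values h_k(x_nk) sampled at that hand's ground-truth keypoint locations.
    """
    means = np.array([t.mean() for t in tag_values])   # equation 1: mean tags
    n = len(tag_values)
    # pull term: tags of one hand toward that hand's mean tag
    pull = sum(((t - m) ** 2).sum() for t, m in zip(tag_values, means)) / n
    # push term: mean tags of different hands away from each other
    diff = means[:, None] - means[None, :]
    push = np.exp(-diff ** 2 / (2 * sigma ** 2)).sum() / (n ** 2)
    return pull + push

hand_a = np.array([0.10, 0.12, 0.11])   # tags of one hand's keypoints
hand_b = np.array([0.90, 0.88, 0.91])   # well-separated second hand
print(grouping_loss([hand_a, hand_b]))
```

Tight per-hand tag clusters keep the pull term near zero, while well-separated cluster means shrink the push term, so the loss rewards exactly the grouping behavior that steps S601 and S602 later rely on.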
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The present embodiment further provides a device for detecting hand keypoints. The device is used to implement the foregoing embodiments and preferred embodiments, and details already described are not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may implement a combination of software and/or hardware with a predetermined function. Although the devices described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
In some embodiments, fig. 9 is a block diagram of a device for detecting hand keypoints according to an embodiment of the present application, and as shown in fig. 9, the device includes a generation module 91, a prediction module 92, and a regression module 93:
the generating module 91 is configured to acquire an image set, and generate a mask of a hand key point according to a real position label of the hand key point in the image set, where the hand key point is classified into a fingertip point and a palm center point.
And a prediction module 92, configured to analyze the key point identification information of the hand key point according to the feature map of the image set, and obtain a predicted position tag value of the hand key point according to the mask and the key point identification information.
And a regression module 93, configured to perform loss function regression on the hand key points according to the predicted position tag value and the real position tag to obtain a hand key point detection model, and perform hand key point detection through the hand key point detection model.
In the device for detecting hand keypoints, the generation module 91 only classifies the hand keypoints into fingertip points and palm center points during training, without ordering or individually labeling the keypoints. On this basis, the hand keypoints are predicted by the prediction module 92 and the regression module 93. Compared with the related art, in which keypoint detection can only be completed when information such as thumb, index finger, and middle finger is given, this method can detect keypoints that are unordered and indefinite in number, which solves the problem of the high cost of pose estimation methods in the related art that depend on sequentially ordered keypoints. By aggregating the fingertip points and the palm center points under unordered keypoints of indefinite number, the scene adaptability of hand keypoint detection is improved and its cost is reduced.
Fig. 10 is a block diagram of another hand keypoint detection device according to an embodiment of the present application. As shown in fig. 10, the device includes all the modules shown in fig. 9 and further includes an adjusting unit 1001: the adjusting unit 1001 is configured to adjust the predicted position tag value according to the absolute position information of the hand keypoints. In this embodiment, the adjusting unit 1001 adds grid information about absolute position to help the model obtain more distinctive tag values. The grid information may be an xy grid map added alongside the image at the network input, or an offset grid transformation introduced after the tag values are computed. Because the computation of the loss function is optimized to take place within a batch, and particularly after the grid information is added, tag learning and training are supported even if the images of all training data contain only one hand, and corresponding labels can still be provided for multiple hands at inference time.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of hand keypoint detection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In an embodiment, fig. 11 is a schematic internal structural diagram of a computer device according to an embodiment of the present application, and as shown in fig. 11, a computer device is provided, where the computer device may be a server, and its internal structural diagram may be as shown in fig. 11. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of hand keypoint detection.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the method for detecting hand keypoints provided by the above embodiments.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the steps in the method of hand keypoint detection provided by the various embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (11)

1. A method of hand keypoint detection, the method comprising:
acquiring an image set, and generating a mask of the key points of the hand according to the real position labels of the key points of the hand in the image set, wherein the key points of the hand are classified into a fingertip point and a palm center point;
analyzing key point identification information of the hand key points according to the feature map of the image set, and obtaining predicted position label values of the hand key points according to the mask and the key point identification information;
and performing loss function regression on the hand key points according to the predicted position label value and the real position label to obtain a hand key point detection model, and performing hand key point detection through the hand key point detection model.
2. A method of hand keypoint detection as claimed in claim 1, wherein said deriving predicted position label values for said hand keypoints comprises:
and adjusting the predicted position label value according to the absolute position information of the key point of the hand.
3. A method of hand keypoint detection as claimed in claim 2, wherein obtaining absolute position information of said hand keypoints comprises:
the absolute position information is derived from a grid map, wherein the grid map provides grid point coordinates.
4. A method of hand keypoint detection as claimed in claim 2, wherein obtaining absolute position information of said hand keypoints comprises:
and obtaining the absolute position information from an offset map, wherein the predicted position tag value is adjusted according to the product of the offset map and the predicted position tag value.
5. A method of hand keypoint detection according to claim 1, wherein said hand keypoint detection by said hand keypoint detection model comprises:
acquiring palm center points and fingertip points of the image to be detected through the hand key point model;
and aggregating the palm center point and the fingertip point according to the label values of the palm center point and the fingertip point by taking the palm center point as a reference, wherein the palm center point corresponds to the label value of the fingertip point.
6. A method of hand keypoint detection as claimed in claim 1, wherein said generating a mask of hand keypoints from their true position labels in said set of images comprises:
determining mask parameters according to the number of hands in the image set, and distinguishing the foreground and the background of each hand according to the real position label of the key point of the hand to obtain a hand mask;
and obtaining the mask of the key points of the hand of the image set according to the mask parameter and the hand mask.
7. A method of hand keypoint detection as claimed in claim 1, wherein said performing a loss function regression on said hand keypoints comprises:
screening the predicted position tag value according to the palm center point information in the mask, and reserving image data including hands;
and performing loss function regression on the image data according to the average value of the predicted position label values of the hand and the average value of the key point label values.
8. A device for hand keypoint detection, the device comprising a generation module, a prediction module and a regression module:
the generating module is used for acquiring an image set, and generating a mask of the key points of the hand according to the real position labels of the key points of the hand in the image set, wherein the key points of the hand are classified into a fingertip point and a palm center point;
the prediction module is used for analyzing the key point identification information of the hand key point according to the feature map of the image set and obtaining a predicted position label value of the hand key point according to the mask and the key point identification information;
and the regression module is used for performing loss function regression on the hand key points according to the predicted position label value and the real position label to obtain a hand key point detection model, and performing hand key point detection through the hand key point detection model.
9. A device for hand keypoint detection according to claim 8, wherein said prediction module comprises an adjustment unit:
the adjusting unit is used for adjusting the predicted position label value according to the absolute position information of the hand key point.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010211811.1A 2020-03-24 2020-03-24 Method and device for detecting key points of hands and computer device Active CN111523387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010211811.1A CN111523387B (en) 2020-03-24 2020-03-24 Method and device for detecting key points of hands and computer device


Publications (2)

Publication Number Publication Date
CN111523387A true CN111523387A (en) 2020-08-11
CN111523387B CN111523387B (en) 2024-04-19

Family

ID=71901047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010211811.1A Active CN111523387B (en) 2020-03-24 2020-03-24 Method and device for detecting key points of hands and computer device

Country Status (1)

Country Link
CN (1) CN111523387B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446302A (en) * 2020-11-05 2021-03-05 杭州易现先进科技有限公司 Human body posture detection method and system, electronic equipment and storage medium
CN113393563A (en) * 2021-05-26 2021-09-14 杭州易现先进科技有限公司 Method, system, electronic device and storage medium for automatically labeling key points
CN114445716A (en) * 2022-04-07 2022-05-06 腾讯科技(深圳)有限公司 Key point detection method, key point detection device, computer device, medium, and program product

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120068917A1 (en) * 2010-09-17 2012-03-22 Sony Corporation System and method for dynamic gesture recognition using geometric classification
CN103544469A (en) * 2013-07-24 2014-01-29 Tcl集团股份有限公司 Fingertip detection method and device based on palm ranging
CN108875579A (en) * 2018-05-15 2018-11-23 厦门大学 One kind being based on morphologic close shot gesture identification method
CN109271842A (en) * 2018-07-26 2019-01-25 北京纵目安驰智能科技有限公司 A kind of generic object detection method, system, terminal and storage medium returned based on key point
CN109359568A (en) * 2018-09-30 2019-02-19 南京理工大学 A kind of human body critical point detection method based on figure convolutional network
CN109858444A (en) * 2019-01-31 2019-06-07 北京字节跳动网络技术有限公司 The training method and device of human body critical point detection model
CN110084161A (en) * 2019-04-17 2019-08-02 中山大学 A kind of rapid detection method and system of skeleton key point
CN110189340A (en) * 2019-06-03 2019-08-30 北京达佳互联信息技术有限公司 Image partition method, device, electronic equipment and storage medium
CN110688894A (en) * 2019-08-22 2020-01-14 平安科技(深圳)有限公司 Palm key point extraction method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEJANDRO NEWELL等: "Associative Embedding-End-to-End Learning for Joint Detection and Grouping" *
XINLONG WANG等: "SOLO:Segmenting Objects by Locations" *
LIU Wei; DAI Shiming; YANG Wenji; YANG Hongyun; QIAN Wenbin: "Three-dimensional hand gesture estimation from color images based on cascaded convolutional neural networks", Journal of Chinese Computer Systems, no. 03 *


Also Published As

Publication number Publication date
CN111523387B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
US11928800B2 (en) Image coordinate system transformation method and apparatus, device, and storage medium
CN107679448B (en) Eyeball action-analysing method, device and storage medium
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
CN106611427B (en) Saliency detection method based on candidate region fusion
CN110246181B (en) Anchor point-based attitude estimation model training method, attitude estimation method and system
CN109657533A (en) Pedestrian recognition methods and Related product again
CN107862292A (en) Personage's mood analysis method, device and storage medium
CN111523387A (en) Method and device for detecting hand key points and computer device
WO2019033571A1 (en) Facial feature point detection method, apparatus and storage medium
CN109299658B (en) Face detection method, face image rendering device and storage medium
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN110163864B (en) Image segmentation method and device, computer equipment and storage medium
CN109285105A (en) Method of detecting watermarks, device, computer equipment and storage medium
CN111126339A (en) Gesture recognition method and device, computer equipment and storage medium
US20230334893A1 (en) Method for optimizing human body posture recognition model, device and computer-readable storage medium
CN112001932A (en) Face recognition method and device, computer equipment and storage medium
CN111368751A (en) Image processing method, image processing device, storage medium and electronic equipment
WO2019033567A1 (en) Method for capturing eyeball movement, device and storage medium
CN110598559A (en) Method and device for detecting motion direction, computer equipment and storage medium
CN113705297A (en) Training method and device for detection model, computer equipment and storage medium
CN115223239B (en) Gesture recognition method, gesture recognition system, computer equipment and readable storage medium
CN110633712A (en) Method, system and device for recognizing vehicle body color and computer readable medium
KR20220098312A (en) Method, apparatus, device and recording medium for detecting related objects in an image
CN113557546B (en) Method, device, equipment and storage medium for detecting associated objects in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant