CN112233161A - Hand image depth determination method and device, electronic equipment and storage medium

Hand image depth determination method and device, electronic equipment and storage medium

Info

Publication number: CN112233161A
Authority: CN (China)
Application number: CN202011102705.6A
Other versions: CN112233161B (en)
Other languages: Chinese (zh)
Inventor: Dong Yajiao (董亚娇)
Current and original assignee: Beijing Dajia Internet Information Technology Co Ltd
Prior art keywords: hand, depth, key point, image, detection network
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority: CN202011102705.6A
Publication of application: CN112233161A
Application granted; publication of grant: CN112233161B
Legal status: Granted, Active

Classifications

    • G06T 7/50: Image analysis; depth or shape recovery
    • G06F 18/214: Pattern recognition; design or setup of recognition systems; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06T 7/90: Image analysis; determination of colour characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a hand image depth determination method and device, an electronic device, and a storage medium. When depth analysis is performed on the hand region of an image to be processed containing a hand region, the image is first input into a key point detection network to detect the key points of the hand region. Key point features of the hand region are obtained by the processing of a preset feature layer of the key point detection network, and this preset feature layer is connected to a depth detection network, so that the key point features of the hand region are input into the depth detection network for depth detection to obtain the depth information of the hand region. Because the hand region in the image is captured with the aid of the hand key point features, interference from the image background is avoided, and the accuracy of the depth analysis result of the hand region is improved.

Description

Hand image depth determination method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for determining a depth of a hand image, an electronic device, and a storage medium.
Background
In the related art, hand image depth estimation is implemented mainly by training a depth prediction network with hand image data labeled with depth. Because depth labels for real hands are noisy, the depth prediction network is usually trained with virtual hand images labeled with hand depth data. However, the background of a virtual hand image is usually a single color, so such depth prediction networks are easily disturbed by the complex backgrounds encountered in practical applications; for example, a background whose color is close to that of the hand may be predicted as the hand, resulting in low accuracy of the hand depth prediction result.
Disclosure of Invention
The present disclosure provides a method and an apparatus for determining depth of a hand image, an electronic device, and a storage medium, so as to at least solve the problem of low accuracy of a depth analysis result of a hand image in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a hand image depth determination method, including:
acquiring an image to be processed containing a hand area;
inputting the image to be processed into a key point detection network, and detecting key points of the hand area;
connecting a depth detection network to a preset feature layer of the key point detection network, so that key point features of the hand region obtained by processing of the preset feature layer are input into the depth detection network for depth detection to obtain the depth information of the hand region.
In a possible implementation manner of the first aspect, the step of connecting a depth detection network to a preset feature layer of the key point detection network includes:
the key point detection network comprises a plurality of hourglass network structures which are connected in sequence, and a feature layer with the smallest size in the last hourglass network structure in the hourglass network structures is connected with the depth detection network.
In another possible implementation manner of the first aspect, the step of inputting the key point features of the hand region obtained by processing of the preset feature layer into the depth detection network for depth detection to obtain the depth information of the hand region includes:
inputting the key point characteristics of the hand area into the depth detection network for depth detection to obtain the depth value of each pixel point of the hand area;
when the depth value of the pixel point is greater than or equal to a preset depth threshold value, determining that the pixel point belongs to a background area;
when the depth value of the pixel point is smaller than the preset depth threshold value, determining that the pixel point belongs to the hand area;
and obtaining the depth information of the hand area according to the depth values of all the pixel points belonging to the hand area.
In another possible implementation manner of the first aspect, the step of inputting the key point features of the hand region obtained by processing of the preset feature layer into the depth detection network for depth detection to obtain the depth information of the hand region includes:
inputting the key point characteristics of the hand area into the depth detection network for depth detection to obtain relative depth values between each pixel point contained in the hand area and a preset reference hand key point;
when the relative depth value between the pixel point and the reference hand key point is greater than or equal to a preset relative depth threshold value, determining that the pixel point belongs to a background area;
when the relative depth value between the pixel point and the reference hand key point is smaller than the preset relative depth threshold value, determining that the pixel point belongs to the hand area;
and obtaining the depth information of the hand area according to the depth values of all the pixel points belonging to the hand area.
In another possible implementation manner of the first aspect, the method further includes:
acquiring an absolute depth value corresponding to the reference hand key point;
and superposing the relative depth value between each pixel point contained in the hand area and the reference hand key point with the absolute depth value of the reference hand key point to obtain the absolute depth value of each pixel point contained in the hand area.
In another possible implementation manner of the first aspect, the training process of the keypoint detection network includes:
acquiring a key point training sample set containing hand key point annotation information, wherein the key point training sample set comprises a real hand sample image and a virtual hand sample image, the real hand sample image comprises two-dimensional hand key point annotation information, and the virtual hand sample image comprises three-dimensional hand key point annotation information;
inputting the training samples in the key point training sample set into an initial key point detection network to perform hand key point detection, and obtaining hand key point detection results of the training samples, wherein the hand key point detection results corresponding to the virtual hand sample images comprise three-dimensional position information of each hand key point, and the hand key point detection results corresponding to the real hand sample images comprise two-dimensional position information of each hand key point;
calculating to obtain the key point loss corresponding to the training sample according to the hand key point detection result corresponding to the training sample and the hand key point labeling information, and adjusting the network parameters of the initial key point detection network according to the key point loss until the key point loss meets the key point convergence condition to obtain the key point detection network.
In another possible implementation manner of the first aspect, the training process of the depth detection network includes:
acquiring a virtual hand image training sample set marked with relative depth marking information, wherein the relative depth marking information is the relative depth of each pixel point relative to a reference hand key point;
inputting training samples in the virtual hand image training sample set into a key point detection network obtained through training, and obtaining key point characteristics of a hand region of the training samples on a preset characteristic layer of the key point detection network;
inputting the key point characteristics of the hand region of the training sample into an initial depth detection network to obtain the relative depth detection result of each pixel point in the training sample relative to the reference hand key point;
and calculating to obtain the depth loss corresponding to the training sample according to the relative depth detection result corresponding to the training sample and the relative depth marking information, and adjusting the network parameters of the initial depth detection network according to the depth loss until the depth loss meets a depth convergence condition to obtain the depth detection network.
In yet another possible implementation of the first aspect, the reference hand keypoint is a palm root keypoint.
According to a second aspect of the embodiments of the present disclosure, there is provided a hand image depth determination apparatus comprising:
the image processing device comprises a to-be-processed image acquisition module, a to-be-processed image acquisition module and a processing module, wherein the to-be-processed image acquisition module is configured to acquire a to-be-processed image containing a hand area;
the key point detection module is configured to input the image to be processed into a key point detection network and detect key points of the hand area;
and the depth detection module is configured to connect a depth detection network with a preset feature layer of the key point detection network, so that the key point features of the hand region processed by the preset feature layer are input into the depth detection network for depth detection, and the depth information of the hand region is obtained.
In a possible implementation manner of the second aspect, the depth detection module is specifically configured to:
the key point detection network comprises a plurality of hourglass network structures which are connected in sequence, and a feature layer with the smallest size in the last hourglass network structure in the hourglass network structures is connected with the depth detection network.
In another possible implementation manner of the second aspect, the depth detection module includes:
the depth detection submodule is configured to input the key point features of the hand area into the depth detection network for depth detection to obtain the depth value of each pixel point of the hand area;
the background determining submodule is configured to determine that the pixel point belongs to a background area when the depth value of the pixel point is greater than or equal to a preset depth threshold;
a hand region determination submodule configured to determine that the pixel point belongs to the hand region when the depth value of the pixel point is smaller than the preset depth threshold;
a hand depth determination submodule configured to obtain depth information of the hand region from depth values of all pixel points belonging to the hand region.
In yet another possible implementation manner of the second aspect, the depth detection module is specifically configured to:
inputting the key point characteristics of the hand area into the depth detection network for depth detection to obtain relative depth values between each pixel point contained in the hand area and a preset reference hand key point;
when the relative depth value between the pixel point and the reference hand key point is greater than or equal to a preset relative depth threshold value, determining that the pixel point belongs to a background area;
when the relative depth value between the pixel point and the reference hand key point is smaller than the preset relative depth threshold value, determining that the pixel point belongs to the hand area;
and obtaining the depth information of the hand area according to the depth values of all the pixel points belonging to the hand area.
In yet another possible implementation manner of the second aspect, the apparatus further includes:
a reference point depth acquisition module configured to acquire absolute depth values corresponding to the reference hand key points;
and the hand area absolute depth acquisition module is configured to superimpose the relative depth value between each pixel point contained in the hand area and the reference hand key point with the absolute depth value of the reference hand key point to obtain the absolute depth value of each pixel point contained in the hand area.
In yet another possible implementation manner of the second aspect, the apparatus further includes:
a key point training sample acquisition module configured to acquire a key point training sample set including hand key point annotation information, the key point training sample set including a real hand sample image and a virtual hand sample image, wherein the real hand sample image includes two-dimensional hand key point annotation information, and the virtual hand sample image includes three-dimensional hand key point annotation information;
a sample key point detection module configured to input the training samples in the key point training sample set into an initial key point detection network to perform hand key point detection, so as to obtain hand key point detection results of the training samples, wherein the hand key point detection results corresponding to the virtual hand sample images include three-dimensional position information of each hand key point, and the hand key point detection results corresponding to the real hand sample images include two-dimensional position information of each hand key point;
and the key point detection network adjusting module is configured to calculate and obtain a key point loss corresponding to the training sample according to the hand key point detection result and the hand key point labeling information corresponding to the training sample, and adjust the network parameters of the initial key point detection network according to the key point loss until the key point loss meets a key point convergence condition, so as to obtain the key point detection network.
In yet another possible implementation manner of the second aspect, the apparatus includes:
the depth training sample acquisition module is configured to acquire a virtual hand image training sample set marked with relative depth marking information, wherein the relative depth marking information is the relative depth of each pixel point relative to a reference hand key point;
a sample depth detection module, configured to input training samples in the virtual hand image training sample set into a trained key point detection network, obtain key point features of a hand region of the training samples in a preset feature layer of the key point detection network, input the key point features of the hand region of the training samples into an initial depth detection network, and obtain a relative depth detection result of each pixel point in the training samples relative to the reference hand key point;
and the depth detection network adjusting module is configured to calculate and obtain a depth loss corresponding to the training sample according to the relative depth detection result and the relative depth marking information corresponding to the training sample, and adjust the network parameters of the initial depth detection network according to the depth loss until the depth loss meets a depth convergence condition, so as to obtain the depth detection network.
In yet another possible implementation of the second aspect, the reference hand keypoint is a palm root keypoint.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the hand image depth determination method of any of the first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the hand image depth determination method of any one of the first aspects.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product having instructions stored therein which, when executed by a processor of an electronic device, implement the hand image depth determination method of any one of the first aspects.
The technical scheme provided by the embodiments of the present disclosure brings at least the following beneficial effects. When depth analysis is performed on the hand region of an image to be processed containing a hand region, the image is first input into a key point detection network to detect the key points of the hand region. Key point features of the hand region are obtained by the processing of a preset feature layer of the key point detection network, and this preset feature layer is connected to a depth detection network, so that the key point features are input into the depth detection network for depth detection to obtain the depth information of the hand region. Because the scheme captures the hand region in the image with the aid of the hand key point features, interference from the image background is avoided, and the accuracy of the hand region depth analysis result is therefore improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of hand image depth determination in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating key points of a hand image in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a process of obtaining depth information for a hand image according to an exemplary embodiment;
FIG. 4 is a flow diagram illustrating another method of hand image depth determination in accordance with an exemplary embodiment;
FIG. 5 is a network architecture diagram illustrating a keypoint detection network and a depth detection network, according to an example embodiment;
FIG. 6 is a flow diagram illustrating a key point detection network training process in accordance with an exemplary embodiment;
FIG. 7 is a flow diagram illustrating a depth detection network training process in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating a hand image depth determination apparatus in accordance with an exemplary embodiment;
FIG. 9 is a block diagram illustrating another hand image depth determination device in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating another hand image depth determination device in accordance with an exemplary embodiment;
FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flow diagram illustrating a method for hand image depth determination for use in a device having computing capabilities, such as a PC, server, or mobile smart terminal, according to an example embodiment. As shown in fig. 1, the method may include the following steps.
In S110, an image to be processed including a hand region is acquired.
In one possible implementation, the image to be processed may be an image including a hand region captured by the smart terminal. In another possible implementation, the image to be processed may also be a locally stored image containing a hand region.
In S120, the image to be processed is input to the key point detection network to detect the key points of the hand region.
Before the image to be processed is subjected to hand region depth analysis, hand key points of a hand region in the image to be processed are analyzed to obtain the hand region key point characteristics.
In a possible implementation manner, the key point detection network may be used to analyze key point information of a hand region included in the image to be processed, so as to obtain a feature of the hand key point. The key point detection network can adopt a deep convolutional neural network.
The key point detection network can detect the position of the hand region in the image and locate each key point of the hand region. As shown in fig. 2, the hand key points mainly comprise 21 main bone nodes, such as the fingertips and the joints of each phalanx.
In S130, a depth detection network is connected to the preset feature layer of the key point detection network, so that the key point features of the hand region detected by the preset feature layer are input into the depth detection network for depth detection, and the depth information of the hand region is obtained.
The key point detection network comprises a plurality of feature layers, which extract features from the image to be processed layer by layer and finally determine the position of each key point contained in the hand region. The key point features of the hand region are obtained at a preset feature layer; these features cover most of the pixel points of the hand region.
In one possible implementation of the present disclosure, the key point detection network employs a neural network comprising a plurality of hourglass network structures connected in sequence; accordingly, the preset feature layer may be the feature layer with the smallest size in the last hourglass network structure. For example, if the key point detection network comprises two hourglass network structures, the smallest-sized feature layer in the second hourglass network structure is the preset feature layer described above.
And connecting the preset feature layer with a depth detection network so as to input the key point features of the hand region detected by the preset feature layer into the depth detection network for depth detection, and finally obtaining the depth information of the hand region.
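To make this structure concrete, the following is a minimal sketch of a key point detection network with two hourglass structures whose last bottleneck (the preset feature layer) feeds a depth detection branch. It assumes PyTorch; the module names, channel width, and layer depths are illustrative assumptions, not the concrete network of this disclosure.

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """A minimal one-level hourglass: downsample to the smallest
    feature map (the bottleneck), then upsample back."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        bottleneck = self.down(x)   # smallest-size feature layer
        return self.up(bottleneck), bottleneck

class HandDepthModel(nn.Module):
    def __init__(self, ch=64, num_keypoints=21):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, ch, 7, stride=2, padding=3), nn.ReLU())
        self.hg1 = Hourglass(ch)
        self.hg2 = Hourglass(ch)  # last hourglass: its bottleneck is the preset feature layer
        self.keypoint_head = nn.Conv2d(ch, num_keypoints, 1)  # per-keypoint heatmaps
        self.depth_head = nn.Sequential(                      # depth detection network
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1))                              # per-pixel (relative) depth

    def forward(self, x):
        f, _ = self.hg1(self.stem(x))
        f, preset = self.hg2(f)          # 'preset' = smallest feature layer of the last hourglass
        heatmaps = self.keypoint_head(f)
        depth = self.depth_head(preset)  # depth branch consumes the preset feature layer
        return heatmaps, depth

# Usage sketch: heatmaps drive key point detection; depth is predicted
# from the hand region key point features of the preset feature layer.
model = HandDepthModel()
heatmaps, depth = model(torch.rand(1, 3, 64, 64))
```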
The technical scheme provided by the embodiment at least brings the following beneficial effects: when the depth analysis is carried out on the hand region in the image, the hand region key point feature in the image is detected by firstly carrying out the hand key point detection on the hand region in the image. And then carrying out depth analysis on the key point characteristics of the hand region to obtain the depth information of the hand region. The scheme combines the characteristics of the key points of the hands to capture the hand area in the image, thereby avoiding the interference generated by the background in the image and improving the accuracy of the depth analysis result of the hand area.
In one possible implementation, as shown in fig. 3, the background and the hand region may be further distinguished by using depth information of pixel points in the image, where the distinguishing process may include:
in S131, the hand region key point features are input into the depth detection network for depth detection, so as to obtain the depth value of each pixel point in the hand region.
The depth of the hand region in the image containing the hand is usually small, and the depth of the background region is large, so that the background region and the hand region can be further distinguished by setting a depth threshold value, and finally the depth information of each pixel point belonging to the hand region is obtained.
The preset depth threshold may be set according to actual service requirements, which is not limited by the present disclosure.
In S132, if the depth of the pixel is greater than or equal to the preset depth threshold, it is determined that the pixel belongs to the background region.
In S133, if the depth of the pixel is smaller than the preset depth threshold, it is determined that the pixel belongs to the hand region.
In S134, depth information of the hand region is obtained from the depth values of all the pixels belonging to the hand region.
After each pixel point in the hand area is subjected to the judging process, all pixel points belonging to the hand area are obtained, and the depth information corresponding to the pixel points is the depth information of the hand area.
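As an illustration of S131 to S134, a minimal sketch of the thresholding logic follows, assuming NumPy; the array shapes and the threshold value 0.6 are placeholder assumptions.

```python
import numpy as np

def extract_hand_depth(depth_map, depth_threshold):
    """Split pixels into hand vs. background by the preset depth
    threshold: depth >= threshold means background, < threshold means hand."""
    hand_mask = depth_map < depth_threshold           # True where the pixel belongs to the hand region
    hand_depth = np.where(hand_mask, depth_map, 0.0)  # keep depth values only for hand pixels
    return hand_depth, hand_mask

# depth_map stands in for the per-pixel output of the depth detection network.
depth_map = np.random.rand(64, 64).astype(np.float32)  # placeholder network output
hand_depth, hand_mask = extract_hand_depth(depth_map, depth_threshold=0.6)
```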
In one possible implementation of the present disclosure, because the depth differences among the pixel points of the hand region are small, one hand key point (e.g., the palm root node) is selected as the reference hand key point in order to increase contrast. The depth detection network then outputs the relative depth value between each pixel point of the hand region and the reference hand key point. Correspondingly, if the relative depth value of a pixel point is greater than or equal to a preset relative depth threshold, the pixel point is determined to belong to the background region; if the relative depth value is smaller than the preset relative depth threshold, the pixel point is determined to belong to the hand region; and the relative depth of the hand region is obtained from the relative depth values of all pixel points belonging to the hand region.
In the hand image depth determination method provided in this embodiment, after the depth information of the hand region is obtained by combining the hand key point features, the depth value of each pixel point of the hand region is compared with a preset depth threshold to identify pixel points that actually belong to the background region. These background pixel points are then eliminated, leaving all pixel points belonging to the hand region, and the depth values of those pixel points are taken as the depth information of the hand region. By further screening out background pixel points with their depth values, the scheme further improves the accuracy of the hand region depth information.
In one possible implementation of the present disclosure, the depth detection network outputs relative depth values of each pixel point in the hand region and the reference hand keypoints. Therefore, in an application scenario in which the absolute depth value of the hand region needs to be used, the relative depth of each pixel point in the hand region needs to be converted into an absolute depth.
Fig. 4 is a flowchart illustrating another hand image depth determination method according to an exemplary embodiment, where the present embodiment further includes the following steps based on the embodiment illustrated in fig. 1.
In S140, absolute depth values of the reference hand keypoints are acquired.
In one possible implementation, the keypoint detection network is trained by using a part of training samples containing three-dimensional keypoint annotation information, so that the keypoint detection network obtains three-dimensional information (x, y, z) of the hand keypoints in the hand region, where x and y represent coordinates of the hand keypoints on the image plane, and z represents coordinates of the hand keypoints in the depth direction perpendicular to the image plane.
In one embodiment of the present disclosure, the reference hand keypoints may be palm root keypoints. For example, the hand key point numbered 0 in fig. 2 is the palm root key point.
Of course, in other embodiments of the present disclosure, other key points of the hand may also be determined as reference points, which is not described herein again.
In S150, the relative depth values between each pixel point of the hand region and the reference hand key point are superimposed with the absolute depth values of the reference hand key points to obtain the absolute depth values of each pixel point of the hand region.
In the hand image depth determination method provided in this embodiment, after the depth detection network detects the relative depth value between each pixel point of the hand region in the image to be processed and the reference hand key point, the absolute depth value of each pixel point of the hand region is restored from the absolute depth value of the reference hand key point and those relative depth values, so as to meet the requirements of different scenes.
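A minimal sketch of this relative-to-absolute conversion follows, assuming NumPy; the palm-root depth value is a placeholder assumption.

```python
import numpy as np

def to_absolute_depth(relative_depth, reference_absolute_depth):
    """Recover absolute depth by superimposing (adding) the reference
    hand key point's absolute depth onto each pixel's relative depth."""
    return relative_depth + reference_absolute_depth

# relative_depth: per-pixel output of the depth detection network,
# expressed relative to the palm root key point (key point 0 in fig. 2).
relative_depth = np.zeros((64, 64), dtype=np.float32)  # placeholder
palm_root_z = 0.85                                     # assumed absolute depth of the reference key point
absolute_depth = to_absolute_depth(relative_depth, palm_root_z)
```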
Fig. 5 is a network structure diagram illustrating a key point detection network and a deep detection network according to an exemplary embodiment.
As shown in fig. 5, the present embodiment is described by taking an example that the key point detection network adopts a neural network with a double-hourglass network structure, that is, the neural network with the double-hourglass network structure includes two hourglass network structures, where the size of the feature layer in each hourglass network structure is first reduced and then increased.
In this embodiment, the depth detection network is connected to the feature layer with the smallest size in the second hourglass of the double-hourglass network structure (i.e., the preset feature layer). This structure enables the depth detection network to directly utilize the hand region key point features extracted at that smallest feature layer.
In other embodiments of the present disclosure, the key point detection network may also be implemented by using other neural networks, and then the feature layer capable of obtaining most of the pixel point features in the hand region is determined as the preset feature layer according to the specific network structure of the neural network, which is not described herein any more.
Because the depth detection network is connected to a preset feature layer of the key point detection network, the features extracted by the preset feature layer are transmitted directly to the depth detection network. The accuracy of the key point detection network therefore directly affects the accuracy of the downstream depth detection network and must be ensured. To improve the accuracy of the key point detection network, real hand images and virtual hand images can be used for joint training.
When the depth detection network is trained, because the depth annotation data of the real hand image has large noise, the depth detection network is usually trained by using the virtual hand image with accurate depth annotation data.
In addition, the key point detection network needs to be trained jointly by using the real hand image and the virtual hand image, while the depth detection network needs to be trained only by using the virtual hand image, so that the two networks need to be trained independently.
The function of the key point detection network is to determine the position of the hand region from the predicted positions of the hand key points in the image, so that the depth detection network can accurately separate the hand region from the background region when performing depth prediction; that is, the accuracy of the key point detection network affects the accuracy of the depth detection network. Therefore, the key point detection network is trained first, and the depth detection network is trained after the key point detection network has been trained.
In addition, in the process of training the depth detection network, if the network parameters of the key point detection network are not fixed, back-propagation of the depth loss will change the parameters of the key point detection network and affect the accuracy of its key point prediction results. Therefore, when the depth detection network is trained, the network parameters of the key point detection network must be fixed, and only the network parameters of the depth detection network are adjusted.
FIG. 6 is a flowchart illustrating a key point detection network training process, which may include the following steps, as shown in FIG. 6, according to an example embodiment.
In S210, a key point training sample set including hand key point annotation information is obtained.
In one possible implementation, the keypoint training sample set comprises real hand sample images and virtual hand sample images.
Unlike a real hand image, a virtual hand image is produced by first modeling a specific hand shape and then rendering it with preset information such as skin texture and background. The three-dimensional annotation information (x, y, z) of the hand key points, comprising the two-dimensional coordinates (x, y) of each key point in the image and its coordinate z in the depth direction, is generated automatically during rendering. However, owing to limitations of virtual data generation technology, both the background and the hand shape information of virtual hand images are limited.
The real hand image is the hand image shot by the real human hand and contains rich background and hand type information. However, it is difficult to obtain accurate three-dimensional annotation information of the hand key points from the real hand image, and only two-dimensional annotation information, i.e. two-dimensional coordinates (x, y) of the hand key points in the image plane, can be obtained.
Therefore, in order to make the prediction effect of the keypoint detection network better, the real hand image and the virtual hand image need to be combined to train the keypoint detection network, so that the keypoint detection network can learn the features of the key points of the hand well and can also learn the features of the hand different from the background.
In addition, since the key point annotation information in the virtual hand image is three-dimensional annotation information and includes information of the key point in the depth direction, the key point detection network can learn information of the hand key point in the depth direction, that is, can finally predict three-dimensional position information of the hand key point.
In S220, the training samples in the key point training sample set are input to the initial key point detection network for hand key point detection, so as to obtain a hand key point detection result of the training samples.
The keypoint detection network may adopt the neural network with the double-hourglass structure shown in fig. 5. The keypoint training samples are forward-propagated through the network to obtain the corresponding hand keypoint detection results: for a virtual hand sample image the detection result is the three-dimensional position information of the hand keypoints, and for a real hand sample image it is the two-dimensional position information of the hand keypoints.
In one possible implementation of the present disclosure, the key point convergence condition may be that the loss calculated using the loss function is no longer reduced.
In one possible implementation, the position information of the hand key points in the training samples can be converted into corresponding thermodynamic diagrams (heatmaps) so that the key point detection network can extract the spatial information of the hand key points from them.
A thermodynamic diagram is a common display mode in data visualization; information such as hot-spot distribution and regional aggregation is reflected intuitively by the degree of color change. In the thermodynamic diagram corresponding to a key point training sample, the brightness (gray value) of the area where a hand key point is located is higher than that of the other areas, so the key point regions appear brighter. Using the brightness or gray value of each pixel point, the key point detection network can easily judge whether a pixel belongs to the hand or the background.
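A minimal sketch of rendering one key point as such a thermodynamic diagram follows, assuming NumPy; the Gaussian rendering and its sigma are common choices assumed here rather than something the disclosure specifies.

```python
import numpy as np

def keypoint_heatmap(x, y, height, width, sigma=2.0):
    """Render one hand key point as a 2-D Gaussian thermodynamic diagram:
    pixels near (x, y) get high values, the background stays near zero."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

# One heatmap per key point; a 21-key-point hand yields a (21, H, W) target.
heatmaps = np.stack([keypoint_heatmap(kx, ky, 64, 64)
                     for kx, ky in [(10, 20), (30, 40)]])  # placeholder coordinates
```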
In S230, a keypoint loss is calculated according to the hand keypoint detection result and the hand keypoint annotation information corresponding to the training sample, and whether the keypoint loss satisfies a keypoint convergence condition is determined.
In a supervised machine learning network, the error between the predicted value and the true value of a single sample is called the loss; the smaller the loss, the better the network. The function used to calculate the loss is called the loss function, and it measures the quality of each prediction made by the network. Here, the loss function covers the loss between the hand key point detection result and the labeled hand key points, together with the gradient information of that loss.
In S240, if the keypoint loss does not satisfy the keypoint convergence condition, adjusting the network parameters of the initial keypoint detection network according to the keypoint loss, updating the keypoint loss according to the adjusted detection result of the keypoint detection network, and continuing to determine whether the updated keypoint loss satisfies the keypoint convergence condition.
The initial key point detection network is used to predict the hand key point information of each key point training sample, and whether the loss of the current prediction result (that is, the key point prediction error) satisfies the key point convergence condition is judged. If not, the network parameters are adjusted, the adjusted network continues to predict the hand key points of each training sample, and the loss of the new prediction result is checked against the key point convergence condition again. Once the condition is satisfied, the network training process ends and the current network parameters are determined as the final network parameters.
In S250, if the keypoint loss satisfies the keypoint convergence condition, the initial keypoint detection network is determined to be the keypoint detection network.
In a possible implementation manner of the present disclosure, the gradient descent method may be used to solve for the weight parameters of each layer of the key point detection network at which the loss function satisfies the key point convergence condition, thereby obtaining the final key point detection network.
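A minimal sketch of this mixed two-dimensional/three-dimensional supervision follows, assuming PyTorch; the MSE loss form, tensor shapes, and the virtual-sample mask are illustrative assumptions, since the disclosure only requires a key point loss minimized until convergence.

```python
import torch
import torch.nn.functional as F

def keypoint_loss(pred_heatmaps, pred_z, gt_heatmaps, gt_z, is_virtual):
    """All samples supervise the 2-D (x, y) heatmaps; only virtual
    samples, which carry 3-D labels, also supervise the z coordinate."""
    loss = F.mse_loss(pred_heatmaps, gt_heatmaps)
    if is_virtual.any():
        loss = loss + F.mse_loss(pred_z[is_virtual], gt_z[is_virtual])
    return loss

# Dummy batch: 4 samples, 21 key points, 64x64 heatmaps; the first two
# samples are virtual (3-D labels available), the last two are real.
B, K, H, W = 4, 21, 64, 64
pred_hm = torch.rand(B, K, H, W, requires_grad=True)
pred_z = torch.rand(B, K, requires_grad=True)
loss = keypoint_loss(pred_hm, pred_z,
                     torch.rand(B, K, H, W), torch.rand(B, K),
                     torch.tensor([True, True, False, False]))
loss.backward()  # in a real loop, an optimizer step would follow
```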
In the key point detection network training process provided by the embodiment, the real hand image and the virtual hand image are used for jointly training the key point detection network, the key point detection network can well learn the characteristics of the key points of the hands through the virtual hand image, and the real hand image can enable the key point detection network to learn the characteristics of the hand area different from the background area, so that the key point detection network obtained by the process training has a good key point detection effect.
FIG. 7 is a flowchart illustrating a depth detection network training process, which may include the following steps, as shown in FIG. 7, according to an example embodiment.
In S310, a virtual hand image training sample set labeled with relative depth labeling information is obtained.
The relative depth annotation information is the relative depth of each pixel point in the virtual hand image with respect to the reference hand key point. Because the depth differences among the pixel points of the hand region are small, a palm root key point (i.e., the key point labeled 0 in fig. 2) is selected as the reference hand key point in order to increase contrast. The absolute depth value of the reference hand key point is subtracted from the absolute depth values of the other pixel points of the hand region to obtain a relative depth map of the hand region, and the relative depth data are then normalized, for example to the interval [0, 1], which reduces the learning difficulty of the depth detection network and improves the network learning effect.
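A minimal sketch of generating such relative depth labels from a virtual image's ground-truth absolute depth map follows, assuming NumPy; the min-max normalization to [0, 1] follows the example in the text, and the epsilon guard is an added assumption.

```python
import numpy as np

def relative_depth_labels(abs_depth_map, palm_root_z):
    """Relative depth = absolute depth minus the palm root (reference)
    key point's absolute depth, then min-max normalized into [0, 1]."""
    rel = abs_depth_map - palm_root_z
    return (rel - rel.min()) / (rel.max() - rel.min() + 1e-8)

# abs_depth_map would come from the virtual renderer's ground truth.
abs_depth_map = np.random.rand(64, 64).astype(np.float32)  # placeholder
labels = relative_depth_labels(abs_depth_map, palm_root_z=0.5)
```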
In S320, the training samples in the virtual hand image training sample set are input into the trained keypoint detection network, and the hand region keypoint features of the training samples are obtained in the preset feature layer of the keypoint detection network.
In S330, the hand region key point features are input into the initial depth detection network, and a relative depth detection result of each pixel point in the training sample with respect to the reference hand key point is obtained.
In S340, according to the relative depth detection result and the relative depth labeling information corresponding to the training sample, a depth loss corresponding to the training sample is calculated, and network parameters of the initial depth detection network are adjusted according to the depth loss until the depth loss satisfies a depth convergence condition, so as to obtain the depth detection network.
The depth loss is calculated from the depth detection result produced by the depth detection network and the relative depth annotation information. If the depth loss does not satisfy the depth convergence condition, the gradient descent method is used to solve for the network parameters at which it does. This is an iterative process: at each iteration the depth loss is recalculated and checked against the convergence condition, and the final depth detection network is obtained once the depth loss satisfies the depth convergence condition.
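A minimal sketch of this second training stage follows, reusing the HandDepthModel class sketched earlier and assuming PyTorch; freezing is implemented by disabling gradients outside the depth head, and the MSE depth loss and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

model = HandDepthModel()  # key point branch assumed already trained

# Fix the key point network's parameters so depth-loss back-propagation
# cannot change them; only the depth head remains trainable.
for name, p in model.named_parameters():
    if not name.startswith('depth_head'):
        p.requires_grad = False

optimizer = torch.optim.Adam(model.depth_head.parameters(), lr=1e-4)

images = torch.rand(2, 3, 64, 64)        # dummy virtual-hand batch
gt_rel_depth = torch.rand(2, 1, 32, 32)  # dummy relative depth labels
_, pred_depth = model(images)
loss = F.mse_loss(pred_depth, gt_rel_depth)  # depth loss
optimizer.zero_grad()
loss.backward()                          # gradients reach the depth head only
optimizer.step()
```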
In the depth detection network training process provided by the embodiment, the virtual hand image marked with accurate depth information is used for training the depth detection network independently, so that the depth detection network can well learn the depth characteristics of the hand, and the accuracy of the network prediction result is improved.
Corresponding to the hand image depth determination method embodiment, the present disclosure also provides a hand image depth determination device embodiment.
Fig. 8 is a block diagram illustrating a hand image depth determination apparatus according to an exemplary embodiment, and referring to fig. 8, the apparatus includes a to-be-processed image acquisition module 110, a keypoint detection module 120, and a depth detection module 130.
A to-be-processed image acquisition module 110 configured to acquire an image to be processed containing a hand region.
A key point detection module 120 configured to input the image to be processed into a key point detection network to detect key points of the hand region.
And the depth detection module 130 is configured to connect the depth detection network to a preset feature layer of the key point detection network, so as to input the key point features of the hand region detected by the preset feature layer into the depth detection network for performing depth detection to obtain the depth information of the hand region.
In one possible implementation, the key point detection network comprises a plurality of hourglass network structures connected in sequence, and the smallest-sized feature layer in the last hourglass network structure in the plurality of hourglass network structures is connected with the depth detection network.
In one possible implementation, the depth detection module 130 includes:
and the depth detection submodule is configured to input the key point features of the hand area into the depth detection network for depth detection to obtain the depth value of each pixel point of the hand area.
And the background determining submodule is configured to determine that the pixel point belongs to the background area when the depth value of the pixel point is greater than or equal to a preset depth threshold value.
And the hand area determination submodule is configured to determine that the pixel point belongs to the hand area when the depth value of the pixel point is smaller than a preset depth threshold value.
And the hand depth determination submodule is configured to obtain the depth information of the hand region according to the depth values of all the pixel points belonging to the hand region.
In one possible implementation of the present disclosure, because the depth differences among the pixel points of the hand region are small, one hand key point (e.g., the palm root node) is selected as the reference hand key point in order to increase contrast. The depth detection network then outputs the relative depth value between each pixel point of the hand region and the reference hand key point. Correspondingly, if the relative depth value of a pixel point is greater than or equal to a preset relative depth threshold, the pixel point is determined to belong to the background region; if the relative depth value is smaller than the preset relative depth threshold, the pixel point is determined to belong to the hand region; and the relative depth of the hand region is obtained from the relative depth values of all pixel points belonging to the hand region.
When depth analysis is performed on a hand region in an image, a hand image depth determination device provided in this embodiment first performs hand key point detection on the hand region in the image to obtain a feature of a key point of the hand region in the image. And then carrying out depth analysis on the key point characteristics of the hand region to obtain the depth information of the hand region. The scheme combines the characteristics of the key points of the hands to capture the hand area in the image, thereby avoiding the interference generated by the background in the image and improving the accuracy of the depth analysis result of the hand area.
Fig. 9 is a block diagram of another hand image depth determination apparatus according to an exemplary embodiment, and referring to fig. 9, the apparatus further includes a reference point depth acquisition module 210 and a hand region absolute depth acquisition module 220 based on the embodiment shown in fig. 8.
In another possible implementation, the depth detection module 130 is specifically configured to: and inputting the key point characteristics of the hand area into a depth detection network for depth detection to obtain the relative depth value between each pixel point contained in the hand area and a preset reference hand key point.
A reference point depth obtaining module 210 configured to obtain absolute depth values corresponding to the reference hand key points.
In one possible implementation, the reference hand keypoints may be palm root keypoints.
The hand region absolute depth acquiring module 220 is configured to superimpose the relative depth value between each pixel point included in the hand region and the reference hand key point with the absolute depth value of the reference hand key point to obtain the absolute depth value of each pixel point included in the hand region.
The hand image depth determining device provided in this embodiment utilizes a depth detection network to detect and obtain relative depth values between each pixel point of a hand region in an image to be processed and a reference hand key point, and then restores and obtains the absolute depth value of each pixel point of the hand region according to the absolute depth value of the reference hand key point and the relative depth value of each pixel point of the hand region, so as to meet the requirements of different scenes.
Fig. 10 is a block diagram of another hand image depth determination apparatus according to an exemplary embodiment, and referring to fig. 10, the apparatus further includes, on the basis of the embodiment shown in fig. 8: a key point training sample obtaining module 310, a sample key point detecting module 320, a key point detecting network adjusting module 330, a deep training sample obtaining module 340, a sample deep detecting module 350, and a deep detecting network adjusting module 360.
A key point training sample acquisition module 310 configured to acquire a key point training sample set including hand key point annotation information, the key point training sample set including a real hand sample image and a virtual hand sample image, wherein the real hand sample image includes two-dimensional hand key point annotation information, and the virtual hand sample image includes three-dimensional hand key point annotation information;
the sample key point detection module 320 is configured to input the training samples in the key point training sample set into the initial key point detection network to perform hand key point detection, so as to obtain hand key point detection results of the training samples, wherein the hand key point detection results corresponding to the virtual hand sample images include three-dimensional position information of each hand key point, and the hand key point detection results corresponding to the real hand sample images include two-dimensional position information of each hand key point;
the key point detection network adjusting module 330 is configured to calculate a key point loss corresponding to the training sample according to the hand key point detection result and the hand key point annotation information corresponding to the training sample, and adjust a network parameter of the initial key point detection network according to the key point loss until the key point loss meets the key point convergence condition, so as to obtain the key point detection network.
The depth training sample obtaining module 340 is configured to obtain a virtual hand image training sample set labeled with relative depth labeling information, where the relative depth labeling information is a relative depth of each pixel point with respect to a reference hand key point.
The sample depth detection module 350 is configured to input training samples in a virtual hand image training sample set into a trained key point detection network, obtain key point characteristics of a hand region of the training samples in a preset characteristic layer of the key point detection network, input the key point characteristics of the hand region into an initial depth detection network, and obtain a relative depth detection result of each pixel point in the training samples relative to a reference hand key point;
and the depth detection network adjusting module 360 is configured to calculate a depth loss corresponding to the training sample according to the relative depth detection result and the relative depth labeling information corresponding to the training sample, and adjust the network parameters of the initial depth detection network according to the depth loss until the depth loss meets a depth convergence condition, so as to obtain the depth detection network.
In the hand image depth determination device provided in this embodiment, the key point detection network is trained jointly on real hand images and virtual hand images: the virtual hand images let the network learn the characteristics of the hand key points well, while the real hand images let it learn the characteristics that distinguish the hand region from the background region. After the key point detection network is trained, the depth detection network is trained separately on virtual hand images, so that it learns the depth characteristics of the hand well and the accuracy of the network's predictions improves.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiment of the method, and will not be elaborated here.
Fig. 11 is a block diagram illustrating an electronic device according to an exemplary embodiment. Referring to Fig. 11, the electronic device includes a processor 410 and a memory 420; the processor 410 and the memory 420 communicate with each other via a bus 430.
The memory 420 stores instructions executable by the processor 410, and the processor 410 executes the instructions in the memory 420 to implement the hand image depth determination method described above.
In an exemplary embodiment, a storage medium comprising instructions, for example the memory 420 comprising instructions, executable by the processor 410 of the electronic device to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer-readable storage medium; for example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided a computer program product having instructions stored thereon which, when executed by a processor in an electronic device, implement the hand image depth determination method described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A hand image depth determination method, comprising:
acquiring an image to be processed containing a hand region;
inputting the image to be processed into a key point detection network to detect key points of the hand region;
and connecting a depth detection network to a preset feature layer of the key point detection network, so that key point features of the hand region obtained by the preset feature layer are input into the depth detection network for depth detection to obtain depth information of the hand region.
2. The hand image depth determination method according to claim 1, wherein the step of inputting the key point features of the hand region obtained by the preset feature layer into the depth detection network for depth detection to obtain the depth information of the hand region comprises:
inputting the key point features of the hand region into the depth detection network for depth detection to obtain a depth value of each pixel point of the hand region;
when the depth value of a pixel point is greater than or equal to a preset depth threshold, determining that the pixel point belongs to a background region;
when the depth value of a pixel point is smaller than the preset depth threshold, determining that the pixel point belongs to the hand region;
and obtaining the depth information of the hand region according to the depth values of all the pixel points belonging to the hand region.
3. The hand image depth determination method according to claim 1, wherein the step of inputting the key point features of the hand region obtained by the preset feature layer into the depth detection network for depth detection to obtain the depth information of the hand region comprises:
inputting the key point features of the hand region into the depth detection network for depth detection to obtain a relative depth value between each pixel point contained in the hand region and a preset reference hand key point;
when the relative depth value between a pixel point and the reference hand key point is greater than or equal to a preset relative depth threshold, determining that the pixel point belongs to a background region;
when the relative depth value between a pixel point and the reference hand key point is smaller than the preset relative depth threshold, determining that the pixel point belongs to the hand region;
and obtaining the depth information of the hand region according to the depth values of all the pixel points belonging to the hand region.
4. The hand image depth determination method according to claim 3, further comprising:
acquiring an absolute depth value corresponding to the reference hand key point;
and superimposing the relative depth value between each pixel point contained in the hand region and the reference hand key point onto the absolute depth value of the reference hand key point to obtain the absolute depth value of each pixel point contained in the hand region.
5. The hand image depth determination method according to any one of claims 1 to 4, wherein the key point detection network comprises a plurality of sequentially connected hourglass network structures, and the step of connecting a depth detection network to a preset feature layer of the key point detection network comprises:
connecting the depth detection network to the feature layer with the smallest size in the last of the hourglass network structures.
6. The hand image depth determination method according to claim 1, wherein the training process of the key point detection network comprises:
acquiring a key point training sample set containing hand key point annotation information, wherein the key point training sample set comprises a real hand sample image and a virtual hand sample image, the real hand sample image comprises two-dimensional hand key point annotation information, and the virtual hand sample image comprises three-dimensional hand key point annotation information;
inputting the training samples in the key point training sample set into an initial key point detection network to perform hand key point detection, and obtaining hand key point detection results of the training samples, wherein the hand key point detection results corresponding to the virtual hand sample images comprise three-dimensional position information of each hand key point, and the hand key point detection results corresponding to the real hand sample images comprise two-dimensional position information of each hand key point;
and calculating the key point loss corresponding to the training sample according to the hand key point detection result and the hand key point annotation information corresponding to the training sample, and adjusting the network parameters of the initial key point detection network according to the key point loss until the key point loss meets a key point convergence condition, so as to obtain the key point detection network.
7. The hand image depth determination method according to claim 1, wherein the training process of the depth detection network comprises:
acquiring a virtual hand image training sample set labeled with relative depth annotation information, wherein the relative depth annotation information is the relative depth of each pixel point with respect to a reference hand key point;
inputting the training samples in the virtual hand image training sample set into the trained key point detection network, and obtaining the key point features of the hand region of each training sample at a preset feature layer of the key point detection network;
inputting the key point features of the hand region of the training sample into an initial depth detection network to obtain a relative depth detection result of each pixel point in the training sample with respect to the reference hand key point;
and calculating the depth loss corresponding to the training sample according to the relative depth detection result and the relative depth annotation information corresponding to the training sample, and adjusting the network parameters of the initial depth detection network according to the depth loss until the depth loss meets a depth convergence condition, so as to obtain the depth detection network.
8. A hand image depth determination apparatus, comprising:
the image processing device comprises a to-be-processed image acquisition module, a to-be-processed image acquisition module and a processing module, wherein the to-be-processed image acquisition module is configured to acquire a to-be-processed image containing a hand area;
the key point detection module is configured to input the image to be processed into a key point detection network and detect key points of the hand area;
and a depth detection module configured to connect a depth detection network to a preset feature layer of the key point detection network, so that the key point features of the hand region obtained by the preset feature layer are input into the depth detection network for depth detection to obtain depth information of the hand region.
9. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the hand image depth determination method according to any one of claims 1 to 7.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the hand image depth determination method according to any one of claims 1 to 7.
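For illustration only, the sketch below wires up the pipeline recited in claims 1, 2, and 5: a stack of hourglass stages whose last, smallest feature map feeds a depth head, followed by depth-threshold masking of the background. The toy hourglass stages, channel counts, and threshold value are assumptions, not the claimed architecture.

```python
import torch
import torch.nn as nn

class ToyHourglass(nn.Module):
    """Stand-in for one hourglass stage (a real stacked hourglass is far
    deeper); it returns both its output and its smallest feature map."""
    def __init__(self, ch=16):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)

    def forward(self, x):
        smallest = torch.relu(self.down(x))   # bottleneck feature map
        return torch.relu(self.up(smallest)), smallest

class HandDepthPipeline(nn.Module):
    """Depth head hanging off the smallest feature layer of the LAST
    hourglass stage, followed by threshold masking of the background."""
    def __init__(self, num_stages=2, ch=16, depth_threshold=0.5):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, padding=1)
        self.stages = nn.ModuleList(ToyHourglass(ch) for _ in range(num_stages))
        self.depth_head = nn.Sequential(       # the depth detection network
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1))
        self.depth_threshold = depth_threshold

    def forward(self, image):
        x = torch.relu(self.stem(image))
        for stage in self.stages:
            x, smallest = stage(x)             # keep the last bottleneck
        depth = self.depth_head(smallest)      # per-pixel depth values
        hand_mask = depth < self.depth_threshold   # below threshold = hand
        return depth * hand_mask.float()       # depth info of the hand region

# e.g. HandDepthPipeline()(torch.randn(1, 3, 64, 64)) yields a (1, 1, 64, 64) map
```

One plausible motivation for attaching the head to the smallest feature layer, as claim 5 recites, is that it reuses the most compact encoding of the hand while keeping the depth branch small.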
CN202011102705.6A 2020-10-15 2020-10-15 Hand image depth determination method and device, electronic equipment and storage medium Active CN112233161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011102705.6A CN112233161B (en) 2020-10-15 2020-10-15 Hand image depth determination method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112233161A true CN112233161A (en) 2021-01-15
CN112233161B CN112233161B (en) 2024-05-17

Family

ID=74113738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011102705.6A Active CN112233161B (en) 2020-10-15 2020-10-15 Hand image depth determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112233161B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154466A (en) * 2017-12-19 2018-06-12 北京小米移动软件有限公司 Image processing method and device
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
CN109299685A (en) * 2018-09-14 2019-02-01 北京航空航天大学青岛研究院 Deduction network and its method for the estimation of human synovial 3D coordinate
CN111310528A (en) * 2018-12-12 2020-06-19 马上消费金融股份有限公司 Image detection method, identity verification method, payment method and payment device
CN110852311A (en) * 2020-01-14 2020-02-28 长沙小钴科技有限公司 Three-dimensional human hand key point positioning method and device
CN111428579A (en) * 2020-03-03 2020-07-17 平安科技(深圳)有限公司 Face image acquisition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Mengyan: "Research on Attribute Label Recognition and Key Point Localization of Clothing Images Based on Deep Learning", pages 3-4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287868A (en) * 2020-11-10 2021-01-29 上海依图网络科技有限公司 Human body action recognition method and device
CN112287868B (en) * 2020-11-10 2021-07-13 上海依图网络科技有限公司 Human body action recognition method and device
CN112861783A (en) * 2021-03-08 2021-05-28 北京华捷艾米科技有限公司 Hand detection method and system

Also Published As

Publication number Publication date
CN112233161B (en) 2024-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant