CN111652110A - Image processing method and device, electronic equipment and storage medium

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN111652110A
Authority
CN
China
Prior art keywords
depth, hand, depth image, image, frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010470551.XA
Other languages
Chinese (zh)
Inventor
谢符宝
刘文韬
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202010470551.XA priority Critical patent/CN111652110A/en
Publication of CN111652110A publication Critical patent/CN111652110A/en
Priority to PCT/CN2020/135937 priority patent/WO2021238163A1/en
Priority to TW110111667A priority patent/TW202145065A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/143Sensing or illuminating at different wavelengths
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure discloses an image processing method and device, electronic equipment and a storage medium. The method comprises the following steps: obtaining a detection frame of a hand in a first depth image of multiple frames of depth images, the first depth image being any one frame of the multiple frames of depth images; and extracting features from the depth image within the detection frame, and obtaining three-dimensional coordinate data of key points of the hand based on the extracted features.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Current schemes for detecting and tracking hands in images mostly use infrared images or color images from monocular or binocular cameras. However, infrared and color images cannot provide accurate three-dimensional information about the hands in the images, so hand detection and tracking cannot be made faster or more accurate.
Disclosure of Invention
The embodiment of the disclosure provides an image processing method and device, electronic equipment and a storage medium.
The embodiment of the present disclosure provides an image processing method, including: obtaining a detection frame of a hand in a first depth image of multiple frames of depth images, the first depth image being any one frame of the multiple frames of depth images; and extracting features from the depth image within the detection frame, and obtaining three-dimensional coordinate data of key points of the hand based on the extracted features.
In some optional embodiments of the disclosure, obtaining the detection frame of the hand in the first depth image of the multiple frames of depth images includes: in response to the first depth image being the first frame of the multiple frames of depth images, performing hand detection processing on the first depth image to obtain the detection frame of the hand in the first depth image; and in response to the first depth image being a non-first frame of the multiple frames of depth images, obtaining the detection frame of the hand in the first depth image based on a detection frame of the hand in a second depth image, the second depth image being the frame preceding the first depth image.
In some optional embodiments of the present disclosure, the obtaining a detection frame of a hand in the first depth image based on a detection frame of a hand in the second depth image includes: determining a first area based on an area where a detection frame of the hand in the second depth image is located; the first area is larger than the area where the detection frame is located; and determining a detection frame of the hand corresponding to the position range of the first area in the first depth image according to the first area.
In some optional embodiments of the present disclosure, before the feature extraction is performed on the depth image within the detection frame, the method further includes: determining the center depth of the hand in the detection frame, and performing centering processing on the depth image in the detection frame based on the center depth to obtain a centered depth image; the feature extraction of the depth image in the detection frame comprises the following steps: and performing feature extraction on the depth image subjected to the centralization treatment.
In some optional embodiments of the disclosure, the determining a center depth of the hand within the detection frame comprises: determining a center depth of a hand in the first depth image based on depth values of at least part of depth images within a detection box of the hand; the centering processing is performed on the depth image in the detection frame based on the center depth to obtain the depth image after centering processing, and the centering processing comprises the following steps: and adjusting the depth value of the depth image in the detection frame of the hand by using the central depth of the hand to obtain the depth image after centering processing.
In some optional embodiments of the present disclosure, the obtaining three-dimensional coordinate data of key points of the hand based on the extracted features includes: obtaining two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features; the two-dimensional image coordinate data is data in an image coordinate system; obtaining internal parameters of image acquisition equipment for acquiring the multi-frame depth images; obtaining three-dimensional coordinate data of key points of the hand based on the two-dimensional image coordinate data, the depth data and the internal parameters; the three-dimensional coordinate data is data in a camera coordinate system.
In some optional embodiments of the disclosure, the method further comprises: determining a pose of a hand based on the three-dimensional coordinate data of the hand; determining an interactive instruction based on the gesture of the hand.
An embodiment of the present disclosure further provides an image processing apparatus, including a first processing unit and a second processing unit, wherein:
the first processing unit is used for obtaining a detection frame of a hand of a first depth image in the multi-frame depth images; the first depth image is any one frame depth image in the multiple frames of depth images;
and the second processing unit is used for extracting the features of the depth image in the detection frame and obtaining the three-dimensional coordinate data of the key points of the hand part based on the extracted features.
In some optional embodiments of the present disclosure, the first processing unit is configured to, in response to the first depth image being the first frame of the multiple frames of depth images, perform hand detection processing on the first depth image to obtain a detection frame of the hand in the first depth image; and, in response to the first depth image being a non-first frame of the multiple frames of depth images, obtain a detection frame of the hand in the first depth image based on a detection frame of the hand in a second depth image, the second depth image being the frame preceding the first depth image.
In some optional embodiments of the present disclosure, the first processing unit is configured to determine a first region based on a region in which a detection frame of a hand in the second depth image is located; the first area is larger than the area where the detection frame is located; and determining a detection frame of the hand corresponding to the position range of the first area in the first depth image according to the first area.
In some optional embodiments of the present disclosure, the apparatus further includes a third processing unit, configured to determine a center depth of the hand in the detection frame, and perform centering processing on the depth image in the detection frame based on the center depth to obtain a centered depth image;
and the second processing unit is used for extracting the features of the depth image after the centralization processing.
In some optional embodiments of the disclosure, the third processing unit is configured to determine a center depth of the hand based on depth values of at least part of the depth images within a detection frame of the hand in the first depth image; and adjusting the depth value of the depth image in the detection frame of the hand by using the central depth of the hand to obtain the depth image after centering processing.
In some optional embodiments of the present disclosure, the second processing unit is configured to obtain two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features; the two-dimensional image coordinate data is data in an image coordinate system; obtaining internal parameters of image acquisition equipment for acquiring the multi-frame depth images; obtaining three-dimensional coordinate data of key points of the hand based on the two-dimensional image coordinate data, the depth data and the internal parameters; the three-dimensional coordinate data is data in a camera coordinate system.
In some optional embodiments of the present disclosure, the apparatus further comprises a fourth processing unit for determining a pose of a hand based on the three-dimensional coordinate data of the hand; determining an interactive instruction based on the gesture of the hand.
The disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method of the disclosed embodiments.
The embodiment of the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the method according to the embodiment of the present disclosure are implemented.
According to the image processing method and device, the electronic equipment and the storage medium, the detection frame of the hand is obtained by utilizing the depth image detection, and then the accurate three-dimensional coordinate data of the hand is obtained based on the depth image in the detection frame of the hand, so that the accurate detection of the three-dimensional coordinate of the hand is realized.
In addition, according to the embodiment of the present disclosure, the detection frame of the hand is obtained by performing the hand detection processing on the first frame depth image, and the detection frame of the hand in the subsequent depth image is obtained based on the detection frame of the hand in the previously obtained depth image, thereby realizing the tracking of the hand using the depth image.
Drawings
Fig. 1 is a first schematic flowchart of an image processing method according to an embodiment of the disclosure;
FIG. 2 is a diagram illustrating key points of a hand in an image processing method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram illustrating a structure of a key point detection network in the image processing method according to the embodiment of the disclosure;
FIG. 4 is a second flowchart illustrating an image processing method according to an embodiment of the disclosure;
FIG. 5 is a first schematic diagram illustrating a first exemplary configuration of an image processing apparatus according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram of a second exemplary embodiment of an image processing apparatus;
fig. 7 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and specific embodiments.
The embodiment of the disclosure provides an image processing method. Fig. 1 is a first schematic flowchart of an image processing method according to an embodiment of the disclosure; as shown in fig. 1, the method comprises:
step 101: obtaining a detection frame of a hand of a first depth image in the multi-frame depth images; the first depth image is any one frame depth image in the multi-frame depth images;
step 102: and extracting the features of the depth image in the detection frame, and obtaining the three-dimensional coordinate data of the key points of the hand part based on the extracted features.
The image processing method of the embodiment is applied to an image processing device; the image processing apparatus may be located in any electronic device having image processing capabilities. In some examples, the electronic device may be a computer, a cell phone, a Virtual Reality (VR) device, an Augmented Reality (AR) device, or like user device; in other examples, the electronic device may also be a server or the like. In each embodiment of the present application, an electronic device is taken as an example for explanation.
The multiple frames of depth images in this embodiment can be acquired through a built-in or external image acquisition device, which can be a depth image acquisition device. Illustratively, the depth image acquisition device may be implemented by at least one of a depth camera, a 3D structured light camera component, a Time of Flight (TOF) camera component, and a laser radar component. In some optional embodiments, the electronic device may obtain the multiple frames of depth images through a built-in or external image acquisition device. In other alternative embodiments, the electronic device may also obtain, through a communication component, multiple frames of depth images transmitted by other electronic devices; in that case, the multiple frames of depth images are acquired by an image acquisition device built into or externally connected to the other electronic devices.
In some optional embodiments of the present application, the depth image may include two-dimensional image data and depth data; the two-dimensional image data represents a planar image of the acquired target scene; alternatively, the two-dimensional image may be a red, green, blue (RGB) image, and may also be a grayscale image. The depth data represents the distance between the image acquisition device and each object in the acquired target scene.
In this embodiment, the detection frame of the hand in each depth image is detected, and the detection and tracking of the hand in the multi-frame depth images are further realized through the detection frame of the hand. Wherein the hand in the depth image may be a hand of a real character or a virtual character.
In some embodiments, the hand detection may be performed on the first depth image through the target detection network, so as to obtain a detection frame of the hand of the first depth image. The target detection network can be obtained through sample image training, a detection frame of the hand is marked in the sample image, and the marking range of the detection frame comprises the area where the hand is located.
For example, feature extraction may be performed on the first depth image through the target detection network. Taking the case where the two-dimensional image data contained in the first depth image is RGB data as an example, the first depth image may include data of four dimensions, namely R data, G data, B data and the depth data, and the data of these four dimensions is used as the input data of the target detection network. Feature extraction is performed on the input data through the target detection network, and the center point of the hand and the height and width of the detection frame corresponding to the hand in the first depth image are determined based on the extracted features; the detection frame of the hand is then determined based on the center point of the hand and the height and width of the detection frame corresponding to the hand.
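As an illustration only, the following sketch (Python with NumPy; the helper names are hypothetical and the detection network itself is not specified by this disclosure) shows how the four-dimensional R, G, B and depth input may be assembled and how a predicted center point plus box height and width may be turned into a detection frame:

```python
import numpy as np

def build_rgbd_input(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack the R, G, B data and the depth data into the four-dimensional
    (four-channel) input described above.
    rgb:   H x W x 3 two-dimensional image data
    depth: H x W     depth data (distance per pixel)
    """
    return np.concatenate([rgb, depth[..., None]], axis=-1)  # H x W x 4

def box_from_center(cx: float, cy: float, width: float, height: float):
    """Turn the predicted hand center point and the predicted box width/height
    into a (left, top, right, bottom) detection frame."""
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)
```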
In some possible embodiments, feature extraction may be performed on each of the multiple frames of depth images through the target detection network, so as to obtain a detection frame of the hand in each frame of depth image based on the extracted features.
The target detection network may be implemented by a Convolutional Neural Network (CNN). For example, the target detection network may be a faster region-based convolutional neural network (Faster R-CNN).
Therefore, the embodiment of the disclosure obtains the detection frame of the hand by using the depth image detection, and then obtains the accurate three-dimensional coordinate data of the key point of the hand based on the depth image in the detection frame of the hand, thereby realizing the accurate detection of the three-dimensional coordinate of the hand; the method can further detect the hand in each depth image based on the target detection network to obtain the detection frame of the hand in each depth image, thereby obtaining accurate three-dimensional coordinate data of key points of the hand based on the depth image in the detection frame of the hand, and realizing hand detection and tracking by using the depth image.
In other possible embodiments, obtaining a detection frame of a hand of a first depth image of the multiple frames of depth images includes: in response to the condition that the first depth image is the first frame depth image in the multi-frame depth images, performing hand detection processing on the first depth image to obtain a detection frame of a hand of the first depth image; in response to the condition that the first depth image is a non-first-frame depth image in the multi-frame depth images, obtaining a detection frame of the hand in the first depth image based on a detection frame of the hand in the second depth image; the second depth image is a frame image before the first depth image.
In this embodiment, for the first frame of depth image in the multiple frames of depth images (that is, in response to the first depth image being the first frame of the multiple frames of depth images), hand detection is performed on that first frame through the target detection network to obtain the detection frame of the hand in the first frame of depth image; this hand detection process may refer to the foregoing embodiment of determining the detection frame of the hand through the target detection network, and details are not repeated here. For a frame of depth image after the first frame (that is, in response to the first depth image being a non-first frame of the multiple frames of depth images), the detection frame of the hand in the first depth image to be subjected to target detection is determined based on the detection frame of the hand in the frame of depth image preceding it (namely, the second depth image).
It can be understood that, in this embodiment, feature extraction may be performed on a first frame depth image in a multi-frame depth image through a target detection network, so that a detection frame of a hand in the first frame depth image is obtained based on the extracted features; and tracking to obtain a hand detection frame in the depth image to be processed of the next frame based on the hand detection frame in the first frame of depth image or the three-dimensional coordinate data of the hand in the first frame of depth image.
Therefore, the embodiment of the disclosure obtains the detection frame of the hand by using depth image detection and then obtains accurate three-dimensional coordinate data of the key points of the hand based on the depth image within the detection frame of the hand, thereby realizing accurate detection of the three-dimensional coordinates of the hand. Further, the hand in the first frame of depth image can be detected with the target detection network to obtain its detection frame, and the detection frames of the hand in subsequent depth images can be obtained based on the detection frame in the first frame of depth image, without performing target detection on the complete image data of every frame: target detection only needs to be performed on the depth image within a certain region of each subsequent frame to obtain the detection frame of the hand in that frame. Accurate three-dimensional coordinate data of the key points of the hand is thus obtained based on the depth image within the detection frame of the hand, and detection and tracking of the hand using depth images is realized while greatly reducing the amount of data to be processed.
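A compact way to read this branching is sketched below (Python; `detector` and `track_from_previous_box` are hypothetical stand-ins for the target detection network and the tracking step, which the disclosure does not fix):

```python
def get_hand_box(depth_image, frame_index, prev_box, detector, track_from_previous_box):
    """First frame: run full hand detection; later frames: derive the hand
    detection frame from the previous frame's detection frame."""
    if frame_index == 0 or prev_box is None:
        return detector(depth_image)                        # hand detection processing
    return track_from_previous_box(depth_image, prev_box)   # box from previous frame's box
```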
In some optional embodiments of the present application, obtaining a detection frame of the hand in the first depth image based on the detection frame of the hand in the second depth image comprises: determining a first area based on an area where a detection frame of the hand in the second depth image is located; the first area is larger than the area where the detection frame is located; and determining a detection frame of the hand corresponding to the position range of the first area in the first depth image according to the first area.
In this embodiment, take the example of determining the detection frame of the hand in the next frame of depth image (for example, the first depth image) from the detection frame of the hand in the first frame of depth image: the detection frame of the hand in the first frame of depth image may be enlarged to obtain the first region. For example, if the detection frame of the hand in the first frame of depth image is rectangular, with height H and width W, then the four sides of the area where the detection frame is located may be extended outward, away from the center point of that area, with the center point as the center; if each side is extended by H/4 in the height direction and W/4 in the width direction, the first region is a rectangular region in the first frame of depth image with a height of 3H/2 and a width of 3W/2, centered on the same center point. Further, the detection frame of the hand within the corresponding position range in the subsequent frame of depth image (i.e., the first depth image) may be determined based on the position range of the first region in the first frame of depth image.
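A minimal sketch of this expansion, assuming a (left, top, right, bottom) box representation and adding clipping to the image bounds (the clipping is an assumption, not stated above):

```python
def expand_detection_box(box, img_w, img_h, ratio=0.25):
    """Extend each side of the detection frame outward from its center by
    `ratio` of the box width/height (W/4 and H/4 for ratio=0.25), yielding a
    first region of roughly 3W/2 x 3H/2 centered on the same point."""
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    dx, dy = ratio * w, ratio * h
    return (max(0.0, left - dx), max(0.0, top - dy),
            min(float(img_w), right + dx), min(float(img_h), bottom + dy))
```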
In some optional embodiments, determining, from the first region, a detection frame of the hand in the first depth image corresponding to the position range of the first region may include: performing limb key point detection processing on the depth image corresponding to the first area in the second depth image to obtain first key point information; the obtained first key point information represents the predicted key points of the hand; determining a first position range of the first key point information in a second depth image, and determining a second position range corresponding to the first position range in the first depth image, wherein the second position range is used as a prediction detection frame of the hand; and carrying out target detection processing on the depth image in the second position range in the first depth image to obtain a detection frame of the hand in the first depth image.
In some optional embodiments, determining, from the first region, a detection frame of the hand in the first depth image corresponding to the position range of the first region may include: determining a first position range of the first area in a second depth image; determining a second position range corresponding to the first position range in the first depth image; and carrying out target detection processing on the depth image in the second position range in the first depth image to obtain a detection frame of the hand in the first depth image.
In some embodiments, in response to that the first depth image is a non-first depth image in the multiple-frame depth images, obtaining the detection frame of the hand in the first depth image based on the detection frame of the hand in the second depth image may further include: and determining the detection frame of the hand in the first depth image based on a second depth image of the detection frame marked with the hand and the target tracking network, wherein the second depth image is a frame of image before the first depth image.
In this embodiment, the detection frame of the hand in the next frame image (i.e., the first depth image) can be determined by using the previous frame image (i.e., the second depth image) and the detection frame of the hand marked in the image through a pre-trained target tracking network. For example, the second depth image including the hand detection frame may be input to the target tracking network, and the hand detection frame in the first depth image may be obtained. The target tracking network may adopt any network structure capable of realizing target tracking, which is not limited in this embodiment.
In this embodiment, the target tracking network may be obtained by training a multi-frame sample image labeled with a position of a hand (e.g., a detection frame including the hand). For example, taking an example that the multi-frame sample image at least includes a first sample image and a second sample image, the first sample image may be processed by using a target tracking network, the first sample image is marked with a detection frame of a hand, and the processing result is a predicted position of the hand in the second sample image; the loss can be determined according to the predicted position and the labeled position of the hand in the second sample image, and the network parameters of the target tracking network can be adjusted based on the loss.
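For illustration, one possible training step could look as follows (PyTorch-style sketch; the smooth L1 loss, the network interface and the helper name are assumptions, since the disclosure fixes neither the loss nor the network structure):

```python
import torch.nn.functional as F

def tracking_train_step(tracking_net, optimizer, first_image, first_box, second_box_gt):
    """Process the first sample image (with its labelled hand detection frame)
    to predict the hand position in the second sample image, compute a loss
    against the labelled position, and update the network parameters."""
    optimizer.zero_grad()
    pred_box = tracking_net(first_image, first_box)     # predicted position in the second sample
    loss = F.smooth_l1_loss(pred_box, second_box_gt)    # loss vs. the labelled position (assumed)
    loss.backward()
    optimizer.step()
    return loss.item()
```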
Therefore, under the condition that the first depth image is a non-first-frame depth image, the detection frame of the hand in each depth image is tracked, and then the hand key point detection is carried out based on the tracked depth image in the detection frame, so that the data to be processed is greatly reduced in the key point detection process of the hand, the data processing amount is reduced to a certain extent, and the detection and tracking of the hand using the depth image are realized.
In other optional embodiments, in response to the first depth image being a non-first depth image in the multiple-frame depth images, determining a first region based on a region in which a detection frame of the hand is located in the second depth image; the first area is larger than the area where the detection frame is located; the second depth image is a frame image before the first depth image; determining a first position range of the first area in a second depth image; determining a second position range corresponding to the first position range in the first depth image; and extracting features of the depth image in the second position range in the first depth image, and obtaining three-dimensional coordinate data of key points of the hand part based on the extracted features.
In this embodiment, taking the second depth image as the first frame depth image as an example, the detection frame of the hand in the first frame depth image may be determined by the target detection network, the detection frame of the hand is enlarged to obtain the first region, and the region range (i.e., the first position range) of the first region in the first frame depth image is used as the prediction range (i.e., the second position range) of the region in which the hand is located in the subsequent first depth image. And directly carrying out key point detection processing on the depth image in the second position range in the first depth image to obtain three-dimensional coordinate data of the key points of the hand in the first depth image.
Thus, by adopting the embodiment, the target detection is carried out on the first frame depth image without carrying out the target detection on the non-first frame depth image, the data processing steps are simplified, the data processing amount is reduced to a certain extent, the hand detection and tracking by using the depth image are realized, and the accurate detection of the three-dimensional coordinates of the hand is realized.
In some optional embodiments of the present application, obtaining three-dimensional coordinate data of key points of the hand based on the extracted features includes: obtaining two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features; the two-dimensional image coordinate data is data in an image coordinate system; obtaining internal parameters of image acquisition equipment for acquiring multi-frame depth images; obtaining three-dimensional coordinate data of key points of the hand based on the two-dimensional image coordinate data, the depth data and the internal parameters; the three-dimensional coordinate data is data in a camera coordinate system.
In this embodiment, feature extraction may be performed on the depth image in the detection frame based on the key point detection network, and three-dimensional coordinate data of the key points of the hand is obtained based on the extracted features. In some alternative embodiments, referring to FIG. 2, the key points of the hand may include at least one of: wrist key points, joint key points of the fingers, fingertip (TIP) key points of the fingers, and the like; the joint key points of the fingers include at least one of: the metacarpophalangeal point (MCP), the proximal interphalangeal point (PIP), and the distal interphalangeal point (DIP). The fingers may include at least one of: the thumb, the index finger, the middle finger, the ring finger, and the little finger. As shown in FIG. 2, the wrist key point may include key point P1; the thumb key points may include at least one of P2, P3, and P4; the index finger key points may include at least one of P5, P6, P7, and P8; the middle finger key points may include at least one of P9, P10, P11, and P12; the ring finger key points may include at least one of P13, P14, P15, and P16; and the little finger key points may include at least one of P17, P18, P19, and P20.
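The P1-P20 numbering of FIG. 2 can be captured in a simple index map, for example (a hypothetical convention; only the grouping follows the description above):

```python
# Hypothetical index map for the 20 hand key points P1..P20 of FIG. 2:
# the wrist, the thumb joints/tip, and MCP, PIP, DIP and TIP key points
# for each of the remaining four fingers.
HAND_KEYPOINTS = {
    "wrist":  [1],
    "thumb":  [2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}
NUM_KEYPOINTS = sum(len(v) for v in HAND_KEYPOINTS.values())  # 20
```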
In this embodiment, as shown in fig. 3, the key point detection network may include a backbone network, a 2D branch network for predicting the two-dimensional image coordinate data of the hand, and a depth branch network for predicting the depth data of the hand. The backbone network can comprise a plurality of convolutional layers; the depth image in the detection frame is subjected to convolution processing through these convolutional layers to obtain a feature map corresponding to the depth image. For example, the feature map obtained by processing the depth image in the detection frame through the backbone network may be a heat map. The obtained feature map is then input into the 2D branch network and the depth branch network respectively.
In this embodiment, on the one hand, the feature map may be processed through the 2D branch network to obtain two-dimensional image coordinate data of the key points of the hand (for example, the key points shown in fig. 2), where the two-dimensional image coordinate data represents two-dimensional coordinates in an image coordinate system. The image coordinate system is a two-dimensional rectangular coordinate system established on the imaging plane, taking the upper left corner of the two-dimensional image as the coordinate origin and the horizontal and vertical directions as the X axis and Y axis respectively. For example, the image coordinate system may be a rectangular coordinate system with pixels as units, where the abscissa u and the ordinate v of a pixel represent its column number and row number in the image respectively. On the other hand, the feature map may be processed through the depth branch network to obtain depth data of the key points of the hand (for example, the key points shown in fig. 2). In some alternative embodiments, the depth branch network may be a fully connected network, and the processing of the fully connected network yields the depth data of the key points of the hand.
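A minimal sketch of such a network is given below (PyTorch; the number of convolutional layers, the channel widths and the exact output form of each branch are illustrative assumptions, not values taken from this disclosure):

```python
import torch
import torch.nn as nn

class HandKeypointNet(nn.Module):
    """Backbone (several convolutional layers) -> feature map, which is fed to
    a 2D branch (per-key-point heat maps, from which image coordinates can be
    read out) and a depth branch (fully connected, one depth value per key point)."""

    def __init__(self, num_keypoints: int = 20, in_channels: int = 1):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.branch_2d = nn.Conv2d(128, num_keypoints, kernel_size=1)   # 2D branch
        self.branch_depth = nn.Sequential(                              # depth branch
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_keypoints),
        )

    def forward(self, depth_crop: torch.Tensor):
        feat = self.backbone(depth_crop)     # feature map produced by the backbone
        heatmaps = self.branch_2d(feat)      # per-key-point 2D responses
        depths = self.branch_depth(feat)     # per-key-point depth data
        return heatmaps, depths
```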
In this embodiment, the two-dimensional image coordinate data obtained through the key point detection network is data in an image coordinate system, and represents the position of the key point of the hand in the image, so that the coordinate data needs to be converted to obtain three-dimensional coordinate data of the key point of the hand in a camera coordinate system. The camera coordinate system is a three-dimensional rectangular coordinate system established by taking a focus center (or optical center) of a camera (namely, image acquisition equipment) as a coordinate origin, taking an X axis and a Y axis which are parallel to an image plane and taking an optical axis as a Z axis.
In some alternative embodiments, the conversion may be performed as follows: internal parameters of the image acquisition device that acquires the multiple frames of depth images are obtained, a conversion matrix is determined based on the internal parameters, and the two-dimensional image coordinate data and the depth data are converted through the conversion matrix to obtain three-dimensional coordinate data of the key points of the hand in the camera coordinate system. For example, the internal parameters of the image acquisition device may include, but are not limited to, at least one of the position of the focus center (or optical center) of the image acquisition device in the image coordinate system and the pixel focal length values of the image acquisition device; the position of the focus center (or optical center) of the image acquisition device in the image coordinate system may also be understood as the coordinates of the origin of the camera coordinate system in the image coordinate system. Illustratively, taking two-dimensional image data of height h and width w as an example, the position of the focus center (or optical center) of the image acquisition device in the image coordinate system can be recorded as (w/2, h/2).
For example, three-dimensional coordinate data of key points of the hand in the camera coordinate system can be obtained by referring to the following formula (1):
$$\begin{cases} X = \dfrac{(u - u_0)\, z}{f_x} \\ Y = \dfrac{(v - v_0)\, z}{f_y} \\ Z' = z \end{cases} \tag{1}$$

wherein (X, Y, Z') is the three-dimensional coordinate data of a key point of the hand in the camera coordinate system: (X, Y) are the coordinates in the plane of the X axis and Y axis of the camera coordinate system, and Z' is the value on the Z axis of the camera coordinate system; (u, v) are the 2D coordinates of the key point of the hand in the image coordinate system, and z is the depth data; (u_0, v_0) are the coordinates of the focus center (or optical center) of the image acquisition device in the image coordinate system; and f_x and f_y are the pixel focal length values of the image acquisition device in the horizontal-axis and vertical-axis directions.
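A direct implementation of formula (1) could look like this (Python/NumPy sketch; the vectorised layout of the inputs is an assumption):

```python
import numpy as np

def image_to_camera_coords(uv, z, u0, v0, fx, fy):
    """Convert key point 2D image coordinates (u, v) and depth z into 3D points
    in the camera coordinate system using formula (1):
        X = (u - u0) * z / fx,   Y = (v - v0) * z / fy,   Z' = z
    (u0, v0): optical center in the image coordinate system;
    fx, fy:   pixel focal length values along the horizontal/vertical axes."""
    uv = np.asarray(uv, dtype=float)            # shape (N, 2)
    z = np.asarray(z, dtype=float)              # shape (N,)
    X = (uv[:, 0] - u0) * z / fx
    Y = (uv[:, 1] - v0) * z / fy
    return np.stack([X, Y, z], axis=-1)         # shape (N, 3)
```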
In this way, the embodiment of the present disclosure obtains the detection frame of the hand by using depth image detection, and then obtains accurate three-dimensional coordinate data of the key points of the hand based on the depth image within the detection frame of the hand. Specifically, the conversion matrix between the image coordinate system and the camera coordinate system is determined using the internal parameters of the image acquisition device (e.g., the position of its focus center (or optical center) in the image coordinate system and its pixel focal length values), and accurate three-dimensional coordinate data of the key points of the hand in the camera coordinate system is obtained based on that conversion matrix, thereby realizing accurate detection of the three-dimensional coordinates of the hand and providing more accurate three-dimensional coordinates of the key points of the hand.
Based on the foregoing embodiment, the embodiment of the present disclosure further provides an image processing method. FIG. 4 is a second flowchart illustrating an image processing method according to an embodiment of the disclosure; as shown in fig. 4, the method includes:
step 201: performing hand detection processing on a first frame depth image in the multi-frame depth images to obtain a detection frame of a hand of the first frame depth image;
step 202: obtaining a detection frame of the hand in the first depth image based on the detection frame of the hand in the first frame depth image; the first depth image is a non-first frame depth image in the multi-frame depth images;
step 203: determining the center depth of the hand in the detection frame in the first depth image, and performing centering processing on the depth image in the detection frame based on the center depth to obtain a depth image after centering processing;
step 204: performing feature extraction on the depth image subjected to centering processing, and obtaining two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features; the two-dimensional image coordinate data is data in an image coordinate system;
step 205: obtaining internal parameters of image acquisition equipment for acquiring multi-frame depth images;
step 206: obtaining three-dimensional coordinate data of key points of the hand based on the two-dimensional image coordinate data, the depth data and the internal parameters; the three-dimensional coordinate data is data in a camera coordinate system.
In this embodiment, the execution sequence of steps 201 to 206 is not limited to the above. Illustratively, at any step prior to step 206, the internal parameters of the image capture device that captures the multiple frames of depth images may be obtained.
The first depth image in this embodiment is a non-first frame of the multiple frames of depth images, and may be any frame after the first frame of depth image, for example, the second frame, the third frame, and so on. In other words, the first depth image is a depth image following the first frame for which the detection frame of the hand is to be obtained, and the detection frame of the hand in the first depth image is obtained based on the detection frame of the hand in the first frame of depth image.
The specific implementation manner of the detection frame for the first frame depth image and the hand in the first depth image in this embodiment may refer to the specific description of the foregoing embodiment, and details are not repeated here.
In this embodiment, after the detection frame of the hand in the first depth image is obtained, the depth image in the detection frame is centered. In some alternative embodiments, step 203 may comprise: determining the center depth of the hand in the detection frame, and performing centering processing on the depth image in the detection frame based on the center depth to obtain a depth image after centering processing, wherein the centering processing comprises the following steps: determining a center depth of a hand in the first depth image based on depth values of at least part of depth images within a detection box of the hand; and adjusting the depth value of the depth image in the detection frame of the hand by using the central depth of the hand to obtain the depth image after centering processing.
In this embodiment, in some implementations, a median of the depth values of the depth images within the detection box may be determined, with the median as the center depth of the hand; and subtracting the center depth from the depth value corresponding to each pixel in the depth image in the detection frame to obtain the depth image after centering processing. In further alternative embodiments, a median of the depth values of the hand regions in the depth image within the detection box may be determined, with the median as the center depth of the hand; and subtracting the center depth from the depth value corresponding to each pixel in the depth image in the detection frame to obtain the depth image after centering processing.
In other embodiments, the average value of the depth values of the depth images in the detection frame may be determined, and the average value may be used as the center depth of the hand; subtracting the center depth from the depth value corresponding to each pixel in the depth image in the detection frame to obtain a depth image subjected to centering processing; alternatively, a mean value of the depth values of the hand region in the depth image within the detection frame may be determined, the mean value being the center depth of the hand; and subtracting the center depth from the depth value corresponding to each pixel in the depth image in the detection frame to obtain the depth image after centering processing.
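A minimal sketch of this centering step (Python/NumPy; the optional hand-mask argument is an assumption covering the "hand region" variant described above):

```python
import numpy as np

def center_depth_crop(depth_crop: np.ndarray, use_median: bool = True,
                      hand_mask: np.ndarray = None) -> np.ndarray:
    """Take the median (or mean) of the depth values inside the detection
    frame - optionally only over the hand region given by `hand_mask` - as
    the hand's center depth, then subtract it from every pixel of the crop."""
    values = depth_crop[hand_mask] if hand_mask is not None else depth_crop
    center = np.median(values) if use_median else float(values.mean())
    return depth_crop - center
```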
Therefore, the centering processing facilitates subsequent data processing and reduces its difficulty and complexity.
In the present embodiment, specific reference may be made to the foregoing embodiment for the detailed process of step 204 to step 206, and the difference is that in the present embodiment, feature extraction may be performed on the depth image after centering processing based on the key point detection network, and two-dimensional image coordinate data and depth data of the key point of the hand may be obtained based on the extracted features.
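Putting steps 201 to 206 together, and reusing the hypothetical helpers sketched earlier (expand_detection_box, center_depth_crop, image_to_camera_coords), one frame could be processed roughly as follows; `detector` and `keypoint_net` stand in for the target detection and key point detection networks, whose concrete forms the disclosure does not fix:

```python
import numpy as np

def process_frame(depth_image, frame_index, prev_box, detector, keypoint_net,
                  u0, v0, fx, fy):
    """Steps 201-202: detect the hand box on the first frame, otherwise derive
    it from the previous frame's box; step 203: center the cropped depth image;
    step 204: predict key point 2D coordinates and depth; steps 205-206: map to
    camera-coordinate 3D points using the intrinsics."""
    h, w = depth_image.shape
    if frame_index == 0 or prev_box is None:
        box = detector(depth_image)
    else:
        box = expand_detection_box(prev_box, w, h)          # predicted hand region
    left, top, right, bottom = (int(round(c)) for c in box)
    crop = center_depth_crop(depth_image[top:bottom, left:right])
    uv, z = keypoint_net(crop)                              # image coords + depth per key point
    uv = np.asarray(uv) + np.array([left, top])             # crop coords -> full-image coords
    return image_to_camera_coords(uv, z, u0, v0, fx, fy), box
```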
Based on the foregoing embodiment, the method of the embodiment of the present application may further include: determining the posture of the hand based on the three-dimensional coordinate data of the hand; an interactive instruction is determined based on the hand gesture.
For the tracked hand, the posture of the hand can be determined based on its three-dimensional coordinate data, the corresponding interactive instruction can be determined based on that posture, and the interactive instruction corresponding to each posture can then be responded to.
This embodiment is suitable for motion-based interaction scenarios. For example, a depth image containing a hand can be collected by the electronic device, the corresponding interactive instruction can be determined according to the posture of the hand, and, in response to the interactive instruction, some functions of the electronic device can be executed, or the instruction can be sent to other electronic devices so that those devices execute the corresponding functions.
The embodiment is also suitable for various application scenes such as virtual reality, augmented reality or somatosensory games. The electronic device may be, for example, VR glasses, AR glasses, or the like. The method comprises the steps of collecting a depth image containing a hand through electronic equipment, determining a corresponding interactive instruction according to the posture of the hand, and responding to the interactive instruction, for example, executing corresponding actions aiming at various virtual objects in an AR scene, a VR scene or a somatosensory game scene.
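As a toy illustration of this mapping (entirely an assumption; the disclosure does not fix any concrete gesture set or instruction), a pinch could be detected from the thumb-tip and index-tip key points and mapped to a hypothetical "select" instruction:

```python
import numpy as np

def interactive_instruction(keypoints_3d: dict, pinch_threshold: float = 0.02) -> str:
    """If the thumb tip (P4) and the index finger tip (P8) are closer than
    `pinch_threshold` (in the camera coordinate system's units), treat the
    posture as a pinch and return a hypothetical "select" instruction."""
    d = np.linalg.norm(np.asarray(keypoints_3d["P4"], dtype=float)
                       - np.asarray(keypoints_3d["P8"], dtype=float))
    return "select" if d < pinch_threshold else "none"
```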
The embodiment of the disclosure also provides an image processing apparatus. FIG. 5 is a first schematic structural diagram of an image processing apparatus according to an embodiment of the disclosure; as shown in fig. 5, the apparatus includes a first processing unit 31 and a second processing unit 32, wherein:
the first processing unit 31 is configured to obtain a detection frame of a hand in a first depth image of the multiple frames of depth images, the first depth image being any one frame of the multiple frames of depth images;
the second processing unit 32 is configured to extract features of the depth image in the detection frame, and obtain three-dimensional coordinate data of a key point of the hand based on the extracted features.
In some optional embodiments of the present disclosure, the first processing unit 31 is configured to, in response to the first depth image being the first frame of the multiple frames of depth images, perform hand detection processing on the first depth image to obtain a detection frame of the hand in the first depth image; and, in response to the first depth image being a non-first frame of the multiple frames of depth images, obtain a detection frame of the hand in the first depth image based on a detection frame of the hand in a second depth image, the second depth image being the frame preceding the first depth image.
In some optional embodiments of the present disclosure, the first processing unit 31 is configured to determine a first area based on an area where a detection frame of the hand in the second depth image is located; the first area is larger than the area where the detection frame is located; and determining a detection frame of the hand corresponding to the position range of the first area in the first depth image according to the first area.
In some optional embodiments of the present disclosure, as shown in fig. 6, the apparatus further includes a third processing unit 33, configured to determine a center depth of the hand in the detection frame, and perform a centering process on the depth image in the detection frame based on the center depth to obtain a centered depth image;
the second processing unit 32 is configured to perform feature extraction on the depth image after the centering processing.
In some optional embodiments of the present disclosure, the third processing unit 33 is configured to determine a center depth of the hand based on depth values of at least part of the depth images within the detection frame of the hand in the first depth image; and adjusting the depth value of the depth image in the detection frame of the hand by using the central depth of the hand to obtain the depth image after centering processing.
In some optional embodiments of the present disclosure, the second processing unit 32 is configured to obtain two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features; the two-dimensional image coordinate data is data in an image coordinate system; obtaining internal parameters of image acquisition equipment for acquiring the multi-frame depth images; obtaining three-dimensional coordinate data of key points of the hand based on the two-dimensional image coordinate data, the depth data and the internal parameters; the three-dimensional coordinate data is data in a camera coordinate system.
In some optional embodiments of the present disclosure, the apparatus further includes a fourth processing unit, configured to determine a posture of the hand based on the three-dimensional coordinate data of the hand; and determining an interactive instruction based on the posture of the hand.
In the embodiment of the present disclosure, the first processing unit 31, the second processing unit 32, the third processing unit 33, and the fourth processing unit in the image processing apparatus may, in practical applications, be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: the image processing apparatus provided in the above embodiment is exemplified by the division of each program module when performing image processing, and in practical applications, the processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The embodiment of the disclosure also provides an electronic device. Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure, as shown in fig. 7, the electronic device includes a memory 42, a processor 41, and a computer program stored in the memory 42 and executable on the processor 41, and when the processor 41 executes the computer program, the steps of the image processing method according to the embodiment of the present disclosure are implemented.
In this embodiment, the various components in the electronic device are coupled together by a bus system 43. It will be appreciated that the bus system 43 is used to enable communications among the components. The bus system 43 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 43 in fig. 7.
It will be appreciated that the memory 42 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory can be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 42 described in the embodiments of the present disclosure is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed by the embodiment of the present disclosure may be applied to the processor 41, or implemented by the processor 41. The processor 41 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 41. The processor 41 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 41 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in memory 42, where processor 41 reads the information in memory 42 and in combination with its hardware performs the steps of the method described above.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Control Units (MCUs), microprocessors, or other electronic components, for performing the foregoing methods.
In an exemplary embodiment, the disclosed embodiment further provides a computer readable storage medium, such as a memory 42, comprising a computer program, which is executable by a processor 41 of an electronic device to perform the steps of the aforementioned method. The computer readable storage medium can be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
The disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the image processing method described in the disclosed embodiments.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and there may be other ways of division in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by program instructions and related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk or an optical disc, or various other media that can store program code.
Alternatively, if the above integrated unit of the present disclosure is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk or an optical disc, or various other media that can store program code.
The above description covers only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed herein shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (16)

1. An image processing method, characterized in that the method comprises:
obtaining a detection frame of a hand in a first depth image of multiple frames of depth images; the first depth image is any one of the multiple frames of depth images;
and performing feature extraction on the depth image within the detection frame, and obtaining three-dimensional coordinate data of key points of the hand based on the extracted features.
2. The method of claim 1, wherein obtaining the detection frame of the hand in the first depth image of the multiple frames of depth images comprises:
in response to the first depth image being the first frame of the multiple frames of depth images, performing hand detection processing on the first depth image to obtain the detection frame of the hand in the first depth image;
and in response to the first depth image being a frame other than the first frame of the multiple frames of depth images, obtaining the detection frame of the hand in the first depth image based on a detection frame of the hand in a second depth image; the second depth image is a frame preceding the first depth image.
3. The method of claim 2, wherein obtaining the detection frame of the hand in the first depth image based on the detection frame of the hand in the second depth image comprises:
determining a first area based on the area where the detection frame of the hand in the second depth image is located; the first area is larger than the area where the detection frame is located;
and determining, according to the first area, a detection frame of the hand in the first depth image corresponding to the position range of the first area.
4. The method of any one of claims 1 to 3, wherein before performing feature extraction on the depth image in the detection frame, the method further comprises:
determining the center depth of the hand in the detection frame, and performing centering processing on the depth image in the detection frame based on the center depth to obtain a centered depth image;
wherein performing feature extraction on the depth image in the detection frame comprises:
performing feature extraction on the centered depth image.
5. The method of claim 4, wherein determining the center depth of the hand within the detection frame comprises:
determining the center depth of the hand based on depth values of at least part of the depth image within the detection frame of the hand in the first depth image;
wherein performing the centering processing on the depth image in the detection frame based on the center depth to obtain the centered depth image comprises:
adjusting the depth values of the depth image within the detection frame of the hand by using the center depth of the hand to obtain the centered depth image.
6. The method according to any one of claims 1 to 5, wherein the obtaining three-dimensional coordinate data of key points of the hand based on the extracted features comprises:
obtaining two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features; the two-dimensional image coordinate data is data in an image coordinate system;
obtaining intrinsic parameters of an image acquisition device that acquires the multiple frames of depth images;
and obtaining three-dimensional coordinate data of the key points of the hand based on the two-dimensional image coordinate data, the depth data, and the intrinsic parameters; the three-dimensional coordinate data is data in a camera coordinate system.
7. The method according to any one of claims 1 to 6, further comprising:
determining a pose of the hand based on the three-dimensional coordinate data of the hand;
and determining an interactive instruction based on the pose of the hand.
8. An image processing apparatus, characterized in that the apparatus comprises: a first processing unit and a second processing unit; wherein:
the first processing unit is configured to obtain a detection frame of a hand in a first depth image of multiple frames of depth images; the first depth image is any one of the multiple frames of depth images;
and the second processing unit is configured to perform feature extraction on the depth image within the detection frame and obtain three-dimensional coordinate data of key points of the hand based on the extracted features.
9. The apparatus according to claim 8, wherein the first processing unit is configured to: in response to the first depth image being the first frame of the multiple frames of depth images, perform hand detection processing on the first depth image to obtain the detection frame of the hand in the first depth image; and in response to the first depth image being a frame other than the first frame of the multiple frames of depth images, obtain the detection frame of the hand in the first depth image based on a detection frame of the hand in a second depth image; the second depth image is a frame preceding the first depth image.
10. The apparatus of claim 9, wherein the first processing unit is configured to: determine a first area based on the area where the detection frame of the hand in the second depth image is located, the first area being larger than the area where the detection frame is located; and determine, according to the first area, a detection frame of the hand in the first depth image corresponding to the position range of the first area.
11. The apparatus according to any one of claims 8 to 10, further comprising a third processing unit, configured to determine a center depth of the hand in the detection frame, perform centering processing on the depth image in the detection frame based on the center depth, and obtain a centered depth image;
and the second processing unit is configured to perform feature extraction on the centered depth image.
12. The apparatus according to claim 11, wherein the third processing unit is configured to: determine the center depth of the hand based on depth values of at least part of the depth image within the detection frame of the hand in the first depth image; and adjust the depth values of the depth image within the detection frame of the hand by using the center depth of the hand to obtain the centered depth image.
13. The apparatus according to any one of claims 8 to 12, wherein the second processing unit is configured to: obtain two-dimensional image coordinate data and depth data of key points of the hand based on the extracted features, the two-dimensional image coordinate data being data in an image coordinate system; obtain intrinsic parameters of an image acquisition device that acquires the multiple frames of depth images; and obtain three-dimensional coordinate data of the key points of the hand based on the two-dimensional image coordinate data, the depth data, and the intrinsic parameters, the three-dimensional coordinate data being data in a camera coordinate system.
14. The apparatus according to any one of claims 8 to 13, further comprising a fourth processing unit configured to determine a pose of the hand based on the three-dimensional coordinate data of the hand, and to determine an interactive instruction based on the pose of the hand.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.
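The box-propagation strategy recited in claims 2 and 3 can be illustrated with a short, non-limiting sketch. This is a minimal example assuming depth frames stored as NumPy arrays; the 1.5x expansion factor, the clamping to the image borders, and the detect_fn detector stub are assumptions introduced here for illustration and are not taken from the disclosure.

def expand_box(box, img_h, img_w, scale=1.5):
    # Enlarge the previous frame's detection frame to obtain the "first area"
    # of claim 3; box is (x1, y1, x2, y2) in pixel coordinates.
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))

def propagate_box(depth_frame, prev_box, detect_fn):
    # Search for the hand only inside the enlarged area of the current frame
    # and map the detected box back to full-image coordinates (claims 2 and 3).
    rx1, ry1, rx2, ry2 = expand_box(prev_box, depth_frame.shape[0], depth_frame.shape[1])
    crop = depth_frame[int(ry1):int(ry2), int(rx1):int(rx2)]
    bx1, by1, bx2, by2 = detect_fn(crop)   # hypothetical hand detector run on the crop
    return (bx1 + rx1, by1 + ry1, bx2 + rx1, by2 + ry1)

For the first frame of the sequence, the detector would instead be run on the full depth image, as recited in claim 2.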
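The centering processing of claims 4 and 5 can be sketched in the same spirit. The use of the median over valid (non-zero) depth pixels as the "center depth" is an assumption made here for illustration; the claims only require that the center depth be derived from at least part of the depth image within the detection frame.

import numpy as np

def center_depth_crop(depth_frame, box):
    # Crop the depth image inside the hand's detection frame, estimate the
    # center depth of the hand, and shift the valid depth values so that they
    # are distributed around zero (claims 4 and 5).
    x1, y1, x2, y2 = (int(v) for v in box)
    crop = depth_frame[y1:y2, x1:x2].astype(np.float32)
    valid = crop[crop > 0]                              # ignore missing depth readings
    center = float(np.median(valid)) if valid.size else 0.0
    centered = np.where(crop > 0, crop - center, 0.0)   # centered depth image
    return centered, center

The centered crop, rather than the raw crop, is what would then be passed to feature extraction.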
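Claim 6 combines two-dimensional image coordinates, depth values, and camera intrinsics into camera-frame coordinates. A standard pinhole back-projection is one way to realize this step; the pinhole model and the example intrinsic values below are assumptions for illustration, not values from the disclosure.

import numpy as np

def keypoints_to_camera(uv, z, fx, fy, cx, cy):
    # Back-project 2D keypoints with per-point depth into camera coordinates:
    #   X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy
    uv = np.asarray(uv, dtype=np.float32)    # (N, 2) pixel coordinates
    z = np.asarray(z, dtype=np.float32)      # (N,) depth values
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)       # (N, 3) points in the camera frame

For example, with fx = fy = 600, cx = 320, cy = 240, a keypoint at pixel (320, 240) with a depth of 0.5 m maps to (0, 0, 0.5) in the camera coordinate system.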
CN202010470551.XA 2020-05-28 2020-05-28 Image processing method and device, electronic equipment and storage medium Withdrawn CN111652110A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010470551.XA CN111652110A (en) 2020-05-28 2020-05-28 Image processing method and device, electronic equipment and storage medium
PCT/CN2020/135937 WO2021238163A1 (en) 2020-05-28 2020-12-11 Image processing method and apparatus, electronic device, and storage medium
TW110111667A TW202145065A (en) 2020-05-28 2021-03-30 Image processing method, electronic device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010470551.XA CN111652110A (en) 2020-05-28 2020-05-28 Image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111652110A true CN111652110A (en) 2020-09-11

Family

ID=72344008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010470551.XA Withdrawn CN111652110A (en) 2020-05-28 2020-05-28 Image processing method and device, electronic equipment and storage medium

Country Status (3)

Country Link
CN (1) CN111652110A (en)
TW (1) TW202145065A (en)
WO (1) WO2021238163A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004107266A1 (en) * 2003-05-29 2004-12-09 Honda Motor Co., Ltd. Visual tracking using depth data
CN107749069B (en) * 2017-09-28 2020-05-26 联想(北京)有限公司 Image processing method, electronic device and image processing system
CN111652110A (en) * 2020-05-28 2020-09-11 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130336524A1 (en) * 2012-06-18 2013-12-19 Microsoft Corporation Dynamic Hand Gesture Recognition Using Depth Data
CN108171133A (en) * 2017-12-20 2018-06-15 华南理工大学 A kind of dynamic gesture identification method of feature based covariance matrix
CN108594997A (en) * 2018-04-16 2018-09-28 腾讯科技(深圳)有限公司 Gesture framework construction method, apparatus, equipment and storage medium
CN110796018A (en) * 2019-09-30 2020-02-14 武汉科技大学 Hand motion recognition method based on depth image and color image

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021238163A1 (en) * 2020-05-28 2021-12-02 北京市商汤科技开发有限公司 Image processing method and apparatus, electronic device, and storage medium
CN112465890A (en) * 2020-11-24 2021-03-09 深圳市商汤科技有限公司 Depth detection method and device, electronic equipment and computer readable storage medium
CN114419738A (en) * 2022-03-29 2022-04-29 北京市商汤科技开发有限公司 Attitude detection method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
WO2021238163A1 (en) 2021-12-02
TW202145065A (en) 2021-12-01

Similar Documents

Publication Publication Date Title
KR101840563B1 (en) Method and device for reconstructing 3d face using neural network
WO2020156143A1 (en) Three-dimensional human pose information detection method and apparatus, electronic device and storage medium
WO2021238163A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN109453517B (en) Virtual character control method and device, storage medium and mobile terminal
CN110866977B (en) Augmented reality processing method, device, system, storage medium and electronic equipment
WO2023024697A1 (en) Image stitching method and electronic device
CN111080776B (en) Human body action three-dimensional data acquisition and reproduction processing method and system
CN110213491B (en) Focusing method, device and storage medium
JP2022550948A (en) 3D face model generation method, device, computer device and computer program
EP3035242B1 (en) Method and electronic device for object tracking in a light-field capture
Shen et al. Distortion-tolerant monocular depth estimation on omnidirectional images using dual-cubemap
CN113228105A (en) Image processing method and device and electronic equipment
CN115035546A (en) Three-dimensional human body posture detection method and device and electronic equipment
CN111079470A (en) Method and device for detecting living human face
WO2021098554A1 (en) Feature extraction method and apparatus, device, and storage medium
CN114093024A (en) Human body action recognition method, device, equipment and storage medium
CN115049819A (en) Watching region identification method and device
CN116778533A (en) Palm print full region-of-interest image extraction method, device, equipment and medium
CN112106347A (en) Image generation method, image generation equipment, movable platform and storage medium
CN116012875A (en) Human body posture estimation method and related device
CN112750157B (en) Depth image generation method and device
CN111258413A (en) Control method and device of virtual object
CN115623313A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN115393962A (en) Motion recognition method, head-mounted display device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40027848
Country of ref document: HK

WW01 Invention patent application withdrawn after publication

Application publication date: 20200911