CN114066814A - Gesture 3D key point detection method of AR device and electronic device - Google Patents

Gesture 3D key point detection method of AR device and electronic device

Info

Publication number
CN114066814A
CN114066814A (application CN202111218181.1A)
Authority
CN
China
Prior art keywords
image
key point
information
coordinates
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111218181.1A
Other languages
Chinese (zh)
Inventor
朱铭德 (Zhu Mingde)
丛林 (Cong Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co ltd
Original Assignee
Hangzhou Yixian Advanced Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yixian Advanced Technology Co ltd filed Critical Hangzhou Yixian Advanced Technology Co ltd
Priority to CN202111218181.1A priority Critical patent/CN114066814A/en
Publication of CN114066814A publication Critical patent/CN114066814A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/06Topological mapping of higher dimensional structures onto lower dimensional surfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a gesture 3D key point detection method for AR equipment, comprising the following steps: a first image and a second image are acquired through a binocular camera; rough 2D key points of the palm in the first image are obtained through a detection network or from the previous frame image; more accurate optimized 2D key point information is then obtained through an optimization network, and monocular 3D key points are calculated from the optimized 2D key points. The monocular 3D key points are projected onto the second image, the relative depth information and 2D key point coordinates of the second image are acquired through the optimization network, and finally the 2D key point coordinates of the first image and the second image are fused to calculate the binocular gesture 3D key points and the palm size. This method solves the problem of poor accuracy of gesture 3D key point prediction results in the related art while balancing accuracy and efficiency.

Description

Gesture 3D key point detection method of AR device and electronic device
Technical Field
The application relates to the field of augmented reality, in particular to a gesture 3D key point detection method of an AR device and an electronic device.
Background
With the development of Augmented Reality (AR) technology, gesture interaction has become an indispensable part of it. Through natural, habit-conforming gestures, the user can perform interactive operations on 3D objects in the augmented reality scene, such as clicking, selecting, moving and zooming.
In the related art, the common detection methods of gesture 3D key points for AR devices are as follows:
1. Predict the gesture key points using a depth camera module with a large field of view; however, AR devices adopting this scheme generally suffer from high power consumption and high cost.
2. Acquire an image through a monocular camera and estimate the 3D key points by combining the monocular image with a preset human hand model or hand size. To realize interaction with this method, the user's palm size must be measured in advance, which runs counter to the original purpose of natural interaction and adds cumbersome steps.
3. Acquire two images with a binocular camera, run a 2D algorithm on each image to obtain 2D key points, and calculate the 3D key points by combining the intrinsic and extrinsic parameters of the binocular camera. However, in complicated situations such as occlusion, the two acquired images carry a certain degree of 2D error, which causes a larger error in the finally output 3D key points, so problems such as hand-shape distortion and sudden jumps in the position of the operated virtual object easily occur. In addition, since this method must run the algorithm twice for each frame, it cannot balance efficiency and precision.
At present, no effective solution is provided for the problem of poor accuracy of the gesture 3D key point prediction result in the related art.
Disclosure of Invention
The embodiment of the application provides a gesture 3D key point detection method of an AR device and an electronic device, and aims to at least solve the problem that the accuracy of a gesture 3D key point prediction result in the related art is poor.
In a first aspect, an embodiment of the present application provides a gesture 3D key point detection method for an AR device, the method including:
step 101: acquiring a first image and a second image through a binocular camera, wherein the first image and the second image are respectively captured by a first camera module and a second camera module of the binocular camera at the same moment;
step 102: inquiring whether a previous frame image containing palm information exists, if so, acquiring rough 2D key point information of a palm in the first image based on the previous frame image, and if not, acquiring rough 2D key point information of the palm in the first image through a key point detection network;
step 103: performing a preprocessing step on the rough 2D key point information to obtain an input image and input key point information for optimizing a network, and acquiring relative depth information of the first image and coordinates of the optimized 2D key point through the optimization network;
step 104: calculating monocular 3D key point coordinates of the first image according to the size information of the palm, the relative depth information of the first image and the optimized 2D key point coordinates;
step 105: judging whether the frame sequence of the first image is an integral multiple of a preset frame interval, if so, executing step 106, and if not, executing step 108;
step 106: projecting the monocular 3D key point coordinates to the second image, and acquiring relative depth information of the second image and optimized 2D key point coordinates through the optimization network;
step 107: calculating binocular 3D key point coordinates by a least square method based on a preset optimization problem, the relative depth information of the second image and the optimized 2D key point coordinates, and updating the size information of the palm according to the binocular 3D key point coordinates;
step 108: initializing the previous frame image according to the binocular 3D key point coordinates or the monocular 3D key point coordinates, and re-executing the steps 102 to 108.
In some embodiments, the obtaining rough 2D keypoint information of the palm in the first image through the keypoint detection network comprises:
extracting the characteristics of the first image by the backbone network to obtain a down-sampled characteristic diagram;
converting the characteristic diagram into an output layer with preset dimensionality;
and analyzing the output layer, and acquiring the left-right hand information and the rough 2D key point information in the first image according to the confidence coefficient.
In some embodiments, the obtaining the relative depth information of the first image and the optimized 2D keypoint coordinates through the optimization network comprises:
acquiring the input image, and preliminarily extracting the characteristics of the input image to obtain a first initial characteristic diagram;
acquiring the input key point information, rendering according to Gaussian distribution based on the input key point information to obtain a second initial feature map, and splicing the second initial feature map and the first initial feature map at a channel layer to obtain a first fusion feature map;
and based on the first fusion feature map, performing feature extraction and convolution operation through a backbone network to obtain relative depth information of the first image and an optimized 2D key point coordinate.
In some embodiments, the obtaining the relative depth information of the first image and the optimized 2D keypoint coordinates by performing feature extraction and convolution operations through a backbone network includes:
the backbone network extracts the features in the first fusion feature map to obtain a second fusion feature map;
the second fusion feature map is subjected to convolution processing of a first preset dimension to obtain a first key point feature map, and the optimized 2D key point coordinate of the first image is calculated according to the first key point feature map;
further extracting features of the second fused feature map to obtain a second key point feature map, wherein the second key point feature map has the same size as the first key point feature map;
multiplying the first keypoint feature map by the second keypoint feature map, the result of the multiplication being average-pooled to obtain the relative depth information of the first image, and,
adding convolution of a second preset dimension to the second fusion feature map to obtain a one-dimensional score of the second fusion feature map;
and judging whether the one-dimensional score is greater than a preset score threshold; if so, the palm is considered robust and calculation of the optimized 2D key point coordinates continues; otherwise, the palm is considered lost and step 102 is executed.
In some embodiments, the projecting the monocular 3D keypoint coordinates to the second image and obtaining the relative depth information and optimized 2D keypoint coordinates of the second image over the optimization network comprises:
projecting the monocular 3D key point coordinate to the second image according to the external parameters of the first camera module and the second camera module;
acquiring a 2D key point coordinate in the second image, and performing a preprocessing step on the 2D key point coordinate to obtain an input image and input key point information for the optimized network;
acquiring an input image of the second image, and preliminarily extracting the characteristics of the input image to obtain a first initial characteristic diagram;
rendering according to Gaussian distribution to obtain a second initial feature map based on the input key point information of the second image, and splicing the second initial feature map and the first initial feature map at a channel layer to obtain a first fusion feature map;
and based on the first fusion feature map, performing feature extraction and convolution operation through a backbone network to obtain relative depth information of the second image and an optimized 2D key point coordinate.
In some embodiments, the performing a preprocessing step on the coarse 2D keypoint information to obtain an input image and input keypoint information for optimizing a network includes:
calculating a key point bounding box according to the rough 2D key point information;
determining the side length of a square according to the size of the key point bounding box and the pixel distance between the key points;
cropping a square area according to the side length of the square and the center of the key point bounding box to obtain a cropped image;
scaling the cropped image to a preset size to obtain the input image for the optimization network;
and calculating the coordinates of the key points in the cropped image to obtain the input key point information for the optimization network.
In some of these embodiments, said calculating 3D keypoint coordinates of said first image from said size information, relative depth information of said first image and optimized 2D keypoint coordinates comprises:
calculating the absolute position of the palm according to the size information, the optimized 2D key point coordinates and the internal reference information of the first camera module:
calculating the monocular 3D keypoint coordinates according to the absolute position, the relative depth information of the first image and the optimized 2D keypoint coordinates by the following formula:
$Z_k = D + d_k, \qquad X_k = \dfrac{(x_k - c_x)\,Z_k}{k_x}, \qquad Y_k = \dfrac{(y_k - c_y)\,Z_k}{k_y}$
wherein $(X_k, Y_k, Z_k)$ are the 3D keypoint coordinates, $(x_k, y_k)$ are the optimized 2D keypoint coordinates, $D$ is the absolute position, $d_k$ is the relative depth information of the first image, and $k_x$, $k_y$, $c_x$, $c_y$ are respectively internal parameters of the first camera module.
In some embodiments, the initializing the previous frame image according to the binocular 3D keypoint coordinates or the monocular 3D keypoint coordinates comprises:
directly initializing the previous frame image according to the binocular 3D key point coordinates or the monocular 3D key point coordinates, or,
and predicting the change information of the next frame image according to the binocular 3D key point coordinates or the monocular 3D key point coordinates, and initializing the previous frame image by combining the change information.
In some embodiments, the updating the size information of the palm according to the binocular 3D keypoint coordinates includes:
and calculating the size information of the palm according to the binocular 3D key points, and filtering the size information based on a preset filtering rule to update the size information of the palm.
In a second aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for detecting a gesture keypoint of an AR device according to the first aspect.
Compared with the related art, the gesture 3D key point detection method of the AR device provided by the embodiment of the application has the following beneficial effects:
1. By continuously refining on the basis of the known rough 2D key points, the optimization network can calculate optimized 2D key points with higher accuracy.
2. By projecting the monocular 3D result onto the other image and refining it with the optimization network, the binocular 2D key points are kept as consistent as possible; as a result, the finally obtained 3D key points avoid abnormal jumps, the hand shape is more stable, and the requirement of continuous, stable interaction can be met.
3. The binocular algorithm does not need to run on every frame of image; according to the hardware performance it can be flexibly set to run once every N frames, with only the monocular algorithm running otherwise. On the premise of guaranteed precision, this balances efficiency and reduces power consumption.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic view of an application scenario of a gesture 3D key point detection method of an AR device according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for detecting gesture 3D key points of an AR device according to an embodiment of the present application;
FIG. 3 is a flow chart of coarse 2D keypoints acquisition by detecting a network according to an embodiment of the application;
FIG. 4 is a schematic illustration of a heatmap according to an embodiment of the present application;
FIG. 5 is a schematic illustration of a palm keypoint according to an embodiment of the present application;
FIG. 6 is a flow diagram of optimizing a network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a second initial feature map according to an embodiment of the present application;
fig. 8 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The gesture 3D key point detection method for AR devices provided by the embodiments of the present application can be applied to the application scenario shown in fig. 1; fig. 1 is a schematic diagram of that scenario according to an embodiment of the present application. As shown in fig. 1, after the AR device finishes initialization, an augmented reality scene is displayed within the user's field of view; the AR device captures the user's interaction gestures through its camera, generates operation instructions in the augmented reality scene according to those gestures, and controls the AR device based on the instructions. Detection of the gesture key points is one of the important links in this interaction: the binocular-camera algorithm fully fuses the picture information of the left and right cameras, so more accurate gesture key points can be acquired, which improves the control stability of the AR device while balancing efficiency and precision.
Fig. 2 is a flowchart of a method for detecting a gesture key point of an AR device according to an embodiment of the present application, where as shown in fig. 2, the flowchart includes the following steps:
s201: acquiring a first image and a second image through a binocular camera, wherein the first image and the second image are shot by a first camera module and a second camera module of the binocular camera at the same moment respectively;
wherein, this binocular camera can set up on head-mounted display device, also can set up on the hardware equipment that possesses AR operational environment such as smart mobile phone and panel computer.
S202: inquiring whether a previous frame image containing palm information exists, if so, acquiring rough 2D key point information of a palm in the first image based on the previous frame image, and if not, acquiring rough 2D key point information of the palm in the first image through a key point detection network;
It should be noted that in some cases the binocular camera may not have a previous frame image; for example, none exists when the binocular camera is turned on for the first time. In addition, even if a previous frame image exists, it may not contain palm information, for example when the user's palm moves out of the camera's field of view partway through.
This embodiment handles the above cases flexibly: it first judges whether a previous frame image containing palm information currently exists; if it does, the rough palm 2D key points are acquired by combining the previous frame image with the current image; if it does not, the rough 2D key points of the current image are acquired through the detection network.
S203: performing a preprocessing step on the rough 2D key point information to obtain an input image and input key point information for optimizing a network, and acquiring relative depth information of the first image and coordinates of the optimized 2D key point through the optimization network;
in this embodiment, the purpose of performing preprocessing on the coarse 2D keypoint information is to convert the information into a form that can be identified and processed by an optimized network;
the optimization network carries out operation and prediction based on the input image and the input key point information, and can obtain more accurate optimized 2D key points and relative depth information on the basis of original rough 2D key points.
S204: calculating monocular 3D key point coordinates of the first image according to the size information of the palm, the relative depth information of the first image and the optimized 2D key point coordinates;
the size information of the palm is determined according to the distance between the key points, and can also be manually specified.
S205: judging whether the frame sequence of the first image is an integral multiple of a preset frame interval, if so, executing a step S206, and if not, executing a step S208;
It can be understood that when step S206 is executed in sequence, the binocular 3D key points continue to be calculated on the basis of the monocular 3D key points in combination with the second image, and the binocular 3D result is finally used for key point tracking; when the flow jumps to step S208, key point tracking is performed directly with the monocular 3D result.
In the conventional binocular scheme, the binocular 2D algorithm is run once on both the left and right pictures of every frame to obtain the 2D results, which are then fused into a 3D result; hence the conventional scheme cannot skip frames and is inefficient.
In this embodiment, whether to go beyond the 3D result of the first image and further run the algorithm in combination with the second image depends on the frame sequence of the first image and the preset frame interval. The preset frame interval can be set flexibly according to the computing capability of the hardware: when computing power is ample it can be set to a small value, such as 1; otherwise to a larger value, such as 5. In this way the binocular algorithm runs at a reduced frame rate, achieving both precision and efficiency.
S206: projecting the monocular 3D key point coordinates to a second image according to external parameters between the first camera module and the second camera module, and acquiring relative depth information of the second image and optimizing 2D key point coordinates through an optimization network;
It should be noted that, because the 2D key points of the second image are initialized with the 3D result of the first image, the information of the two images is fully fused; the resulting 2D result therefore tends to be consistent with the hand shape of the first image, and the actually calculated 3D error is much smaller.
S207: calculating coordinates of binocular 3D key points by a least square method based on a preset optimization problem, relative depth information of a second image and optimized 2D key point coordinates, and updating size information of the palm according to the coordinates of the binocular 3D key points;
The construction of the preset optimization problem is intuitive: compute a set of 3D points such that the pixel error of their projections onto the two views is as small as possible and the deviation from the binocular relative-depth estimates is also as small as possible. The optimization target is $E = E_{pix\_err} + \alpha E_{depth\_err}$, where $E_{pix\_err}$ and $E_{depth\_err}$ are the pixel error and the depth error respectively, and $\alpha$ controls the weight between them and can be tuned according to the actual situation.
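For concreteness, the following is a minimal sketch of how step S207 could be posed with scipy.optimize.least_squares. The residual definitions (reprojection error in both views plus an α-weighted relative-depth term measured against the hand's mean depth) and all function names are assumptions consistent with the optimization target above, not the patent's reference implementation:

```python
import numpy as np
from scipy.optimize import least_squares

def project(P, K):
    """Pinhole projection of (N, 3) camera-frame points with 3x3 intrinsics K."""
    uv1 = (K @ (P / P[:, 2:3]).T).T    # normalize by depth, apply intrinsics
    return uv1[:, :2]

def fuse_binocular(P0, kpts_l, kpts_r, d_l, d_r, K_l, K_r, T_rl, alpha=1.0):
    """Least-squares refinement of 21 3D keypoints (left-camera frame).

    P0:     (21, 3) initial monocular 3D keypoints
    kpts_*: (21, 2) optimized 2D keypoints in the left/right view
    d_*:    (21,)   relative depths predicted for each view
    T_rl:   (4, 4)  extrinsics taking left-camera points into the right camera
    alpha:  weight between pixel and depth error, as in E = E_pix + alpha * E_depth
    """
    def residuals(x):
        P = x.reshape(-1, 3)
        P_r = (T_rl[:3, :3] @ P.T).T + T_rl[:3, 3]
        e_pix = np.concatenate([(project(P, K_l) - kpts_l).ravel(),
                                (project(P_r, K_r) - kpts_r).ravel()])
        # relative-depth residual: per-keypoint depth minus the hand's mean depth,
        # compared against the network prediction (an assumed definition)
        e_dep = np.concatenate([P[:, 2] - P[:, 2].mean() - d_l,
                                P_r[:, 2] - P_r[:, 2].mean() - d_r])
        return np.concatenate([e_pix, alpha * e_dep])

    sol = least_squares(residuals, P0.ravel())
    return sol.x.reshape(-1, 3)
```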
In addition, because the more accurate binocular 3D key points are recalculated, the palm size information obtained by updating according to the binocular 3D key points is more accurate.
S208: initializing the previous frame image in step S202 according to the binocular 3D key point coordinates or the monocular 3D key point coordinates, and re-executing steps S202 to S208.
It should be noted that, corresponding to step S205 above, when only the monocular algorithm runs on the first image, key point tracking and the initialization of the next iteration's previous frame image are performed with the monocular 3D result (the monocular 3D key point coordinates of step S204); when the binocular algorithm also runs in combination with the second image, they are performed with the binocular 3D result (the binocular 3D key point coordinates of step S207).
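Putting steps S201 through S208 together, the overall loop can be sketched as follows. This is a hedged outline only: `stages` is any object providing the pipeline callbacks, and every callback name (detect_keypoints, preprocess, refine, lift_to_3d, fuse_binocular, update_palm_size, project_prev_3d) is a placeholder standing in for the stage of the same name above, not an API defined by the patent; the 8.5 cm default follows the manually specified palm size mentioned later in the text:

```python
def track_hand(frames, stages, N=5, palm_size=8.5):
    """Sketch of the S201-S208 loop: monocular every frame, binocular every N frames.

    `frames` yields synchronized (left, right) image pairs captured at the same moment.
    """
    prev_3d = None
    for idx, (img_l, img_r) in enumerate(frames):
        if prev_3d is not None:
            coarse = stages.project_prev_3d(prev_3d)            # S202: previous frame
        else:
            coarse = stages.detect_keypoints(img_l)             # S202: detection network
        inp_img, inp_kpts = stages.preprocess(img_l, coarse)    # S203: crop + remap
        kpts_l, depth_l, ok = stages.refine(inp_img, inp_kpts)  # S203: optimization net
        if not ok:                           # palm lost: redetect on the next frame
            prev_3d = None
            continue
        P = stages.lift_to_3d(kpts_l, depth_l, palm_size)       # S204: monocular 3D
        if idx % N == 0:                                        # S205: frame interval
            kpts_r, depth_r = stages.refine_second_view(img_r, P)           # S206
            P = stages.fuse_binocular(P, kpts_l, kpts_r, depth_l, depth_r)  # S207
            palm_size = stages.update_palm_size(palm_size, P)               # S207
        prev_3d = P                                             # S208: next-frame init
        yield P
```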
In some embodiments, if there is no previous frame image containing palm information, the rough 2D key point information of the palm in the first image is obtained through the key point detection network. Fig. 3 is a flowchart of obtaining rough 2D key points through the detection network according to an embodiment of the present application; as shown in fig. 3, the flow includes the following steps:
S301: acquire the first image and scale it to a uniform size, for example 256 × 192 × 1.
S302: extract features of the first image with the backbone network to obtain a down-sampled feature map. The backbone network may be a common module such as ResNet, MobileNet or Pelee, or a stack of custom convolutions, as long as it has sufficient feature-extraction capability. The down-sampled feature map obtained through the backbone network has size W × H × C; for example, with 32× down-sampling the feature size is 8 × 6 × 256.
S303: convert the feature map into an output layer of preset dimension, where a 1 × 1 convolution taking C channels to (2 + 2K) channels can be used to bring the output layer to the desired dimension W × H × (2 + 2K), K being the number of key points.
S304: parse the output layer, obtaining the left/right-hand information and the rough 2D key point coordinates in the first image according to the confidences of the respective dimensions.
The first two channels of the output layer respectively represent the confidence that a hand is present and the left/right-hand classification; the last 2K channels carry the position information of each 2D key point. Optionally, the specific parsing steps are as follows:
The first step: traverse the first channel over all W × H positions and find every position whose score is greater than a set threshold (for example, 0.5); these positions are palm areas;
The second step: judge left or right hand from the second channel: a value greater than 0 means a right hand, otherwise a left hand. In this embodiment only the highest-scoring group of data is kept for each of the left and right hands, which greatly reduces the time consumed by parsing;
The third step: calculate the coarse 2D key points of the first image. Assuming the down-sampling rate is F, the output heatmap is denoted H. Fig. 4 is a schematic diagram of a heatmap according to an embodiment of the present application; as shown in fig. 4, the data of the ith column and jth row are processed to calculate the coordinates of the kth key point according to the following formula 1:
Equation 1: $\hat{p}_k = F \cdot \big( i + H_{i,j,\,2+2k},\; j + H_{i,j,\,3+2k} \big)$
where i and j denote the ith column and jth row in the heatmap, $\hat{p}_k$ is the coordinate of the kth key point decoded at that position, F is the preset down-sampling magnification, and H denotes the heatmap, whose channels $2+2k$ and $3+2k$ store the per-cell offsets of key point k (channel layout as in the output-layer description above).
It should be noted that at most two hands of the user appear in the AR scene, one left and one right. This characteristic serves as prior knowledge for the detection network: it only needs to find the best-matching candidate for the left hand and for the right hand in the first image, and need not execute the computationally complex NMS step common in detection networks, which improves efficiency and reduces time consumption.
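A sketch of this parsing logic follows. The exact channel layout (channel 0 hand confidence, channel 1 left/right score, channels 2 onward the per-cell offsets) is an assumption consistent with the description and equation 1, not a layout fixed by the patent:

```python
import numpy as np

def parse_detection(H, F=32, thresh=0.5):
    """Decode a (W, H, 2 + 2K) output layer into coarse 2D keypoints per hand.

    Only the best-scoring cell per hand (left/right) is kept, so no NMS is needed.
    """
    W, Hh, C = H.shape
    K = (C - 2) // 2
    best = {}
    for i in range(W):
        for j in range(Hh):
            score = H[i, j, 0]
            if score <= thresh:              # not a palm area
                continue
            hand = "right" if H[i, j, 1] > 0 else "left"
            if hand in best and best[hand][0] >= score:
                continue
            # equation 1: cell index plus stored offset, scaled back by the
            # down-sampling magnification F
            kpts = np.array([[F * (i + H[i, j, 2 + 2 * k]),
                              F * (j + H[i, j, 3 + 2 * k])] for k in range(K)])
            best[hand] = (score, kpts)
    return {hand: kpts for hand, (score, kpts) in best.items()}
```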
In some embodiments, if there is a previous frame image containing palm information, the rough 2D key points of the palm in the first image are acquired based on the previous frame image. The specific process includes:
projecting the 3D key points of the previous frame image onto the first image according to the intrinsic parameters of the first camera module to obtain 21 rough 2D key points, whose coordinates are calculated according to the following formula 2:
Equation 2: $\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K_r \begin{pmatrix} X/Z \\ Y/Z \\ 1 \end{pmatrix}$
where $K_r$ is the intrinsic matrix of the first camera module, $(u, v)$ is the projected rough 2D key point, and X, Y and Z are the values of the 3D key point on each coordinate axis.
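Equation 2 is the standard pinhole projection; a minimal sketch, assuming P holds the previous frame's 21 × 3 key points in the first camera's coordinate frame and K_r is its 3 × 3 intrinsic matrix:

```python
import numpy as np

def project_prev_3d(P, K_r):
    """Equation 2: project (21, 3) 3D keypoints to pixel coordinates."""
    normalized = P / P[:, 2:3]             # (X/Z, Y/Z, 1) per keypoint
    return (K_r @ normalized.T).T[:, :2]   # coarse 2D keypoints in pixels
```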
In some embodiments, before input into the optimization network, the rough 2D key points are further preprocessed to obtain the input image and input key point information that the optimization network can identify, where the preprocessing includes the following steps:
The first step: construct a bounding box of the 21 key points according to the rough 2D key point information, where optionally the width and height of the key point bounding box are w and h respectively and the top-left vertex coordinate is $(x_{tl}, y_{tl})$;
The second step: taking the center of the key point bounding box as the center, cut out a square area of side length s, filling with pure black wherever the square exceeds the image, where s is calculated by the following formula 3:
Equation 3: $s = \max\{\, w\,\alpha_1,\; h\,\alpha_1,\; dis_{0\_9}\,\alpha_2 \,\}$
where $dis_{0\_9}$ is the pixel distance between key points 0 and 9 shown in fig. 5 (fig. 5 is a schematic diagram of the palm key points according to an embodiment of the present application), and $\alpha_1$ and $\alpha_2$ are parameters controlling the preset extension range of the hand; they can be chosen according to the actual situation, typically 1.4 and 2.6 respectively. This method copes with various hand shapes while guaranteeing a hand region of reasonable size.
The third step: scale the cropped image to m × m (e.g., 128 × 128) to obtain the input image for the optimization network, m being the input size the optimization network requires.
The fourth step: calculate the coordinates of each 2D key point on the input image by the following formula 4; these coordinates are the input key point information for the optimization network:
Equation 4: $\hat{p}_k = \dfrac{m}{s}\,\big(p_k - (x_0,\; y_0)\big)$
where $\hat{p}_k$ denotes the input key point for the optimization network, $p_k$ denotes the original rough 2D key point, and $(x_0, y_0)$ denotes the top-left corner of the cropped square.
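The four preprocessing steps can be sketched as follows, assuming OpenCV for the crop and resize; the black-padding strategy and the exact form of equation 4 are reconstructions consistent with the description, not the patent's reference code:

```python
import numpy as np
import cv2

def preprocess(img, kpts, alpha1=1.4, alpha2=2.6, m=128):
    """Equations 3-4: crop a black-padded square hand region and remap keypoints.

    kpts: (21, 2) coarse 2D keypoints in full-image pixel coordinates.
    Returns the m x m optimization-network input image and the remapped keypoints.
    """
    (x_min, y_min), (x_max, y_max) = kpts.min(axis=0), kpts.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    dis_0_9 = np.linalg.norm(kpts[0] - kpts[9])            # palm length in pixels
    s = max(w * alpha1, h * alpha1, dis_0_9 * alpha2)      # equation 3
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0  # bounding-box center
    x0, y0 = cx - s / 2.0, cy - s / 2.0                    # crop top-left corner
    # pad with pure black so the square may extend beyond the image border
    pad = int(np.ceil(s))
    padded = cv2.copyMakeBorder(img, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=0)
    xi, yi, si = int(round(x0)) + pad, int(round(y0)) + pad, int(round(s))
    crop = padded[yi:yi + si, xi:xi + si]
    inp_img = cv2.resize(crop, (m, m))
    inp_kpts = (kpts - np.array([x0, y0])) * (m / s)       # equation 4
    return inp_img, inp_kpts
```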
In some embodiments, fig. 6 is a flowchart of the optimization network according to an embodiment of the present application. As shown in fig. 6, acquiring the relative depth information of the first image and the optimized 2D key point coordinates through the optimization network includes:
The first step: obtain the m × m input image and preliminarily extract features to obtain a first initial feature map (Heatmap1) of size $m_2 \times m_2 \times c$, where $m_2$ may equal m or be 0.5/0.25 times m, and c denotes the number of channels, which may be 32.
The second step: obtain the input key point information (the 21 2D key points kpts) and, for each input key point, render an $m_2 \times m_2$ map with a Gaussian distribution, giving the $m_2 \times m_2 \times 21$ second initial feature map (Heatmap2). Fig. 7 is a schematic diagram of the second initial feature map according to an embodiment of the present application; as shown in fig. 7, the response is greater at the location corresponding to each key point. The other constants of the Gaussian distribution can be adjusted freely; they have no central influence on the inventive point of the application and are not described in detail.
The third step: concatenate Heatmap1 and Heatmap2 along the channel dimension to form the first fused feature map (Heatmap) of dimension $m_2 \times m_2 \times (c+21)$. It should be noted that this channel-level concatenation is what fuses the image information with the key point information.
The fourth step: extract features from the first fused feature map with the backbone network to obtain the second fused feature map (Heatmapf) of dimension $m_3 \times m_3 \times c_3$, where the backbone network may be Pelee, MobileNet, Hourglass, etc.
The fifth step: pass the second fused feature map (Heatmapf) through a convolution of the first preset dimension ($1 \times 1 \times c_3 \times 21$) to obtain the $m_3 \times m_3 \times 21$-dimensional first key point feature map $H^{kpt}$. Based on $H^{kpt}$, the kth optimized 2D key point coordinate is calculated by the following formula:
Equation 5: $\hat{p}_k = \sum_{i,j} (i,\; j)\cdot \operatorname{softmax}\big(H^{kpt}_k\big)_{i,j}$
i.e. the expectation of the cell coordinates under the normalized kth key point map (a soft-argmax). The coordinates calculated with equation 5 are more accurate and smoother than directly taking the maximum value as in the related art.
The sixth step: further extract features from the second fused feature map to obtain a second key point feature map of the same size, multiply the first key point feature map by the second key point feature map element-wise to obtain a new key point feature map, and average-pool the product to obtain the 1 × 21-dimensional relative depth information. The multiplication of the first key point feature map with the second key point feature map is similar to the "attention mechanism" trick and makes the optimization network focus more on the features at the 2D key point locations.
In some embodiments, considering that tracking instability may occur while the user uses the AR device, for example the palm moving out of the camera's field of view, the process of acquiring the relative depth information of the first image and the optimized 2D key point coordinates through the backbone network also needs to judge whether the palm is robust. The process specifically includes the following steps:
after global pooling, apply a convolution of the second preset dimension ($1 \times 1 \times c_3 \times 1$) to the second fused feature map (Heatmapf) to obtain its one-dimensional score;
judge whether the one-dimensional score is greater than a preset score threshold, e.g., 0.5; if so, the palm is considered robust and the subsequent logic continues to run; if not, the palm is considered lost, and the process jumps to step S202 to detect the palm key points again.
In some embodiments, after the optimized 2D key points are obtained, they are further converted into monocular 3D key points, where calculating the monocular 3D key point coordinates according to the palm size information, the relative depth information and the optimized 2D key point coordinates includes:
determining the size information of the palm needed for the 3D result. The palm size can be taken as the length between two key points on the palm, such as key points 0 and 9 in fig. 5, denoted $length_{palm}$. For the first palm frame acquired by the camera there is not yet a binocular 3D result stored in the system to initialize from, so the palm size is not calculated automatically; in this case a reasonable value can be manually specified as the palm size, for example 8.5 cm, and it is then updated automatically from subsequent palm images;
further, according to the size information, the optimized 2D key point coordinates and the intrinsic parameters of the first camera module, the absolute position D of the palm is calculated by the following formula 6 (a similar-triangles relation between the known palm length and its length on the normalized image plane):
Equation 6: $D = length_{palm} \Big/ \left\| \Big( \tfrac{x_0 - c_x}{k_x},\; \tfrac{y_0 - c_y}{k_y} \Big) - \Big( \tfrac{x_9 - c_x}{k_x},\; \tfrac{y_9 - c_y}{k_y} \Big) \right\|$
where $x_i, y_i$ are the optimized 2D coordinates of key point i, and $k_x, k_y, c_x, c_y$ are the intrinsic parameters of the first camera module;
finally, the 3D key point coordinates are calculated by the following formula 7 according to the absolute position, the relative depth information of the first image and the optimized 2D key point coordinates:
Equation 7: $Z_k = D + d_k, \qquad X_k = \dfrac{(x_k - c_x)\,Z_k}{k_x}, \qquad Y_k = \dfrac{(y_k - c_y)\,Z_k}{k_y}$
where $(X_k, Y_k, Z_k)$ are the 3D key point coordinates, D is the absolute position, $d_k$ is the relative depth information of the first image, and $k_x, k_y, c_x, c_y$ are the intrinsic parameters of the first camera module.
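The two calculations can be sketched as follows: lift_to_3d follows equation 7 directly, while palm_depth is one plausible reading of equation 6 (the similar-triangles estimate from the known palm length), since the original formula image is not reproduced here:

```python
import numpy as np

def palm_depth(kpts, length_palm, intrinsics):
    """Equation 6, sketched: known palm length between keypoints 0 and 9
    divided by their distance on the normalized image plane (an assumption)."""
    kx, ky, cx, cy = intrinsics
    n0 = np.array([(kpts[0, 0] - cx) / kx, (kpts[0, 1] - cy) / ky])
    n9 = np.array([(kpts[9, 0] - cx) / kx, (kpts[9, 1] - cy) / ky])
    return length_palm / np.linalg.norm(n0 - n9)         # absolute position D

def lift_to_3d(kpts, d, D, intrinsics):
    """Equation 7: back-project each optimized 2D keypoint with depth D + d_k."""
    kx, ky, cx, cy = intrinsics
    Z = D + d                                            # per-keypoint depth
    X = (kpts[:, 0] - cx) * Z / kx
    Y = (kpts[:, 1] - cy) * Z / ky
    return np.stack([X, Y, Z], axis=-1)                  # (21, 3) monocular result
```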
In some embodiments, after the monocular 3D key points are projected to the second image according to the external parameters between the first camera module and the second camera module, acquiring the relative depth information of the second image and the optimized 2D key point coordinates through the optimization network includes:
projecting the monocular 3D key point coordinates to a second image according to the external parameters of the first camera module and the second camera module; acquiring rough 2D key point coordinates in the second image, and performing a preprocessing step on the 2D key point coordinates to obtain an input image and input key point information for optimizing a network; acquiring an input image of a second image, and preliminarily extracting the characteristics of the input image to obtain a first initial characteristic diagram; based on the input key point information of the second image, rendering according to Gaussian distribution to obtain a second initial feature map, and splicing the second initial feature map and the first initial feature map at a channel layer to obtain a first fusion feature map; based on the first fusion feature map, performing feature extraction and convolution operation through a backbone network to obtain relative depth information of a second image and an optimized 2D key point coordinate;
it should be noted that the implementation process of "obtaining the relative depth information and the optimized keypoint coordinates of the second image through the optimized network" is similar to the implementation logic of "obtaining the relative depth information and the optimized keypoint information of the first image through the optimized network" in the foregoing embodiment, and a person skilled in the art can implement "obtaining the relative depth information and the optimized keypoint coordinates of the second image" with reference to the foregoing embodiment without creative work, and therefore, this step is not described again in this application.
In some embodiments, after the more accurate binocular 3D key point coordinates are acquired, the palm size information needs to be updated according to them. The $length_{palm}$ can be directly recalculated from the distances between the binocular 3D key point coordinates; in addition, when calculating $length_{palm}$, the palm size is heavily filtered according to a preset filtering rule, so that it stays smooth without affecting the motion of the hand.
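The patent does not fix the filtering rule; an exponential moving average is one minimal choice satisfying the "heavily filtered, stays smooth" requirement:

```python
def update_palm_size(prev_length, new_length, beta=0.95):
    """One possible 'preset filtering rule': an EMA that lets the stored palm
    size change only slowly, keeping it smooth across frames."""
    if prev_length is None:                  # first binocular measurement
        return new_length
    return beta * prev_length + (1 - beta) * new_length
```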
In some of these embodiments, initializing the previous frame image according to the binocular 3D keypoint coordinates or the monocular 3D keypoint coordinates comprises:
directly initializing a previous frame image according to the coordinates of the binocular 3D key points or the coordinates of the monocular 3D key points;
or predicting the change information of the next frame image according to the coordinates of the binocular 3D key points or the coordinates of the monocular 3D key points, and initializing the previous frame image in combination with the change information.
The change information may be a palm offset: when the previous frame image is initialized, the offset is added to the existing 3D key point coordinates as the prediction. Further, the change of the entire palm can also be predicted from parameters such as the rotation speed and translation speed of the palm, and the next frame image initialized according to that change.
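As a concrete instance, a constant-velocity prediction uses the difference between the last two results as the change information; this particular form is an assumption, not mandated by the text:

```python
def init_previous_frame(P_curr, P_before=None):
    """S208: direct initialization, or a constant-velocity prediction as one
    assumed form of the 'change information'."""
    if P_before is None:
        return P_curr                          # direct initialization
    return P_curr + (P_curr - P_before)        # add per-keypoint offset (velocity)
```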
In some of these embodiments, an electronic device is provided, which may be a terminal. The electronic device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a method of gesture 3D keypoint detection for an AR device. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
In an embodiment, fig. 8 is a schematic internal structure diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 8, there is provided an electronic device, which may be a server, and its internal structure diagram may be as shown in fig. 8. The electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor is used for providing calculation and control capabilities, the network interface is used for communicating with an external terminal through network connection, the internal memory is used for providing an environment for an operating system and running of a computer program, the computer program is executed by the processor to realize the gesture 3D key point detection method of the AR device, and the database is used for storing data.
Those skilled in the art will appreciate that the structure shown in fig. 8 is a block diagram of only a portion of the structure relevant to the present disclosure, and does not constitute a limitation on the electronic device to which the present disclosure may be applied, and that a particular electronic device may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A gesture 3D keypoint detection method of an AR device, the method comprising:
step 101: acquiring a first image and a second image through a binocular camera, wherein the first image and the second image are respectively captured by a first camera module and a second camera module of the binocular camera at the same moment;
step 102: inquiring whether a previous frame image containing palm information exists, if so, acquiring rough 2D key point information of a palm in the first image based on the previous frame image, and if not, acquiring rough 2D key point information of the palm in the first image through a key point detection network;
step 103: performing a preprocessing step on the rough 2D key point information to obtain an input image and input key point information for optimizing a network, and acquiring relative depth information of the first image and coordinates of the optimized 2D key point through the optimization network;
step 104: calculating monocular 3D key point coordinates of the first image according to the size information of the palm, the relative depth information of the first image and the optimized 2D key point coordinates;
step 105: judging whether the frame sequence of the first image is an integral multiple of a preset frame interval, if so, executing step 106, and if not, executing step 108;
step 106: projecting the monocular 3D key point coordinates to the second image, and acquiring relative depth information of the second image and optimized 2D key point coordinates through the optimization network;
step 107: calculating binocular 3D key point coordinates by a least square method based on a preset optimization problem, the relative depth information of the second image and the optimized 2D key point coordinates, and updating the size information of the palm according to the binocular 3D key point coordinates;
step 108: initializing the previous frame image according to the binocular 3D key point coordinates or the monocular 3D key point coordinates, and re-executing the steps 102 to 108.
2. The method of claim 1, wherein the obtaining rough 2D keypoint information of the palm in the first image by the keypoint detection network comprises:
extracting the characteristics of the first image by the backbone network to obtain a down-sampled characteristic diagram;
converting the characteristic diagram into an output layer with preset dimensionality;
and analyzing the output layer, and acquiring the left-right hand information and the rough 2D key point information in the first image according to the confidence coefficient.
3. The method of claim 2, wherein the obtaining relative depth information of the first image and optimizing 2D keypoint coordinates through an optimization network comprises:
acquiring the input image, and preliminarily extracting the characteristics of the input image to obtain a first initial characteristic diagram;
acquiring the input key point information, rendering according to Gaussian distribution based on the input key point information to obtain a second initial feature map, and splicing the second initial feature map and the first initial feature map at a channel layer to obtain a first fusion feature map;
and based on the first fusion feature map, performing feature extraction and convolution operation through a backbone network to obtain relative depth information of the first image and an optimized 2D key point coordinate.
4. The method of claim 3, wherein the obtaining the relative depth information and the optimized 2D keypoint coordinates of the first image by performing feature extraction and convolution operations through a backbone network comprises:
the backbone network extracts the features in the first fusion feature map to obtain a second fusion feature map;
the second fusion feature map is subjected to convolution processing of a first preset dimension to obtain a first key point feature map, and the optimized 2D key point coordinate of the first image is calculated according to the first key point feature map;
further extracting features of the second fused feature map to obtain a second key point feature map, wherein the second key point feature map has the same size as the first key point feature map;
multiplying the first keypoint feature map by the second keypoint feature map, the result of the multiplication being average-pooled to obtain the relative depth information of the first image, and,
adding convolution of a second preset dimension to the second fusion feature map to obtain a one-dimensional score of the second fusion feature map;
and judging whether the one-dimensional score is greater than a preset score threshold; if so, the palm is considered robust and calculation of the optimized 2D key point coordinates continues; otherwise, the palm is considered lost and step 102 is executed.
5. The method of claim 1, wherein projecting the monocular 3D keypoint coordinates to the second image and obtaining relative depth information and optimized 2D keypoint coordinates of the second image via the optimization network comprises:
projecting the monocular 3D key point coordinates to the second image according to the external parameters of the first camera module and the second camera module;
acquiring rough 2D key point coordinates in the second image, and performing a preprocessing step on the 2D key point coordinates to obtain an input image and input key point information for the optimized network;
acquiring an input image of the second image, and preliminarily extracting the characteristics of the input image to obtain a first initial characteristic diagram;
rendering according to Gaussian distribution to obtain a second initial feature map based on the input key point information of the second image, and splicing the second initial feature map and the first initial feature map at a channel layer to obtain a first fusion feature map;
and based on the first fusion feature map, performing feature extraction and convolution operation through a backbone network to obtain relative depth information of the second image and an optimized 2D key point coordinate.
6. The method of claim 1, wherein performing the preprocessing step on the coarse 2D key point information to obtain the input image and the input key point information for the optimization network comprises:
calculating a key point bounding box from the coarse 2D key point information;
determining the side length of a square according to the size of the key point bounding box and the pixel distances between the key points;
cropping a square region according to the side length of the square and the center of the key point bounding box to obtain a cropped image;
scaling the cropped image to a preset size to obtain the input image for the optimization network;
and calculating the coordinates of the key points within the cropped image to obtain the input key point information for the optimization network.
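A sketch of the preprocessing in claim 6, with a hypothetical margin factor and output size (the claim fixes neither) and no image-border handling:

```python
import numpy as np
import cv2  # assumed available for cropping/scaling

def preprocess(image, kps, out_size=128, margin=1.5):
    """Square crop around the key point bounding box, then scale to a preset size."""
    x0, y0 = kps.min(axis=0)
    x1, y1 = kps.max(axis=0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0      # bounding-box center
    side = int(margin * max(x1 - x0, y1 - y0))     # square side from box size
    x_lo, y_lo = int(cx) - side // 2, int(cy) - side // 2
    crop = image[y_lo:y_lo + side, x_lo:x_lo + side]
    input_image = cv2.resize(crop, (out_size, out_size))
    input_kps = (kps - [x_lo, y_lo]) * (out_size / side)  # key points in crop coords
    return input_image, input_kps
```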
7. The method of claim 1, wherein calculating the 3D key point coordinates of the first image from the size information, the relative depth information of the first image, and the optimized 2D key point coordinates comprises:
calculating the absolute position of the palm according to the size information, the optimized 2D key point coordinates, and the intrinsic parameter information of the first camera module;
and calculating the monocular 3D key point coordinates from the absolute position, the relative depth information of the first image, and the optimized 2D key point coordinates by the following formula:
$$
\begin{cases}
X_i = \dfrac{(u_i - c_x)\,(D + d_i)}{k_x} \\
Y_i = \dfrac{(v_i - c_y)\,(D + d_i)}{k_y} \\
Z_i = D + d_i
\end{cases}
$$

wherein $(X_i, Y_i, Z_i)$ are the 3D key point coordinates, $(u_i, v_i)$ are the optimized 2D key point coordinates, $D$ is the absolute position, $d_i$ is the relative depth information of the first image, and $k_x$, $k_y$, $c_x$, $c_y$ are respectively intrinsic parameters of the first camera module.
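The formula translates directly into code; this sketch assumes 2D coordinates uv of shape (K, 2) and per-keypoint relative depths d of shape (K,):

```python
import numpy as np

def backproject(uv, D, d, k_x, k_y, c_x, c_y):
    """Monocular 3D key points from optimized 2D coordinates under a pinhole model."""
    z = D + d                            # per-keypoint absolute depth
    x = (uv[:, 0] - c_x) * z / k_x
    y = (uv[:, 1] - c_y) * z / k_y
    return np.stack([x, y, z], axis=1)   # (K, 3)
```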
8. The method of claim 1, wherein initializing the next frame image according to the binocular 3D key point coordinates or the monocular 3D key point coordinates comprises:
directly initializing the next frame image according to the binocular 3D key point coordinates or the monocular 3D key point coordinates; or,
predicting change information for the next frame image from the binocular 3D key point coordinates or the monocular 3D key point coordinates, and initializing the next frame image by combining the change information.
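One way to realize the prediction variant of claim 8 is a constant-velocity extrapolation (an assumption; the claim does not fix the prediction model):

```python
def init_next_frame(prev_kps3d, curr_kps3d):
    """Constant-velocity prediction used to initialize the next frame."""
    return curr_kps3d + (curr_kps3d - prev_kps3d)  # extrapolated 3D key points
```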
9. The method of claim 1, wherein updating the size information of the palm according to the binocular 3D key point coordinates comprises:
calculating the size information of the palm from the binocular 3D key point coordinates, and filtering the size information based on a preset filtering rule to update the size information of the palm.
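One plausible "preset filtering rule" for claim 9 is an exponential moving average over per-frame size measurements (hypothetical; the claim leaves the rule open):

```python
def update_palm_size(size_est, measurement, alpha=0.1):
    """Blend the running palm-size estimate with the new binocular measurement."""
    return (1.0 - alpha) * size_est + alpha * measurement
```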
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the gesture 3D key point detection method of an AR device according to any one of claims 1 to 9.
CN202111218181.1A 2021-10-19 2021-10-19 Gesture 3D key point detection method of AR device and electronic device Pending CN114066814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111218181.1A CN114066814A (en) 2021-10-19 2021-10-19 Gesture 3D key point detection method of AR device and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111218181.1A CN114066814A (en) 2021-10-19 2021-10-19 Gesture 3D key point detection method of AR device and electronic device

Publications (1)

Publication Number Publication Date
CN114066814A true CN114066814A (en) 2022-02-18

Family

ID=80234940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111218181.1A Pending CN114066814A (en) 2021-10-19 2021-10-19 Gesture 3D key point detection method of AR device and electronic device

Country Status (1)

Country Link
CN (1) CN114066814A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095131A (en) * 2023-10-16 2023-11-21 广州紫为云科技有限公司 Three-dimensional reconstruction method, equipment and storage medium for object motion key points
CN117095131B (en) * 2023-10-16 2024-02-06 广州紫为云科技有限公司 Three-dimensional reconstruction method, equipment and storage medium for object motion key points
CN117420917A (en) * 2023-12-19 2024-01-19 烟台大学 Virtual reality control method, system, equipment and medium based on hand skeleton
CN117420917B (en) * 2023-12-19 2024-03-08 烟台大学 Virtual reality control method, system, equipment and medium based on hand skeleton

Similar Documents

Publication Publication Date Title
KR102592270B1 (en) Facial landmark detection method and apparatus, computer device, and storage medium
US20140248950A1 (en) System and method of interaction for mobile devices
US10311295B2 (en) Heuristic finger detection method based on depth image
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
US20200410723A1 (en) Image Synthesis Method And Apparatus
US11043027B2 (en) Three-dimensional graphics image processing
US20160366326A1 (en) Determination of exposure time for an image frame
CN114066814A (en) Gesture 3D key point detection method of AR device and electronic device
CN113407027B (en) Pose acquisition method and device, electronic equipment and storage medium
CN112135041B (en) Method and device for processing special effect of human face and storage medium
CN110069125B (en) Virtual object control method and device
CN111833403A (en) Method and apparatus for spatial localization
CN114511661A (en) Image rendering method and device, electronic equipment and storage medium
CN115439543A Method for determining hole position and method for generating three-dimensional model in metaverse
KR20190078890A (en) Method and apparatus for estimating plane based on grids
CN113240656B (en) Visual positioning method and related device and equipment
CN108027647B (en) Method and apparatus for interacting with virtual objects
CN112085842B (en) Depth value determining method and device, electronic equipment and storage medium
US10861174B2 (en) Selective 3D registration
CN113706543B (en) Three-dimensional pose construction method, three-dimensional pose construction equipment and storage medium
CN115546515A (en) Depth information acquisition method and device
KR101943097B1 (en) Method and system for image matching using virtual feature point
CN110941327A (en) Virtual object display method and device
CN108520259A (en) A kind of extracting method of foreground target, device, equipment and storage medium
CN114693515A (en) Image deformation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination