CN116631011B - Hand gesture estimation method and electronic equipment - Google Patents


Info

Publication number
CN116631011B
CN116631011B (application number CN202310909281.1A)
Authority
CN
China
Prior art keywords
image
hand
target
mask
target hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310909281.1A
Other languages
Chinese (zh)
Other versions
CN116631011A (en)
Inventor
杨栋权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202310909281.1A priority Critical patent/CN116631011B/en
Publication of CN116631011A publication Critical patent/CN116631011A/en
Application granted granted Critical
Publication of CN116631011B publication Critical patent/CN116631011B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G06V 40/11 Hand-related biometrics; Hand pose recognition
    • G06V 40/113 Recognition of static hand signs
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a hand gesture estimation method and an electronic device, and relates to the field of terminals. The method includes the following steps: the electronic device acquires an image 1 including the user's hand through a camera and determines one or more mask images based on image 1: mask image 2, mask image 3, and mask image 4. Next, the electronic device may determine a first portion to be removed from the mask image of the non-target hand (e.g., mask image 3 or mask image 4), and determine a second portion to be completed from mask image 2 of the target hand. The electronic device may remove the first portion from image 1 and complete the second portion, obtaining an image 2 that includes the complete target hand. The electronic device then determines the user's hand pose from image 2.

Description

Hand gesture estimation method and electronic equipment
Technical Field
The present application relates to the field of terminals, and in particular, to a hand gesture estimation method and an electronic device.
Background
With the development of computer technology, Augmented Reality (AR) technology and Virtual Reality (VR) technology are increasingly used in daily life. In AR/VR technology, an electronic device may estimate the hand pose of a user for gesture interaction with the electronic device. However, when the user's hands are occluded, the hand pose estimation result of the electronic device is not accurate enough and the error is large, so that the user cannot smoothly perform gesture interaction with the electronic device.
Disclosure of Invention
The application provides a hand gesture estimation method and an electronic device, which enable the electronic device to accurately estimate the hand pose of a user in a scene in which the hand is occluded and to reduce the error of the hand pose estimation result, so that the user can smoothly perform gesture interaction with the electronic device.
In a first aspect, the present application provides a hand gesture estimation method, including: acquiring a first image, the first image including a hand of a user. One or more mask images are determined based on the first image, the one or more mask images including a first mask image, a second mask image, and a third mask image. The first mask image is used to represent the visible area of the target hand in the first image, the second mask image is used to represent the visible area of the hand that occludes the target hand in the first image, and the third mask image is used to represent the visible area of the hand that is occluded by the target hand in the first image. A first occlusion relationship is determined based on the one or more mask images. The first occlusion relationship is used to represent the occlusion relationship between the target hand and the non-target hand. Based on the first occlusion relationship, a second image is generated from the first image, the second image including the complete target hand. The hand gesture of the user is then determined from the second image.
In one possible implementation manner, the determining one or more mask images based on the first image specifically includes: a fourth mask image and a first feature image of the first image are generated based on the first image through a first segmentation network. Wherein the fourth mask image is used to represent the region of the user's hand contour on the first image. And inputting the fourth mask image and the first characteristic image into a second segmentation network, and adjusting the weight of the hand outline in the first characteristic image in the second segmentation network to be larger than the weight of the non-hand outline in the second segmentation network based on the fourth mask image. The one or more mask images are generated by the second segmentation network based on the weight of the hand contour in the second segmentation network and the weight of the non-hand contour in the second segmentation network.
In one possible implementation manner, the generating a second image from the first image based on the first occlusion relationship specifically includes: when the first occlusion relationship is that the non-target hand occludes the target hand, removing part or all of the visible area of the non-target hand from the first image and completing the occluded area of the target hand to generate the second image; and when the first occlusion relationship is that the target hand occludes the non-target hand, removing part or all of the visible area of the non-target hand from the first image to generate the second image.
In one possible implementation, completing the occluded area of the target hand specifically includes: determining filling information from the portion of the target hand that is not occluded, and filling the filling information into the occluded area of the target hand.
In one possible implementation, the padding information includes one or more of the following: color and texture features of the portion of the target hand that is not obscured.
In one possible implementation manner, the determining the first occlusion relationship based on the one or more mask images specifically includes: when the first mask image shows a visible region of the target hand in the first image and the second mask image shows a visible region of a hand that occludes the target hand in the first image, the first occlusion relationship is that the non-target hand occludes the target hand; when the first mask image shows a visible region of the target hand in the first image and the third mask image shows a visible region of a hand that is occluded by the target hand in the first image, the first occlusion relationship is that the target hand occludes the non-target hand.
In one possible implementation manner, determining the hand gesture of the user from the second image specifically includes: extracting one or more hand joint points from the second image, determining the coordinates of the one or more hand joint points in 3D space, and determining the hand pose of the user based on the coordinates of the one or more hand joint points in 3D space.
In a second aspect, the present application provides an electronic device comprising: one or more processors and one or more memories. The one or more memories are coupled to the one or more processors, the one or more memories being configured to store a computer executable program that, when executed by the one or more processors, causes the electronic device to perform the method of any of the possible implementations of the first aspect.
In a third aspect, the present application provides a chip system comprising processing circuitry and interface circuitry for receiving code instructions and transmitting to the processing circuitry, the processing circuitry being operable to execute the code instructions to cause the chip system to perform the method of any one of the possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium comprising a computer executable program which, when run on an electronic device, causes the electronic device to perform the method of any one of the possible implementations of the first aspect.
Drawings
FIG. 1A is a specific flow chart of a hand gesture estimation method according to an embodiment of the present application;
FIG. 1B is a schematic diagram of a mask image according to an embodiment of the present application;
FIG. 2A is a schematic diagram of a hand segmentation module according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a hand de-occlusion module according to an embodiment of the present application;
FIG. 2C is a schematic diagram of a hand gesture estimation module according to an embodiment of the present application;
FIG. 3 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure is meant to encompass any or all possible combinations of one or more of the listed items. In the embodiments of the present application, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the embodiments of the application, unless otherwise indicated, "a plurality" means two or more.
In Augmented Reality (AR) technology and Virtual Reality (VR) technology in the computer field, real-time hand pose estimation is an important step in gesture interaction between a user and an electronic device. In one implementation, the electronic device may acquire an RGB image including the user's hand via a camera and then segment the user's hand region (which may also be referred to as the hand contour) from this RGB image using a neural network. The electronic device can estimate the pose of the user's hand based on the segmented hand region, thereby realizing gesture interaction between the user and the electronic device. An RGB image refers to a color image that displays an object by superposing three components (which may also be referred to as channels): a red (R) component, a green (G) component, and a blue (B) component.
However, in a scene in which the hands are occluded, this implementation cannot accurately estimate the user's hand pose, because the left and right hands occlude each other, the textures of the two hands are similar, and the features of corresponding joints of the left and right hands are similar. As a result, the error of the hand pose estimation result is large and the user cannot perform gesture interaction with the electronic device.
Accordingly, the present application provides a hand pose estimation method, which may include: the electronic device may acquire an image 1 through the camera, the image 1 including a user's hand. The electronic device determines one or more mask images based on image 1, the one or more mask images including mask image 2, mask image 3, and mask image 4, wherein mask image 2 is used to represent the visible area of the target hand in image 1, mask image 3 is used to represent the visible area of the hand that occludes the target hand in image 1, and mask image 4 is used to represent the visible area of the hand that is occluded by the target hand in image 1. The electronic device may then determine a first occlusion relationship based on the one or more mask images. The first occlusion relationship is used to represent the occlusion relationship between the target hand and the non-target hand. The electronic device may then generate an image 2 from image 1 based on the first occlusion relationship, the image 2 including the complete target hand. The electronic device determines the hand pose of the user from image 2.
Mask image 2, mask image 3, and mask image 4 may be used to represent the occlusion relationship between the target hand and the non-target hand. For example: a) when mask image 2 shows a visible region of the target hand in image 1, mask image 3 does not show a visible region of any hand that occludes the target hand in image 1, and mask image 4 shows a visible region of a hand that is occluded by the target hand in image 1, that is, there is a hand that is occluded by the target hand and there is no hand that occludes the target hand, the occlusion relationship between the target hand and the non-target hand is: the target hand occludes the non-target hand; b) when mask image 2 shows a visible region of the target hand in image 1, mask image 3 shows a visible region of a hand that occludes the target hand in image 1, and mask image 4 does not show a visible region of any hand that is occluded by the target hand in image 1, that is, there is a hand that occludes the target hand and there is no hand that is occluded by the target hand, the occlusion relationship between the target hand and the non-target hand is: the non-target hand occludes the target hand.
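The a)/b) logic above reduces to checking which of mask image 3 and mask image 4 contains any 1-value pixels. The following is a minimal sketch of that check, assuming NumPy arrays with the 1-value/0-value convention described later in this application; the function and variable names are illustrative only and are not part of the claimed method.

```python
import numpy as np

def determine_occlusion_relationship(mask3: np.ndarray, mask4: np.ndarray) -> str:
    """mask3: visible area of the hand that occludes the target hand (binary).
    mask4: visible area of the hand that is occluded by the target hand (binary)."""
    occluder_present = bool(mask3.any())   # a hand occludes the target hand
    occludee_present = bool(mask4.any())   # a hand is occluded by the target hand

    if occludee_present and not occluder_present:
        return "target hand occludes non-target hand"    # case a)
    if occluder_present and not occludee_present:
        return "non-target hand occludes target hand"    # case b)
    return "no occlusion between the two hands"
```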
In one possible implementation, the electronic device may also not generate mask image 3 when the target hand is occluding the non-target hand, only mask image 2 and mask image 4 are present; when the non-target hand is covering the target hand, the electronic device may not generate the mask image 4, and only the mask image 2 and the mask image 3 may be present.
The target hand may be either the left hand or the right hand, and the non-target hand may be either the right hand or the left hand; the target hand and the non-target hand are not the same hand. When the left hand of the user is the target hand, the right hand of the user is the non-target hand; when the right hand of the user is the target hand, the left hand of the user is the non-target hand. The first portion may be part or all of the visible area of the non-target hand.
The image 1 and the image 2 can be RGB images or other types of images, and the application is not limited; the feature image of the image 1 refers to an image having one or more features of color features, texture features, shape features, spatial relationship features, and the like of the image 1.
Mask image 2, mask image 3, mask image 4, and mask image 1 in the subsequent embodiments may all be of the mask image (MASK) type, where a mask image can highlight its corresponding target area. For example, the target area of mask image 1 is the visible area of the user's hand contour in image 1, and mask image 1 may highlight that target area; the target area of mask image 2 is the visible area of the target hand in image 1, and mask image 2 may highlight that target area. Mask image 3, mask image 4, and other mask images can be understood by analogy.
Specifically, a mask image is a binary image composed of 0 and/or 1 values. In the mask image, the 1-value region is the target region, which can be used for subsequent image processing, and the 0-value region is the non-target region, which is not used for subsequent image processing. For example, in mask image 1 the user's hand region has the value 1 and the other regions have the value 0; in mask image 2 the target hand region has the value 1 and the regions other than the target hand region have the value 0; in mask image 3 the region of the hand that occludes the target hand has the value 1 and the other regions have the value 0; and in mask image 4 the region of the hand that is occluded by the target hand has the value 1 and the other regions have the value 0. In other implementations, the mask image may instead use the 0-value region as the target region and the 1-value region as the non-target region, which is not limited in the present application. The following embodiments are described taking the 1-value region as the target region as an example.
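A minimal sketch of this 1-value/0-value convention, assuming NumPy arrays whose shapes and contents are purely illustrative:

```python
import numpy as np

# Hypothetical 4x4 RGB image 1 and mask image 2 (1 = target hand, 0 = other regions).
image1 = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
mask2 = np.array([[0, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0]], dtype=np.uint8)

# Subsequent image processing only uses the 1-value (target) region:
# multiplying keeps the target-hand pixels of image 1 and zeroes out the rest.
target_hand_pixels = image1 * mask2[..., None]
```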
According to the hand gesture estimation method provided by the application, the electronic equipment can accurately estimate the hand gesture of the user in the scene of the blocked hand, and the error of the hand gesture estimation result is reduced, so that the user can smoothly perform gesture interaction with the electronic equipment.
Fig. 1A illustrates a specific flow of a hand gesture estimation method according to an embodiment of the present application.
As shown in fig. 1A, the specific flow of the method may include:
stage one: the electronic device determines one or more mask images based on image 1.
S101: the electronic device captures an image 1 comprising the user's hand via the camera.
In the embodiment of the application, the electronic device may be a wearable device (for example, a smart watch, a smart bracelet, a head-mounted device, etc.), a mobile phone, a tablet computer, a PC, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc. The application does not limit the particular type of the electronic device.
In the embodiment of the present application, the image 1 may be an RGB image, or may be another type of image, and the present application is not limited to the type of the image 1.
S102: the electronic device generates a mask image 1 and a feature image of the image 1 based on the image 1 by a hand segmentation module.
Specifically, the electronic device may generate the mask image 1 and the feature image of the image 1 based on the image 1 through the first segmentation network in the hand segmentation module. Mask image 1 may be used to represent the visible area of the entire hand contour (including both target and non-target hands) on image 1.
For the description of the mask image (MASK), reference may be made to the description in the foregoing embodiments; details are not repeated herein. Mask image 1 is a binary image composed of 0 and/or 1 values: the user's hand area may be the area composed of 1 values, and the other areas outside the user's hand area may be composed of 0 values.
S103: the electronic device obtains one or more mask images from mask image 1 and the feature image of image 1 by using the hand segmentation module: mask image 2, mask image 3, and mask image 4.
For the description of the mask image 2, the mask image 3, and the mask image 4, reference is made to the foregoing description.
Specifically, the electronic device may input the mask image 1 and the feature image of the image 1 into the second segmentation network in the hand segmentation module, and adjust the weight of the hand contour in the feature image of the image 1 in the second segmentation network to be greater than the weight of the non-hand contour in the second segmentation network based on the mask image 1. The electronic device may generate one or more mask images over the second segmentation network based on the weights of the hand contours in the second segmentation network and the weights of the non-hand contours in the second segmentation network.
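One way such a weight adjustment could be realized is to scale feature locations inside the hand contour by a factor larger than 1 before the second segmentation network processes them. The sketch below uses PyTorch tensors and an illustrative scaling factor alpha; it is only an assumption about how the weighting might look, not the network defined by the application.

```python
import torch

def reweight_features(feature_image: torch.Tensor,
                      mask_image1: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """feature_image: (N, C, H, W) feature image of image 1.
    mask_image1:   (N, 1, H, W) binary mask, 1 inside the hand contour.
    Locations inside the hand contour are scaled by (1 + alpha), so their
    weight in the second segmentation network exceeds that of non-hand locations."""
    return feature_image * (1.0 + alpha * mask_image1)

# The reweighted feature image is then fed into the second segmentation network,
# which outputs mask image 2, mask image 3, and mask image 4.
```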
Specifically, the following implementations may be used for how the electronic device determines the target hand:
in one possible implementation, the electronic device may preset the target hand to be the user's left or right hand. When the electronic device acquires the image 1, the left hand or the right hand of the user can be detected from the image 1, and the left hand or the right hand of the user is determined as a target hand.
In another possible implementation manner, the electronic device may pre-store the mapping relationship between the application scenario and the target hand. For example, the target hand corresponding to the application scene a is the left hand, the target hand corresponding to the application scene B is the right hand, the target hand corresponding to the application scene C is the right hand, and so on. When the electronic device detects that the application scene in the image 1 is the application scene a, the electronic device can determine that the target hand is the left hand. The electronic device may detect the left hand of the user from the image 1 and determine the left hand of the user as the target hand.
In another possible implementation, the electronic device may determine the target hand based on the history. For example, if the electronic device detects that the user uses the left hand to perform gesture interaction with the electronic device in the previous frame or the N frame image, the electronic device may determine the left hand as the target hand in the current frame image; if the electronic equipment detects that the user uses the right hand to perform gesture interaction with the electronic equipment in the previous frame or the N frames of images, the electronic equipment can determine the right hand as a target hand in the current frame of images. Wherein N may be greater than or equal to 2.
The above-described exemplary implementations are merely illustrative of the present application, and in particular implementations, the electronic device may determine the target hand in other ways, and the present application is not limited in this regard.
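For illustration only, the three strategies above (history, scene-to-hand mapping, preset hand) could be combined as follows; the scene labels, the priority order, and the history handling are assumptions rather than details given by the application.

```python
from collections import deque
from typing import Optional

SCENE_TO_HAND = {"scene_A": "left", "scene_B": "right", "scene_C": "right"}  # hypothetical mapping

def choose_target_hand(detected_hands: set,
                       scene: Optional[str] = None,
                       history: Optional[deque] = None,
                       preset: Optional[str] = "left") -> Optional[str]:
    """Return 'left' or 'right' as the target hand, or None if it cannot be decided."""
    # 1) History: reuse the hand the user interacted with in the previous frame(s).
    if history and history[-1] in detected_hands:
        return history[-1]
    # 2) Mapping between the application scene and the target hand.
    if scene is not None and SCENE_TO_HAND.get(scene) in detected_hands:
        return SCENE_TO_HAND[scene]
    # 3) Preset target hand.
    if preset in detected_hands:
        return preset
    return None
```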
To better illustrate the occlusion relationship of the hands, several examples are given below. For example, if the user's left hand is occluded by the user's right hand and the left hand is the target hand, the value of the user's left hand region in mask image 2 is 1, the value of the user's right hand region in mask image 3 is 1, and mask image 4 is an image composed entirely of 0 values (i.e., no hand region is occluded by the left hand). For another example, if the user's right hand occludes the user's left hand and the right hand is the target hand, the value of the user's right hand region in mask image 2 is 1, mask image 3 is an image composed entirely of 0 values (i.e., no hand region occludes the right hand), and the value of the user's left hand region in mask image 4 is 1.
In one possible implementation, the electronic device may also generate one or more of the following mask images based on image 1 using the hand segmentation module: mask image 2, mask image 3, and mask image 4. That is, in such an implementation, the electronic device does not generate mask image 1, and does not need to generate one or more mask images as described above using mask image 1.
Stage two: the electronic device generates an image 2 from the image 1 based on the first occlusion relation.
Specifically, when the first occlusion relationship is that the non-target hand occludes the target hand, removing part or all of the visible region of the non-target hand from the image 1, and complementing the occluded region of the target hand to generate an image 2. When the first occlusion relationship is that the target hand occludes the non-target hand, removing part or all of the visible region of the non-target hand from the image 1 to generate an image 2. The specific implementation manner can be as follows:
s104: the electronic device determines a first portion to be removed from a mask image (e.g., mask image 3 or mask image 4) of a non-target hand using the hand de-occlusion module.
Part or all of the 1-value area in the mask image of the non-target hand is the first portion to be removed. The first portion may be part or all of the visible area of the non-target hand.
By way of example, fig. 1B shows a mask image of a non-target hand in which the white area is the 1-value area (i.e., the area of the non-target hand) and the black area is the 0-value area (i.e., the area other than the non-target hand). The electronic device may determine the 1-value region in the mask image of the non-target hand as the first portion to be removed.
S105: the electronic equipment utilizes the hand de-shielding module to determine a second part needing to be complemented from the mask image 2.
The second part is the area where the target hand is shielded. In the embodiment of the present application, this step is optional, that is, when the target hand is blocked, the electronic device may determine the second portion to be completed from the mask image 2. If the target hand is not occluded, the electronic device may not perform this step.
S106: the electronic device may use the hand de-occlusion module to remove the first portion from the image 1 and to complement the second portion to determine the image 2 including the complete target hand.
Completing the second portion may include: the electronic device determines filling information from the portion of the target hand that is not occluded, and then fills the filling information into the occluded region of the target hand. The filling information includes one or more of the following: color features and texture features of the portion (also called region) of the target hand that is not occluded.
The image 2 may be an RGB image or another type of image, and the present application is not limited to the type of the image 2. In the embodiment of the present application, if the target hand is not blocked, that is, the target hand does not have the second portion to be complemented, the electronic device may not perform the operation of determining and complementing the second portion described in the embodiment, that is, the electronic device may remove the first portion from the image 1, and determine the image 2 including the complete target hand.
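To make the data flow of stage two concrete, the sketch below removes the first portion and fills the second portion per pixel, using the mean color of the visible target hand as a stand-in for the color/texture filling information. In the actual method this is done by the learned hand de-occlusion module described later, so this is only a simplified illustration.

```python
import numpy as np

def generate_image2(image1: np.ndarray,
                    non_target_mask: np.ndarray,  # mask image 3 or mask image 4 (first portion)
                    mask2: np.ndarray,            # visible area of the target hand
                    occluded_mask: np.ndarray     # occluded area of the target hand (second portion)
                    ) -> np.ndarray:
    image2 = image1.copy()

    # Remove the first portion: pixels of the visible non-target hand.
    image2[non_target_mask.astype(bool)] = 0

    # Complete the second portion: fill the occluded target-hand area with
    # filling information derived from the unoccluded part of the target hand.
    if occluded_mask.any():
        fill_color = image1[mask2.astype(bool)].mean(axis=0)
        image2[occluded_mask.astype(bool)] = fill_color
    return image2
```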
Stage three: the electronic device determines the hand pose of the user based on the image 2.
S107: the electronic device may determine the hand pose of the user from the image 2 using the hand pose estimation module.
In one implementation, the electronic device may take image 2 as input to the hand pose estimation module, extract a set of predefined hand joint points from image 2 through this module, and determine the coordinates of one or more hand joint points in 3D space. The electronic device may determine the hand pose of the user based on the coordinates of the one or more hand joint points in 3D space. The hand joint points may be selected with reference to the actual joints of the hand, and the number of hand joint points may be 14, 16, 21, or the like, which is not limited herein. In the embodiment of the present application, the electronic device may also determine the hand pose of the user based on image 2 in other manners, which is not limited by the present application.
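As one illustration of how a hand pose could be derived from 3D joint coordinates, the sketch below computes per-finger bend angles from 21 joints; the 21-joint ordering (wrist first, then four joints per finger) and the angle-based representation are assumptions, not the specific procedure of this application.

```python
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle (in degrees) at joint b formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def finger_bend_angles(joints_3d: np.ndarray) -> dict:
    """joints_3d: (21, 3) joint coordinates in 3D space, assumed ordering:
    index 0 = wrist, then four joints per finger (thumb, index, middle, ring, little)."""
    fingers = {"thumb": (1, 2, 3), "index": (5, 6, 7), "middle": (9, 10, 11),
               "ring": (13, 14, 15), "little": (17, 18, 19)}
    return {name: joint_angle(joints_3d[i], joints_3d[j], joints_3d[k])
            for name, (i, j, k) in fingers.items()}
```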
Next, the "hand segmentation module", "hand de-occlusion module", and "hand pose estimation module" in the embodiment shown in fig. 1A will be further described.
A) Hand segmentation module
For example, as shown in fig. 2A, the hand segmentation module may be composed of one or more of the following neural networks: a backbone network (including Backbone network 1 and Backbone network 2), a head network (including Head network 1 and Head network 2), and a box layer network (including Box layer network 1 and Box layer network 2), wherein:
The input of the Backbone network 1 is the image 1 including the user's hand; the Backbone network 1 and the Head network 1 can be used to extract feature information of image 1 and to generate the feature image of image 1 based on that feature information. The input of the Box layer network 1 is the feature image of image 1, and its output is mask image 1. Then, the electronic device may take mask image 1 and the feature image of image 1 as inputs of the Head network 2. When the Head network 2 segments the user's hand region again based on the feature image of image 1, the weight of the hand region highlighted in mask image 1 may be set higher than the weight of the other regions in the feature image of image 1, so that the hand segmentation module pays more attention to the hand region highlighted in mask image 1 during segmentation, which improves the segmentation accuracy of the user's hand region. The electronic device outputs one or more mask images based on the Head network 2 and the Box layer network 2: mask image 2, mask image 3, and mask image 4, the descriptions of which can be found in the foregoing description.
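The following PyTorch sketch shows one way the wiring of fig. 2A could look: Backbone 1 and Head 1 produce the feature image, Box layer 1 predicts mask image 1, and Head 2 plus Box layer 2 predict mask images 2/3/4 from the reweighted features. The layer shapes and channel counts are placeholders; the application does not disclose the networks at this level of detail.

```python
import torch
import torch.nn as nn

class HandSegmentationModule(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # First segmentation network: Backbone 1 + Head 1 produce the feature image,
        # Box layer 1 produces mask image 1 (whole hand contour).
        self.backbone1 = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.head1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.box1 = nn.Conv2d(channels, 1, 1)
        # Second segmentation network: Head 2 + Box layer 2 produce mask images 2/3/4.
        self.head2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.box2 = nn.Conv2d(channels, 3, 1)   # three output masks

    def forward(self, image1: torch.Tensor):
        feat = self.head1(self.backbone1(image1))    # feature image of image 1
        mask1 = torch.sigmoid(self.box1(feat))       # mask image 1
        # Give hand-contour locations a larger weight than non-hand locations.
        weighted = feat * (1.0 + mask1)
        masks_234 = torch.sigmoid(self.box2(self.head2(weighted)))
        return mask1, masks_234                       # (N,1,H,W), (N,3,H,W)
```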
During the training process, the Box layer network 2 may set three labels: a first label corresponding to the visible area of the target hand in image 1, a second label corresponding to the visible area of the hand that occludes the target hand in image 1, and a third label corresponding to the visible area of the hand that is occluded by the target hand in image 1. During the training phase, the Box layer network 2 may compare the differences between the test outputs of the three labels and the actual data and adjust the weights until weights are determined that minimize the differences between the test outputs of the three labels and the actual data. Wherein:
The test output of the first label refers to the visible area of the target hand in image 1 output by the Box layer network 2, and the corresponding actual data refers to the actual visible area of the target hand in image 1; the test output of the second label refers to the visible area of the hand that occludes the target hand in image 1 output by the Box layer network 2, and the corresponding actual data refers to the visible area of the hand that actually occludes the target hand in image 1; the test output of the third label refers to the visible area of the hand that is occluded by the target hand in image 1 output by the Box layer network 2, and the corresponding actual data refers to the visible area of the hand actually occluded by the target hand in image 1.
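One standard way to "compare the differences between the test outputs of the three labels and the actual data" is a per-pixel binary cross-entropy loss over the three predicted masks, as sketched below; the choice of loss function is an assumption, since the application does not name one.

```python
import torch
import torch.nn.functional as F

def three_label_loss(pred_masks: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """pred_masks: (N, 3, H, W) raw logits for mask images 2, 3 and 4.
    gt_masks:   (N, 3, H, W) ground-truth binary masks for the three labels:
    target-hand visible area, occluding-hand visible area, occluded-hand visible area."""
    return F.binary_cross_entropy_with_logits(pred_masks, gt_masks.float())

# During training, this scalar is minimized, i.e. the weights are adjusted until the
# difference between the three test outputs and the actual data is as small as possible.
```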
In the embodiment of the present application, the structure of the hand segmentation module shown in fig. 2A is only used for exemplary explanation of the present application, and in other implementations, the hand segmentation module may be different from the structure of fig. 2A, which is not limited by the present application.
B) Hand de-shielding module
Illustratively, as shown in FIG. 2B, the hand de-occlusion module may consist of one or more convolutional networks and a Transformer Block network. Mask image 2, the mask image of the non-target hand (e.g., mask image 3 or mask image 4), and image 1 are input to the hand de-occlusion module after feature fusion (concat). The output of the hand de-occlusion module may be an image 2 including the complete target hand. For the specific operation of the hand de-occlusion module, reference may be made to the embodiment shown in fig. 1A.
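A hedged PyTorch sketch of the data flow of fig. 2B: the two mask images and image 1 are concatenated along the channel dimension (concat), passed through convolutions and a Transformer block, and decoded back into an RGB image 2. The channel counts and the use of nn.TransformerEncoderLayer stand in for the unspecified convolutional networks and Transformer Block network.

```python
import torch
import torch.nn as nn

class HandDeOcclusionModule(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Input: image 1 (3 ch) + mask image 2 (1 ch) + non-target mask (1 ch) = 5 channels.
        self.encoder = nn.Sequential(nn.Conv2d(5, dim, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.transformer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                      batch_first=True)
        self.decoder = nn.Conv2d(dim, 3, 3, padding=1)   # image 2 (RGB)

    def forward(self, image1, mask2, non_target_mask):
        x = torch.cat([image1, mask2, non_target_mask], dim=1)   # feature fusion (concat)
        feat = self.encoder(x)                                   # (N, dim, H, W)
        n, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                 # (N, H*W, dim)
        tokens = self.transformer(tokens)
        feat = tokens.transpose(1, 2).reshape(n, c, h, w)
        return torch.sigmoid(self.decoder(feat))                 # image 2 with the complete target hand
```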
In the embodiment of the present application, the structure of the hand de-shielding module shown in fig. 2B is only used for exemplary explanation of the present application, and in other implementations, the hand de-shielding module may be a structure different from that of fig. 2B, which is not limited by the present application.
C) Hand gesture estimation module
Illustratively, as shown in fig. 2C, the hand pose estimation module may be composed of a convolutional neural network, for example a KeyNet network. The input to the hand pose estimation module may be an image 2 including the complete target hand, and the output may be the coordinates of the hand joint points in 3D space. For the specific operation of the hand pose estimation module, reference may be made to the embodiment shown in fig. 1A.
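As a structural illustration of a KeyNet-style pose estimator, the sketch below regresses 21 x 3 joint coordinates from image 2 with a small CNN; the layer sizes are placeholders and the joint count of 21 is only one of the options mentioned above.

```python
import torch
import torch.nn as nn

class HandPoseEstimationModule(nn.Module):
    def __init__(self, num_joints: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.regressor = nn.Linear(64, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, image2: torch.Tensor) -> torch.Tensor:
        """image2: (N, 3, H, W) image containing the complete target hand.
        Returns (N, num_joints, 3) coordinates of the hand joint points in 3D space."""
        feat = self.features(image2).flatten(1)          # (N, 64)
        return self.regressor(feat).view(-1, self.num_joints, 3)
```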
In the embodiment of the present application, the structure of the hand gesture estimation module shown in fig. 2C is only used for exemplary explanation of the present application, and in other implementations, the hand gesture estimation module may be a structure different from that of fig. 2C, which is not limited by the present application.
In an embodiment of the present application, image 1 may be referred to as a first image and image 2 may be referred to as a second image. Mask image 1 may be referred to as a fourth mask image, mask image 2 may be referred to as a first mask image, mask image 3 may be referred to as a second mask image, mask image 4 may be referred to as a third mask image, and the feature image of image 1 may be referred to as a first feature image. The backhaul network 1, the Head network 1, and the Box layer network 1 may be collectively referred to as a first split network, and the Head network 2 and the Box layer network 2 may be collectively referred to as a second split network.
Fig. 3 schematically illustrates a hardware structure of the electronic device 100 according to an embodiment of the present application.
In the embodiment of the present application, the electronic device 100 may be the electronic device described above.
As shown in fig. 3, electronic device 100 may include a processor 301, a memory 302, a wireless communication module 303 (optional), a display 304, a camera 305, an audio module 306 (optional), and a microphone 307 (optional), where processor 301, memory 302, wireless communication module 303 (optional), display 304, camera 305, audio module 306 (optional), and microphone 307 (optional) may be connected by a bus.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may also include more or fewer components than shown in FIG. 3, or may combine certain components, or split certain components, or a different arrangement of components. The components shown in fig. 3 may be implemented in hardware, software, or a combination of software and hardware.
The processor 301 may include one or more processor units, for example, the processor 301 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 301 for storing instructions and data. In some embodiments, the memory in the processor 301 is a cache memory. The memory may hold instructions or data that the processor 301 has just used or recycled. If the processor 301 needs to reuse the instruction or data, it may be called directly from the memory. Repeated accesses are avoided and the latency of the processor 301 is reduced, thus improving the efficiency of the system.
In some embodiments, processor 301 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a USB interface, among others.
A memory 302 is coupled to the processor 301 for storing various software programs and/or sets of instructions. In particular implementations, memory 302 may include volatile memory, such as random access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD); memory 302 may also include a combination of the above types of memory. The memory 302 may also store program code, so that the processor 301 invokes the program code stored in the memory 302 to implement the methods of the embodiments of the present application in the electronic device 100. The memory 302 may store an operating system, such as an embedded operating system, for example uCOS, VxWorks, or RTLinux.
The wireless communication module 303 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area network (WLAN) (e.g., wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 303 may be one or more devices integrating at least one communication processing module. The wireless communication module 303 receives electromagnetic waves via an antenna, modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 301. The wireless communication module 303 may also receive a signal to be transmitted from the processor 301, frequency modulate and amplify the signal, and convert the signal into electromagnetic waves to radiate through the antenna. In some embodiments, the electronic device 100 may also probe or scan for devices in the vicinity of the electronic device 100 by transmitting signals through a Bluetooth module (not shown in fig. 3) or a WLAN module (not shown in fig. 3) in the wireless communication module 303, and establish a wireless communication connection with the nearby devices to transmit data. The Bluetooth module may provide a solution including one or more of classic Bluetooth (basic rate/enhanced data rate, BR/EDR) or Bluetooth low energy (BLE) communication, and the WLAN module may provide a solution including one or more of Wi-Fi direct, Wi-Fi LAN, or Wi-Fi softAP.
The display 304 may be used to display images, video, and the like. The display 304 may include a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 304, N being a positive integer greater than 1.
The camera 305 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, the electronic device 100 may include 1 or N cameras 305, N being a positive integer greater than 1.
The audio module 306 may be used to convert digital audio information to an analog audio signal output, and may also be used to convert an analog audio input to a digital audio signal. The audio module 306 may also be used to encode and decode audio signals. In some embodiments, the audio module 306 may also be provided in the processor 301, or part of the functional modules of the audio module 306 may be provided in the processor 301.
The microphone 307, which may also be referred to as a "mic" or "mike", may be used to collect sound signals in the environment surrounding the electronic device, convert the sound signals into electrical signals, and then subject the electrical signals to a series of processes, such as analog-to-digital conversion, to obtain digital audio signals that can be processed by the processor 301 of the electronic device. When making a call or transmitting voice information, the user can speak near the microphone 307, inputting a sound signal to the microphone 307. The electronic device 100 may be provided with at least one microphone 307. In other embodiments, the electronic device 100 may be provided with two microphones 307, which, in addition to collecting sound signals, may implement a noise reduction function. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 307 to enable collection of sound signals, noise reduction, identification of sound sources, directional recording functions, etc.
The electronic device 100 may also include a sensor module (not shown in fig. 3). The sensor module may include a plurality of sensing devices, such as touch sensors (not shown in fig. 3), and the like. Touch sensors may also be referred to as "touch devices". The touch sensor may be disposed on the display screen 304, and the touch sensor and the display screen 304 form a touch screen, which is also referred to as a "touch screen". The touch sensor may be used to detect touch operations acting on or near it.
It should be noted that, the electronic device 100 shown in fig. 3 is only for exemplarily explaining the hardware structure of the electronic device provided by the present application, and does not limit the present application in particular.
As used in the above embodiments, the term "when …" may be interpreted to mean "if …" or "after …" or "in response to determination …" or "in response to detection …" depending on the context. Similarly, the phrase "at the time of determination …" or "if detected (a stated condition or event)" may be interpreted to mean "if determined …" or "in response to determination …" or "at the time of detection (a stated condition or event)" or "in response to detection (a stated condition or event)" depending on the context.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: ROM or random access memory RAM, magnetic or optical disk, etc.

Claims (8)

1. A hand pose estimation method, comprising:
acquiring a first image, wherein the first image comprises a hand of a user;
generating a fourth mask image and a first feature image of the first image based on the first image through a first segmentation network; wherein the fourth mask image is used for representing the region of the hand outline on the first image;
inputting the fourth mask image and the first characteristic image into a second segmentation network, and adjusting the weight of the hand outline in the first characteristic image in the second segmentation network to be larger than the weight of the non-hand outline in the second segmentation network based on the fourth mask image;
generating, by the second segmentation network, a plurality of mask images based on the weight of the hand contour in the second segmentation network and the weight of the non-hand contour in the second segmentation network;
wherein the plurality of mask images comprise a first mask image, a second mask image and a third mask image; the first mask image is used for representing a visible area of a target hand in the first image, the second mask image is used for representing a visible area of a hand that occludes the target hand in the first image, and the third mask image is used for representing a visible area of a hand that is occluded by the target hand in the first image;
determining a first occlusion relation based on the plurality of mask images; the first occlusion relation is used for representing an occlusion relation between the target hand and the non-target hand;
when the first occlusion relation is that the target hand occludes the non-target hand, the first mask image shows a visible area of the target hand in the first image, the second mask image does not show a visible area of the hand occluded by the target hand in the first image, and the third mask image shows a visible area of the hand occluded by the target hand in the first image;
when the first occlusion relationship is that the non-target hand occludes the target hand, the first mask image shows a visible region of the target hand in the first image, the second mask image shows a visible region of the hand occluding the target hand in the first image, and the third mask image does not show a visible region of the hand occluded by the target hand in the first image;
generating a second image from the first image based on the first occlusion relation, the second image including the complete target hand;
and determining the hand gesture of the user from the second image.
2. The method according to claim 1, wherein the generating a second image from the first image based on the first occlusion relation, in particular comprises:
when the first occlusion relation is that the non-target hand occludes the target hand, removing part or all of the visible area of the non-target hand from the first image, completing the occluded area of the target hand, and generating a second image;
and when the first occlusion relation is that the target hand occludes the non-target hand, removing part or all of the visible area of the non-target hand from the first image, and generating a second image.
3. The method according to claim 2, wherein said complementing the area of the target hand that is obscured comprises:
determining filling information from the part of the target hand which is not covered;
and filling the filling information into the covered area of the target hand.
4. A method according to claim 3, wherein the padding information comprises one or more of: color and texture features of the portion of the target hand that is not obscured.
5. The method according to claim 1, wherein determining the hand gesture of the user from the second image comprises:
extracting one or more hand joints from the second image;
determining coordinates of the one or more hand joint points in a 3D space;
and determining the hand gesture of the user based on the coordinates of the one or more hand joints in the 3D space.
6. An electronic device, comprising: one or more processors and one or more memories; the one or more memories coupled to the one or more processors, the one or more memories for storing a computer executable program that, when executed by the one or more processors, causes the electronic device to perform the method of any of claims 1-5.
7. A chip system comprising processing circuitry and interface circuitry, the interface circuitry to receive code instructions and to transmit to the processing circuitry, the processing circuitry to execute the code instructions to cause the chip system to perform the method of any of claims 1-5.
8. A computer readable storage medium comprising a computer executable program which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1-5.
CN202310909281.1A 2023-07-24 2023-07-24 Hand gesture estimation method and electronic equipment Active CN116631011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310909281.1A CN116631011B (en) 2023-07-24 2023-07-24 Hand gesture estimation method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310909281.1A CN116631011B (en) 2023-07-24 2023-07-24 Hand gesture estimation method and electronic equipment

Publications (2)

Publication Number Publication Date
CN116631011A CN116631011A (en) 2023-08-22
CN116631011B true CN116631011B (en) 2023-10-20

Family

ID=87613813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310909281.1A Active CN116631011B (en) 2023-07-24 2023-07-24 Hand gesture estimation method and electronic equipment

Country Status (1)

Country Link
CN (1) CN116631011B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9025022B2 (en) * 2012-10-25 2015-05-05 Sony Corporation Method and apparatus for gesture recognition using a two dimensional imaging device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8837839B1 (en) * 2010-11-03 2014-09-16 Hrl Laboratories, Llc Method for recognition and pose estimation of multiple occurrences of multiple objects in visual images
CN112336342A (en) * 2020-10-29 2021-02-09 深圳市优必选科技股份有限公司 Hand key point detection method and device and terminal equipment
CN114882493A (en) * 2021-01-22 2022-08-09 北京航空航天大学 Three-dimensional hand posture estimation and recognition method based on image sequence
CN115083021A (en) * 2022-07-20 2022-09-20 北京市商汤科技开发有限公司 Object posture recognition method and device, electronic equipment and storage medium
CN115205161A (en) * 2022-08-18 2022-10-18 荣耀终端有限公司 Image processing method and device
CN115713702A (en) * 2022-11-24 2023-02-24 中国电子科技集团公司第二十研究所 Object occlusion relation confirmation method based on augmented reality and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiang Jie et al., "Research on hand segmentation algorithms based on deep learning" (基于深度学习的手分割算法研究), Intelligent Computer and Applications, 2019, No. 3, pp. 254-262. *

Also Published As

Publication number Publication date
CN116631011A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
WO2020238741A1 (en) Image processing method, related device and computer storage medium
WO2021078001A1 (en) Image enhancement method and apparatus
CN114119758B (en) Method for acquiring vehicle pose, electronic device and computer-readable storage medium
KR20150077646A (en) Image processing apparatus and method
CN109756763B (en) Electronic device for processing image based on priority and operating method thereof
KR20150099317A (en) Method for processing image data and apparatus for the same
CN108616776B (en) Live broadcast analysis data acquisition method and device
WO2020197070A1 (en) Electronic device performing function according to gesture input and operation method thereof
CN115375827B (en) Illumination estimation method and electronic equipment
CN111027490A (en) Face attribute recognition method and device and storage medium
US20210250498A1 (en) Electronic device and method for displaying image in electronic device
CN115061770A (en) Method and electronic device for displaying dynamic wallpaper
CN115526787A (en) Video processing method and device
CN109981989B (en) Method and device for rendering image, electronic equipment and computer readable storage medium
CN115150542B (en) Video anti-shake method and related equipment
EP4209996A1 (en) Target tracking method and electronic device
KR102609734B1 (en) electronic device and method for revising image based on transfer status of image
WO2024109207A1 (en) Method for displaying thumbnail, and electronic device
CN115908120B (en) Image processing method and electronic device
CN116631011B (en) Hand gesture estimation method and electronic equipment
CN114827442A (en) Method and electronic device for generating image
CN110163192B (en) Character recognition method, device and readable medium
CN115767290A (en) Image processing method and electronic device
CN115358937A (en) Image de-reflection method, medium and electronic device
CN109729264B (en) Image acquisition method and mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant