CN114138121A - User gesture recognition method, device and system, storage medium and computing equipment

User gesture recognition method, device and system, storage medium and computing equipment

Info

Publication number
CN114138121A
CN114138121A (application CN202210117079.0A; granted as CN114138121B)
Authority
CN
China
Prior art keywords
user
image information
depth
hand
gesture
Prior art date
Legal status
Granted
Application number
CN202210117079.0A
Other languages
Chinese (zh)
Other versions
CN114138121B (en)
Inventor
冯翀
张梓航
王丽婷
郭嘉伟
张梦遥
王宇轩
Current Assignee
Beijing Shenguang Technology Co ltd
Original Assignee
Beijing Shenguang Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shenguang Technology Co ltd
Priority to CN202210117079.0A
Publication of CN114138121A
Application granted
Publication of CN114138121B
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures

Abstract

The application discloses a user gesture recognition method, a device, a system, a storage medium and a computing device, wherein the method comprises the following steps: starting one or a group of pre-erected camera devices to acquire images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices; reading image information acquired by one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of the user hand in the user hand image; and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip nodes, and if so, identifying the gesture of the user according to the image information to execute the specified operation of the user. The application solves the technical problem that the remote touch effect is poor in the prior art.

Description

User gesture recognition method, device and system, storage medium and computing equipment
Technical Field
The application relates to the technical field of human-computer interaction, in particular to a user gesture recognition method, device and system, a storage medium and a computing device.
Background
A common touch recognition scheme is based on infrared light and an infrared camera: an infrared signal is emitted across a plane, and when the user performs a touch action such as pressing, the infrared light is blocked; the infrared camera then captures the light spot produced by the blocking, and by analyzing the spot the user's current touch position is obtained. However, this technique has many limitations.
The infrared projection must lie closely against the touch surface so that the light spots blocked by the user can be correctly recognized by the infrared camera, and recognition is easily disturbed on a non-planar surface, such as a sphere. Moreover, when an object is placed on the interaction plane (generally a desktop), false touches can occur, and the object breaks the continuity of the infrared grating layer, causing faults on the interaction plane.
The infrared touch recognition method is limited by the equipment: an infrared signal emitting unit and an infrared camera for recognizing infrared occlusion must be installed on the plane, the equipment is inconvenient to move and extend, and the usage scenarios are easily restricted.
The method is also limited by distance: infrared touch requires two devices to cooperate, so touch at a longer distance easily encounters two problems. First, the infrared emitting unit may be impossible to install; second, over a long distance the air affects the infrared light and degrades the infrared camera's recognition. The method is therefore not applicable to remote control scenarios.
Aiming at the technical problem of poor remote touch effect in the prior art, no effective solution is provided at present.
Disclosure of Invention
The embodiment of the application provides a user gesture recognition method, a user gesture recognition device, a user gesture recognition system, a storage medium and computing equipment, and at least solves the technical problem that in the prior art, a remote touch effect is poor.
According to an aspect of an embodiment of the present application, there is provided a user gesture recognition method, including: starting one or a group of pre-erected camera devices to acquire images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand; reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image; and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the operation specified by the user.
According to another aspect of the embodiments of the present application, there is provided a user gesture recognition apparatus, including: the camera module comprises one or a group of pre-erected camera devices and is used for acquiring images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand; the reading module is used for reading image information acquired by the one or the group of camera devices, wherein the image information at least comprises visual image information and depth image information; the first judgment module is used for judging whether the visual field range contains a user hand image or not according to the visual image information, and if yes, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image; and the second judging module is used for judging whether the gesture operation of the user aims at the interactive surface according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the specified operation of the user.
According to another aspect of the embodiments of the present application, there is provided a user gesture recognition system, including: one or a group of pre-erected camera devices for acquiring images, wherein an interaction surface is arranged in the visual field range of the one camera device or in the at least partially overlapped visual field ranges of the group of camera devices, so that a user can perform gesture operation on the interaction surface with a hand; a pre-erected projection device for projecting a user interaction interface on the interaction surface, the user interaction interface displaying at least one target object operable by the user; a memory storing instructions; and a processor executing the instructions to perform: reading the image information acquired by the one or the group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a user hand image is contained in the visual field range according to the visual image information, and if so, positioning skeleton nodes and fingertip nodes of the user hand in the user hand image; and judging whether the gesture operation of the user is directed at the interactive surface according to the depth image information and the fingertip nodes, and if so, recognizing the user gesture according to the image information to execute the user-specified operation.
On the basis of any one of the above embodiments, predicting a thermodynamic diagram (heatmap) for each bone node from the hand image, wherein the data at a pixel position in the thermodynamic diagram is the probability that the pixel position belongs to a certain bone node, comprises: using the per-pixel label data $\hat{y}_k$ obtained with the hourglass framework to generate a thermodynamic diagram $H_k$ of the hand joint point $k$. The thermodynamic diagram is a probability map consistent with the pixel layout of the picture, and the data at each pixel position $p$ is the probability that the current pixel is a certain joint:

$$H_k(p) = \exp\!\left(-\frac{\lVert p - P_k \rVert^2}{\Sigma}\right)$$

where $H_k$ is the thermodynamic diagram (probability map) describing, for each pixel region, the recognition confidence of the hand joint point, as explained above; $\hat{y}_k$ is the data corresponding to each pixel point obtained by processing the picture with the hourglass framework; $P_k$ is the predicted position of the joint point in the image obtained after hourglass processing; and $\Sigma$ is the local action range control parameter of the thermodynamic diagram, whose value is set to the square of the bandwidth parameter of the Gaussian kernel function. According to the thermodynamic diagram $H_k$, the position $L_k$ of the hand joint point $k$ in the image is further obtained, and the predicted position is further corrected to obtain more accurate position information, where $L_k$ is the corrected position of the hand joint point in the image, calculated from $\hat{y}_k$ and $H_k$.
according to another aspect of the embodiments of the present application, there is provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the method of any of the above embodiments.
According to another aspect of embodiments of the present application, there is provided a computing device comprising a processor for executing a program, wherein the program executes to perform the method of any of the above embodiments.
In the embodiment of the application, image acquisition is carried out by starting one or a group of pre-erected camera devices, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand; reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image; and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip node, if so, identifying the gesture of the user according to the image information to execute the specified operation of the user, thereby realizing the technical effect of accurate remote touch control and further solving the technical problem of poor remote touch control effect in the prior art.
The application has the following beneficial effects:
1. the touch identification technology can be suitable for more scenes and is not limited by a plane any more, and excellent identification can be realized on a curved surface and a desktop with disordered object distribution.
2. The user can customize the interactive plane to a certain extent without being limited to the desktop. For example, when a user places a book on a desktop, it is also possible to perform a click operation with the book plane as the interactive surface.
3. The equipment for touch identification is optimized, only the RGB camera and the depth camera (or a binocular camera) are needed, and then the algorithm is written into the computing board, so that the identification of user touch can be realized, and the equipment is more flexible.
4. The touch technology can be applied to a remote touch scene, and the remote control requirement of a user can be realized without the assistance of other equipment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing a user gesture recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of user gesture recognition according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a projected user interaction interface according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a neural network infrastructure for identifying a user's palm according to an embodiment of the present application;
FIG. 5 is a schematic diagram of predefined hand region bone joint points according to an embodiment of the present application;
FIG. 6a is a schematic diagram of detecting an image of a user's hand and marking bone joint points in visual image information according to an embodiment of the present application;
FIG. 6b is a schematic diagram of the calibration of the bone joint points of the user's hand at depth image information according to an embodiment of the present application;
FIG. 7 is a flow chart of a method of determining an interactive surface according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another method of determining an interaction surface according to an embodiment of the present application;
FIG. 9 is a flow chart of another method of user gesture recognition according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a user gesture recognition apparatus according to an embodiment of the present application.
Detailed Description
The embodiments described below are only a part of the embodiments of the present application, not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
There is also provided, in accordance with an embodiment of the present application, an embodiment of a method for user gesture recognition, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing a user gesture recognition method. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more (shown as 102a, 102b, … …, 102 n) processors 102a-102n (the processors 102a-102n may include, but are not limited to, processing devices such as a microprocessor MCU or programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. Besides, the method can also comprise the following steps: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The one or more processors 102a-102n and/or other data processing circuitry described above may be embodied in whole or in part as software, hardware, firmware, or any combination thereof. The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the user gesture recognition determination method in the embodiment of the present application, and the processors 102a to 102n execute various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the above method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processors 102a-102n, which may be connected to the computer terminal 10 via a network. The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
Here, it should be noted that in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a particular specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
The application operates a user gesture recognition method as shown in fig. 2 in the above operating environment. Fig. 2 is a flowchart of a user gesture recognition method according to an embodiment of the present application, where the method is applied to a mobile terminal, a computer, a projector, and other devices, and all of the devices may be implemented based on the computer terminal shown in fig. 1.
Referring to fig. 2, the user gesture recognition method may include:
step S202: starting one or a group of pre-erected camera devices to acquire images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand;
in one alternative, the camera may be an RGB camera, or an RGB camera + a depth camera, or a binocular depth camera, or a TOF camera. The number of cameras included in the camera device is not limited, and only the visual image information and the depth image information can be acquired, wherein the visual image information can be color image information, gray image information or even infrared image information. For example, the depth can be recognized by three-dimensional structured light by directly using RGB and depth cameras, and the combination can achieve good recognition effect on the closer distance within 40 cm. For another example, the binocular camera may be slightly behind in real-time due to the characteristic of large calculation amount of the binocular camera, but the recognition range is no longer limited by 40cm, and the depth information can be recognized in a relatively close condition, so as to output an RGB diagram, and the joint prediction result is combined to jointly determine the touch behavior. Moreover, the binocular camera can simultaneously acquire RGB and depth images (the depth images need to be calculated through the RGB images), and image matching is simple.
In an alternative, when a group of camera devices, i.e., a plurality of cameras, is pre-erected, each camera may still be individually configured according to the method described in the present application, and that situation falls within the scope of the solution of pre-erecting a single camera device. Pre-erecting a group of camera devices focuses instead on the fused linkage between the cameras in the group, which requires that the group of cameras have at least a small portion of overlapping field of view so that they can be linked to recognize a gesture movement of the user within the overlapping field of view.
In an alternative, the interactive surface may be a plane, a curved surface with any curvature, or even an uneven surface, and the interactive surface may refer to a real object surface, such as a desktop or a wall, or may refer to a virtual object surface, such as a virtual surface corresponding to the object surface after any object with a certain height is placed on the desktop.
Step S204: reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information;
in one alternative, the visual image information and the depth image information can be acquired by different camera devices and processed synchronously; but also by the same camera device. In an optional implementation manner, the visual image information and the depth image information correspond to each other in both a time dimension and a space dimension, and in short, the visual image information and the depth image information are images of the same camera shooting region at the same time, so that depth fusion can be performed on the visual image information and the depth image information.
In an alternative, the visual image information may be, for example, original color image information or grayscale image information acquired by the camera device, an original visual image acquired by the camera device, or an original visual image after preliminary processing. The visual image information is used for monitoring the hand contour, the finger joint nodes and the fingertip nodes through a preset algorithm, and the preset algorithm can be realized through an image recognition technology.
In an alternative, in a specific embodiment, the depth image refers to an image in which the distance (depth value) from the camera to each point in the captured scene is taken as the pixel value; that is, the depth image information may be an image in which the pixel values of all points are depth values. In another specific embodiment, the depth image information may also be the depth information of the pixel points in a local area of the visual image. In this case, no special depth camera is required to capture the depth image information: only the visual image information is processed, the hand image therein is identified, and the skeleton nodes and fingertip nodes of the user's hand are located accordingly, so that the visual image information is processed again to obtain the depth information in the area corresponding to the user's hand image, or even only at the fingertip nodes. This can simplify the system hardware resources, because no special depth camera is needed, and can also simplify the system processor resources, because there is no need to process depth information in all image regions.
Step S206: judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image;
in an alternative, the user's hand, such as the palm of the hand, analyzes the visual image information, determines that the user's palm is likely to be included in the visual image information, and determines that the user's hand image is included in the visual field range when the probability of detecting the user's palm is greater than a threshold value. And tracking and processing the hand image of the user, and determining skeleton nodes and fingertip nodes of the hand of the user.
In one alternative, the user's hand is brought within the field of view of, for example, an RGB camera that continuously takes hand pictures, resolves 21 key bone joint points of the hand through system processing, and tracks five key bone joint points of the fingertip.
In the present application, the palm of the user is identified based on the visual image information, and then the key bone joint point image in the palm is identified, and the user touch point is estimated by using the bone joint point. Compared with a scheme of modeling based on a picture and calculating the touch points of the user according to the hand contour and the fingertip image of the user in the picture, the identification technology based on the bone joint points is undoubtedly more efficient and accurate.
Step S208: and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the operation specified by the user.
In one alternative, the difference in depth between the fingertip and the interactive surface may reach a set threshold (about 6-10 mm) when the user clicks on the interactive surface. The depth camera continuously collects depth information in a range, and the system judges whether the depth information is a click event or not by judging the difference between the depth value corresponding to the coordinate of the fingertip node and the depth value of the interactive surface.
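As an illustrative sketch (not the patent's implementation) of this depth-difference check, the snippet below compares the depth at a fingertip node's pixel with the depth of the interactive surface; the array layout, coordinate format, and the 8 mm threshold chosen from the 6-10 mm range above are assumptions:

```python
import numpy as np

# Assumed inputs: a depth frame in millimetres aligned with the visual frame,
# the (x, y) pixel coordinate of a tracked fingertip node, and the depth of
# the interactive surface at that pixel (see the surface-modelling section).
CLICK_THRESHOLD_MM = 8.0  # assumed value inside the 6-10 mm range mentioned above

def is_click(depth_frame_mm: np.ndarray,
             fingertip_xy: tuple,
             surface_depth_mm: float,
             threshold_mm: float = CLICK_THRESHOLD_MM) -> bool:
    """Return True if the fingertip is close enough to the interactive surface."""
    x, y = fingertip_xy
    fingertip_depth = float(depth_frame_mm[y, x])   # rows index y, columns index x
    return abs(surface_depth_mm - fingertip_depth) <= threshold_mm
```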
On the basis of any of the above embodiments, in step S202: before starting one or a group of pre-arranged camera devices for image acquisition, the method further comprises the following steps: one or a group of cameras is initialized.
On the basis of any of the above embodiments, the method further comprises: step S201: and starting the projection device to project a user interaction interface on the interaction surface, wherein at least one target object which can be operated by a user is displayed on the user interaction interface.
Fig. 3 is a schematic diagram of a projected user interface according to an embodiment of the present application, and as shown in fig. 3, the user interface projected on the desktop includes a plurality of target objects, such as pictures, text controls, control buttons, links, application icons, web pages, program interfaces, and the like, and has a plurality of selectable main themes, such as eye protection, night protection, and the like. The content displayed by the user interaction interface is not limited, and the method can be applied to the user interaction interface of the application as long as the content can be displayed on the user interaction interfaces of other technologies.
In an alternative, the projection device, such as a projector, is initialized and projects the current operation interface information onto the corresponding plane. The RGB camera and the depth camera start working and continuously feed the picture information stream into the system.
On the basis of any of the above embodiments, step S206: judging whether the visual field range contains the hand image of the user according to the visual image information comprises the following steps:
step S2062: establishing a neural network model, wherein the neural network model comprises a feature extraction convolution layer, a candidate region extraction network, a candidate region pooling layer and a classifier, and the method comprises the following steps:
inputting the visual image information from the feature extraction convolution layer to the neural network model;
the characteristic extraction convolution layer is used for processing picture frames in the input visual image information into a characteristic diagram;
the candidate region extraction network is used for setting a preset number of candidate regions on the characteristic map scale and judging whether the candidate regions are positive samples or negative samples by utilizing the depth network, wherein the positive samples indicate that the candidate regions contain the hand images of the user, and the negative samples indicate that the candidate regions do not contain the hand images of the user; specifically, on the feature map scale, a plurality of candidate frames are set for each point in a plurality of sizes and a plurality of length-width ratios, and then, specifically, the candidate region extraction network is configured to set a total of 9 candidate frames with respective sizes of (0.5 ), (1,0.5), (0.5,1), (1,1), (2,1), (1,2), (2,2), (4,2), (2,4) for each point in a three-size and three-length-width ratios on the feature map scale, and then, a depth network is used to determine which regions are positive samples and which regions are negative samples.
The candidate area pooling layer is used for pooling the feature map judged as the positive sample area into an image with a fixed size;
the classifier is used for classifying the pooled images so as to judge whether the visual image information input to the neural network model contains the user hand image according to the classification result.
FIG. 4 is a schematic diagram of a neural network infrastructure for identifying a user's palm according to an embodiment of the present application. As shown in FIG. 4, the palm detection model employs a Faster R-CNN-based architecture for single-class detection. The basic architecture of Faster R-CNN is shown in the figure and mainly comprises 4 parts: the feature extraction convolutional layers (Conv Layers), the candidate region extraction network (Region Proposal Network), the candidate region pooling layer (ROI Pooling), and the classifier (Classifier); their specific functions and roles are the same as described above. After the palm is detected, an optical-flow-based palm tracking algorithm can be adopted to improve the frame rate of palm tracking.
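The nine candidate boxes per feature-map point described above can be enumerated as in the sketch below. This is only an illustration of the anchor layout, not the patent's or Faster R-CNN's actual code; the stride value and the interpretation of the sizes as feature-map units scaled into image pixels are assumptions:

```python
import numpy as np

# Nine candidate-box sizes per feature-map point, as listed above: (width, height).
ANCHOR_SIZES = [(0.5, 0.5), (1, 0.5), (0.5, 1),
                (1, 1), (2, 1), (1, 2),
                (2, 2), (4, 2), (2, 4)]

def generate_anchors(feat_h: int, feat_w: int, stride: int = 16) -> np.ndarray:
    """Enumerate (x1, y1, x2, y2) boxes centred on every feature-map cell."""
    boxes = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            cx, cy = (fx + 0.5) * stride, (fy + 0.5) * stride  # centre in image pixels
            for w, h in ANCHOR_SIZES:
                bw, bh = w * stride, h * stride
                boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return np.array(boxes, dtype=np.float32)   # shape: (feat_h * feat_w * 9, 4)
```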
On the basis of any of the above embodiments, step S206: locating skeletal nodes and fingertip nodes of a user hand in the user hand image comprises:
step S2065: predefining a hand area with a preset number of skeleton nodes;
in an alternative, fig. 5 is a schematic diagram of predefined hand region skeleton nodes according to an embodiment of the present application, and as shown in fig. 5, the hand region includes 21 skeleton nodes, where the nodes 4,8,12,16, and 20 are fingertip nodes. In the embodiment of the application, the skeleton nodes comprise fingertip nodes, all or most of the skeleton nodes are marked from the image according to the definition of the skeleton nodes to improve the calibration accuracy, and then the skeleton nodes at the end of the fingertip in the skeleton nodes are marked as the fingertip nodes.
Step S2066: marking each bone node of the preset number in the hand image;
step S2067: predicting a thermodynamic diagram for each bone node according to the hand image, wherein data of a pixel position in the thermodynamic diagram is the probability that the pixel position belongs to a certain bone node;
step S2068: determining the position of each bone node according to the probability;
step S2069: and determining a skeleton node at the end of the hand as the fingertip node.
FIG. 6a is a schematic diagram of detecting a hand image of a user in visual image information according to an embodiment of the present application. As shown in FIG. 6a, the RGB camera captures a hand picture, and a palm region is detected in the picture by palm detection. The system then analyzes the 21 key bone nodes of the palm detected in the first step through the hand bone joint point calibration algorithm and stores the coordinates of these 21 nodes. When the user's hand moves, the system tracks the five key nodes of the fingertips and updates the node coordinate queue in real time according to the changes of the nodes.
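A minimal sketch of such a fingertip coordinate queue is shown below; the class name, history length, and data layout are assumptions, not structures defined by the patent:

```python
from collections import deque

FINGERTIP_IDS = (4, 8, 12, 16, 20)   # fingertip nodes among the 21 skeleton nodes (FIG. 5)

class FingertipTracker:
    """Keep a short history of fingertip coordinates for later multi-frame analysis."""

    def __init__(self, history_frames: int = 5):   # 5 frames = 100 ms at 50 fps (assumed)
        self.history = deque(maxlen=history_frames)

    def update(self, keypoints_xy):
        """keypoints_xy: list of 21 (x, y) tuples from the calibration algorithm."""
        fingertips = {i: keypoints_xy[i] for i in FINGERTIP_IDS}
        self.history.append(fingertips)

    def latest(self):
        return self.history[-1] if self.history else None
```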
On the basis of any of the above embodiments, step S2067: predicting a thermodynamic diagram for each bone node according to the hand image, wherein the data of a pixel position in the thermodynamic diagram is the probability that the pixel position belongs to a certain bone node, comprises the following steps:
Step S20672: processing the hand image with the hourglass framework to obtain the annotation data $\hat{y}_k$ corresponding to each pixel point and the predicted position $P_k$ of a skeletal node $k$ in the hand image;
Step S20674: generating a thermodynamic diagram $H_k$ of the bone node $k$ based on the following formula:

$$H_k(p) = \exp\!\left(-\frac{\lVert p - P_k \rVert^2}{\Sigma}\right)$$

The thermodynamic diagram is a probability map corresponding to the hand image, and the data at each pixel position $p$ is the probability that the current pixel belongs to a certain bone node, where $\Sigma$ is the local action range control parameter of the thermodynamic diagram, whose value is set to the square of the bandwidth parameter of the Gaussian kernel function;
Step S20676: correcting the predicted position $P_k$ based on the annotation data $\hat{y}_k$ and the thermodynamic diagram $H_k$ to obtain the corrected predicted position $L_k$.
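A small sketch of these two steps follows. The Gaussian heatmap matches the formula above; the refinement rule (taking the pixel that maximises the product of the label data and the heatmap) is an assumption, since the patent only states that the correction combines the two:

```python
import numpy as np

def joint_heatmap(pred_xy, height, width, sigma_sq):
    """Gaussian heatmap H_k centred on the hourglass prediction P_k.

    pred_xy  : (x, y) predicted joint position P_k
    sigma_sq : local action range parameter (square of the Gaussian bandwidth)
    """
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - pred_xy[0]) ** 2 + (ys - pred_xy[1]) ** 2
    return np.exp(-d2 / sigma_sq)

def refine_position(label_map, heatmap):
    """One possible correction of P_k using the per-pixel label data y_k
    (assumption: take the pixel maximising the product of the two maps)."""
    idx = np.argmax(label_map * heatmap)
    y, x = np.unravel_index(idx, heatmap.shape)
    return int(x), int(y)
```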
the bone joint point calibration model in the second step needs to output three types of data, the first is the x and y coordinates of 21 key points, the second is the value marking the existence probability of the hand, and the third is the identifier of whether the hand model is a left hand or a right hand.
The algorithm uses a model similar to CPM (Convolutional Pose Machine) to output the coordinates of the 21 key points of the hand and uses a multi-scale mode to enlarge the receptive field. The algorithm uses the model to quickly judge, in real time, whether a reasonable hand structure exists in the palm detection box given by the palm detection model of the first step, and outputs a probability. If the probability is less than a certain threshold, the palm detection model of the first step is restarted.
The specific formulas for joint point prediction are as follows; the user's hand posture can be obtained from them.
Formula one: the label data $\hat{y}_k$ obtained with the hourglass framework is used to generate the thermodynamic diagram $H_k$ of the hand joint point $k$ (the thermodynamic diagram is a probability map consistent with the pixel layout of the picture, but the data at each pixel position is the probability that the current pixel is a certain joint, and joint information is further obtained based on probability analysis), where $\Sigma$ is the local action range control parameter of the thermodynamic diagram, whose value is set to the square of the bandwidth parameter of the Gaussian kernel function:

$$H_k(p) = \exp\!\left(-\frac{\lVert p - P_k \rVert^2}{\Sigma}\right)$$

Formula two: according to the thermodynamic diagram $H_k$ predicted by formula one, the position $L_k$ of the hand joint point $k$ in the image is further obtained (the predicted position is further corrected on the basis of $\hat{y}_k$ and $H_k$ to obtain more accurate position information).
Then, as to how a specific motion is obtained: gestures are first classified, a position area is given for each joint under each class, and the current motion is determined as long as every joint falls within its corresponding area.
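The per-class "joint falls within its area" test can be sketched as follows; the gesture names and region boxes are illustrative placeholders, not values taken from the patent:

```python
# Each gesture class is described by a box (x1, y1, x2, y2) per joint index;
# the current action is recognised when every listed joint falls inside its box.
GESTURE_TEMPLATES = {
    "open_palm": {4: (10, 10, 60, 60), 8: (70, 10, 120, 60)},   # hypothetical regions
    "fist":      {4: (40, 80, 90, 130), 8: (40, 80, 90, 130)},
}

def classify_gesture(joints_xy, templates=GESTURE_TEMPLATES):
    """joints_xy: dict mapping joint index -> (x, y) corrected position L_k."""
    for name, regions in templates.items():
        ok = all(
            j in joints_xy
            and x1 <= joints_xy[j][0] <= x2
            and y1 <= joints_xy[j][1] <= y2
            for j, (x1, y1, x2, y2) in regions.items()
        )
        if ok:
            return name
    return None
```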
On the basis of any of the above embodiments, step S208: judging whether the gesture operation of the user aims at the interactive surface according to the depth image information and the fingertip node comprises the following steps:
step S2081: determining a first depth value of the interaction surface from the depth image information;
in an alternative, the interactive surface may be a real physical surface or a virtual surface, and an interactive surface may be determined according to the depth value of each pixel point in the image, so that the depth value of the interactive surface may be known, or the interactive surface with a fixed position may be determined in advance, so that the depth value of the interactive surface is determined directly according to the fixed position.
Step S2082: determining a second depth value of the fingertip node according to the depth image information and the coordinate of the fingertip node;
in one alternative, the depth image information and the visual image information correspond in both physical space and time dimensions, i.e., the hand motion at the same time in the same area. And after the hand and skeleton nodes of the user are determined according to the visual image information, the hand and skeleton nodes of the user are correspondingly marked in the depth image information. FIG. 6b is a schematic diagram of calibrating an image of a user's hand at depth image information according to an embodiment of the present application; the labeling results are shown in FIG. 6 b.
Step S2083: when the difference value between the first depth value and the second depth value is smaller than a preset threshold value, determining that the gesture operation of the user aims at the interactive surface.
In one alternative, when the depth of the user's fingertip node is eventually substantially close to the depth of the interactive surface, it can be considered that the user's gesture operation is directed to the interactive surface. Specifically, when the user clicks on the interactive interface, the depth difference between the fingertip and the desktop reaches a set threshold (about 5-8 mm). When the method of RGB + depth camera is adopted, the depth camera continuously collects the depth information in the range, and the system judges whether the click event is a click event by judging the difference between the depth value corresponding to the coordinate of the key bone node of the fingertip and the depth value of the desktop. The first step is as follows: when the user presses the position to be interacted with a finger or other shielding object, the depth difference from the interaction surface reaches a threshold value, which is usually 6-10 mm. The second step is that: the system tracks the user fingertip, when the system finds that the depth of the corresponding coordinate of the user fingertip and the depth of the desktop reach a threshold value preset by the system, the system confirms that the click event is an effective click event, and stores the current click coordinate.
In an alternative scheme, when the binocular depth camera mode is adopted, the current scene is shot by the binocular camera to obtain left and right pictures of the current scene, and the information is preprocessed and slightly corrected. After the processed pictures are obtained, the positions of all bone joint points of the hand are predicted and tracked by further using the hand detection algorithm and the hand bone node calibration algorithm based on a convolutional neural network, so that the current hand posture of the user is obtained, stored, and kept for use in the next step. The depth of the scene objects is calculated in real time from the two pictures acquired by the binocular camera, and if the distance between a hand bone joint point and the projection plane is judged to be less than 5 mm, the user action is judged to be pressing the plane.
In an alternative scheme, the touch detection part defines fingertip parts of two hands as effective touch parts, coordinates of 5 fingertip part key points output by the hand key point detection model are compared with a depth map of a background of the touch surface, and the depth difference is within a certain threshold value, namely the effective touch is considered.
On the basis of any of the above embodiments, step S208: if yes, recognizing the user gesture according to the image information to execute user specified operation, wherein the user specified operation comprises the following steps:
step S2084: when the gesture operation of the user is determined to be directed to the interactive surface, determining a target object selected by the gesture operation of the user on the interactive surface, and reading multi-frame image information before and after the moment when the gesture operation of the user is performed on the interactive surface;
for example, when a finger of a user touches the interaction surface, i.e. the difference in depth between the user's fingertip and the interaction surface is less than a threshold value, the gesture operation of the user may be considered to be directed to the interaction surface. At this time, one of the target objects displayed on the interactive surface and available for the user to operate may be selected according to a position where the finger of the user touches the interactive surface. And then, acquiring multi-frame image information before the moment when the finger tip of the user clicks the interactive surface, and judging the complete gesture of the user according to the multi-frame image information.
Step S2085: determining the complete gesture action of the user according to the front and back multiframe image information;
for example, the user's full gesture motion may be a single gesture, such as opening the palm, making a fist, stroking a particular gesture, such as a V-shaped gesture, or a combination of multiple gestures, such as making a fist before stroking a particular gesture.
Step S2086: determining an operation instruction corresponding to the complete gesture action according to a corresponding relation between the pre-stored complete gesture action and the operation instruction;
for example, opening the palm represents a drag action, making a fist represents a zoom-out action, and a stroking V-shaped gesture represents a screenshot action.
Step S2087: and executing the operation instruction aiming at the target object selected by the user gesture operation so as to enable the projection device to change the user interaction interface.
Through the steps, the fact that the user clicks the interactive surface can be firstly determined, namely, the fact that the user has the operation intention is firstly determined, then the gesture action of the user is recognized, and the gesture misoperation of the user can be avoided. Step S2084 may determine a target object or a display element that the user specifically intends to operate on, and step S2085 may determine what operation the user intends to perform on the target object or the display element. For example, the user opens the palm first, i.e. the dragging action is to be performed, and then determines which target object, e.g. which icon or which window, is to be dragged according to the pressing position of the user; or the user firstly extends the index finger to represent a click action, then the click position is determined according to the position of the top end of the index finger touching the interactive surface, and then the function corresponding to the target object/control corresponding to the click position is executed; or the user strokes the V-shaped gesture to represent the screenshot action, and then the screenshot area is determined according to the position of the top end of the index finger touching the interactive surface.
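As an illustration of this two-stage decision (first the press position selects the target object, then the complete gesture selects the operation), the sketch below uses a hypothetical dispatch table; the operation names and data layout are assumptions, not the patent's codes:

```python
# Hypothetical mapping from a recognised complete gesture to an operation instruction.
GESTURE_TO_OPERATION = {
    "open_palm": "drag",
    "fist": "zoom_out",
    "v_stroke": "screenshot",
}

def handle_press(gesture: str, press_xy, ui_objects):
    """Pick the target object under the press position, then apply the
    operation bound to the recognised complete gesture.

    ui_objects: list of dicts with an 'id' and a 'bbox' = (x1, y1, x2, y2).
    """
    operation = GESTURE_TO_OPERATION.get(gesture)
    if operation is None:
        return None
    x, y = press_xy
    for obj in ui_objects:
        x1, y1, x2, y2 = obj["bbox"]
        if x1 <= x <= x2 and y1 <= y <= y2:
            return {"operation": operation, "target": obj["id"]}
    return None
```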
In an alternative, after the click event is determined, click information of the previous frames is obtained from the storage for analyzing the specific action of the user, and the click information is also used as source data for the next analysis. Specifically, the method called by the current user is further calculated according to the current user state and the current user action, and changes needed to be made by the projector are transmitted to the projector, and the method specifically comprises the following steps: and analyzing the finger action of the user by using the multi-frame state information to further obtain the track information of the user.
In an alternative, the specific flow of the multi-frame state analysis method is briefly described below, taking a camera frame rate of 50 frames per second as an example; the present invention is not limited thereto.
(1) When the system judges that the user action in the current picture is the pressing action, a duration needs to be judged, and here, it is assumed that the pressing action lasts for 100ms (i.e. 5 frames) to be a real pressing event, and further, a corresponding processing method is called.
(2) When only one frame shows the user pressing, the system starts a query operation: it first acquires the behavior type of the user in the previous frame, and if that frame is also identified as a pressing behavior at the same hand position, the system continues to acquire the information of the frame before it. When an illegal behavior (a pressing behavior at a different position, or a non-pressing behavior) is encountered, special treatment is applied: that frame is skipped and one more frame is read back.
(3) There are two cases at this point: 1. If the further previous frame is still illegal, the query is terminated, the current frame cannot be counted as a real pressing event, the multi-frame judgment ends, and the system starts to wait for and judge the user behavior of the next frame. 2. If the further previous frame can be identified as a pressing behavior at the same hand position, the illegal behavior encountered before is marked as error data and processed as a pressing behavior at the same hand position.
(4) After inquiry and special processing, if the computing board judges that the same position has been pressed by five continuous frames, the computing board regards the pressing as a real pressing event, and the multi-frame judgment is finished to identify the hand joint points of the subsequent specific pressing positions.
In addition, it should be noted that multi-frame judgment based on other data may also be involved in the method flow; it is similar, except that the process of acquiring the current behavior type of the user differs, while the principle of the multi-frame comparison is the one described above.
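A minimal sketch of the backward multi-frame check described in (1)-(4) is given below; the frame record format and the position-matching rule are simplified assumptions:

```python
# A press becomes a real press event only if roughly five consecutive frames
# show a press at the same hand position, with a single out-of-place frame
# tolerated (skip it and read one more frame back).

def is_real_press(frames, same_pos, required=5):
    """frames: most recent first; each item is (is_press, position).
    same_pos(p1, p2) decides whether two positions count as 'the same'."""
    if not frames or not frames[0][0]:
        return False
    anchor = frames[0][1]
    count, i = 1, 1
    while count < required and i < len(frames):
        is_press, pos = frames[i]
        if is_press and same_pos(pos, anchor):
            count += 1
            i += 1
            continue
        # illegal frame: skip it and look one frame further back
        if i + 1 < len(frames):
            nxt_press, nxt_pos = frames[i + 1]
            if nxt_press and same_pos(nxt_pos, anchor):
                count += 2          # the skipped frame is treated as a press too
                i += 2
                continue
        return False                # two illegal frames in a row: not a real press
    return count >= required
```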
In one alternative, the projection content of the current projector is obtained, the function related to the pressing position is judged in combination with the trajectory information, and the function at the click position is identified. After the involved function is known, if the function is an event call, the call is made and the call information is recorded; if it is a pure mark addition, the mark information is recorded. The call information or mark information generated by the user action is transmitted to the projector. The projector then updates the projection content according to the information transmitted by the computing board: the projector acquires the information of the computing board in real time and compares it with the configuration of the information in the computing board to obtain the information type; if it is mark information, the corresponding content is drawn directly onto the projection content, and if it is call information, the projection interface that needs to be updated is obtained through the storage function of the computing board and displayed.
On the basis of any of the above embodiments, the complete gesture actions at least include a first gesture action and a second gesture action, the first gesture action is used for marking a target object displayed on the user interaction interface, and the second gesture action is used for updating the target object displayed on the user interaction interface, where step S2087: executing the operation instruction to control the projection device to change the target object displayed on the user interaction interface comprises:
step S20871: when the complete gesture action of the user is determined to be the first gesture action, executing a first operation instruction corresponding to the first gesture action, and drawing a corresponding marker pattern at a target object designated by the user on the user interaction interface;
step S20872: and when the complete gesture action of the user is determined to be the second gesture action, executing a second operation instruction corresponding to the second gesture action, reading user interaction interface data needing to be updated, and projecting the user interaction interface data needing to be updated.
On the basis of any of the above embodiments, the interactive surface is a physical surface, and step S2081: determining the first depth value of the interactive surface according to the depth image information includes three ways of steps S2081A, S2081B and S2081C, wherein the method of S2081A and S2081B may be referred to as an overall modeling method, and the method is for the entire contact plane, so that the user can effectively click only when contacting the interactive plane of the desktop, i.e. the interactive surface is fixed; the method of S2081C may be referred to as local point set method, and is directed to a virtual interactive surface, and is not necessarily required to be a desktop. For example, when a user places a book on a desktop, the user can interact with the book plane as the interactive surface, i.e., the interactive surface depends on the user. The following describes in detail three modes of S2081A, S2081B, and S2081C.
On the basis of any one of the above embodiments, the S2081A method includes: step S2081a 1: modeling the physical surface, and determining a first depth value of the physical surface according to the depth image information and the modeling information.
In an alternative, the integral modeling method focuses on modeling of the whole interaction plane, and further judges the depth difference between the fingertip and the modeling plane.
On the basis of any one of the above embodiments, the S2081B method includes: this approach may be directed to a virtual interactive surface,
step S2081B 1: extracting a rolling window from the depth image information;
step S2081B 2: calculating the depth mean value and the depth standard deviation of each pixel point according to the rolling window;
step S2081B 3: judging whether the depth standard deviation is larger than a preset threshold value or not;
step S2081B 4: if yes, returning to the step of extracting a rolling window from the depth image information;
step S2081B 5: if not, modeling the virtual surface based on the depth average value, and determining the depth average value of each pixel point as a first depth value of the virtual surface.
Specifically, the above-mentioned ensemble modeling method uses dynamic modeling, i.e. when a change occurs on a plane (e.g. an object is placed on the plane), the model of the plane should be updated. Fig. 7 is a flowchart of a method for determining an interactive surface according to an embodiment of the present application, and as shown in fig. 7, the method includes the following specific steps:
S1: for each pixel of the depth map, a rolling window (the length of the rolling window used in the algorithm is 3 seconds) is taken out from the depth stream, and the mean value and the standard deviation of the depth of each pixel point are calculated over the rolling window.
S2: when the surface modeling is carried out based on the first rolling window, the mean value of each pixel forms the depth model of the background.
S3: subsequent rolling windows are used for dynamically updating the background model; when the standard deviation calculated over a subsequent rolling window exceeds a certain threshold value, the environment has changed greatly, and the background model is not updated until the standard deviation falls back into the normal range, as sketched below.
On the basis of any of the above embodiments, the interactive surface is a physical surface or a virtual surface, and the S2081C method includes:
step S2081C1: locating a first position of a fingertip node of the user's hand on the interaction surface;
step S2081C2: defining a first local area containing the first location;
step S2081C3: determining the depth mean value of each pixel point in the first local area according to the depth image information;
step S2081C4: and determining the depth mean value of the first local area according to the depth mean value of each pixel point, and taking the depth mean value as the first depth value.
Specifically, unlike the overall modeling method, which simulates the entire interactive plane, the local point set method only focuses on the background depth near the bone node corresponding to the fingertip. The specific process is as follows:
S1: acquire the bone node coordinates of the fingertip from the hand bone node calibration algorithm, and dynamically acquire the background within a certain area around the fingertip, the area being determined by the fingertip coordinates and the length down to the adjacent joint. FIG. 8 is a schematic diagram of another method of determining an interaction surface according to an embodiment of the present application; as shown in FIG. 8, the system performs a mean calculation on the set of points in the circular area to obtain the depth of the background interaction surface.
S2: sort the background depth values, select the middle 70% of the data points, and calculate the mean depth of these points as the depth of the current interactive surface; that is, the system averages the depths of the middle 70% of the point set in the circular area and uses it as the depth of the background interaction surface, as sketched below.
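A minimal sketch of the local point set calculation described above, assuming the sampled area is a circle of fixed pixel radius around the fingertip bone node (the radius value is an assumption; the application derives the area from the fingertip coordinates and the adjacent joint):

    import numpy as np

    def local_surface_depth(depth_map, fingertip_xy, radius=20):
        """Estimate the interaction-surface depth near one fingertip bone node.

        depth_map    : (H, W) depth image, e.g. in millimetres
        fingertip_xy : (x, y) pixel coordinate of the fingertip bone node
        radius       : assumed radius (pixels) of the sampled circular area
        """
        h, w = depth_map.shape
        x0, y0 = fingertip_xy
        ys, xs = np.ogrid[:h, :w]
        mask = (xs - x0) ** 2 + (ys - y0) ** 2 <= radius ** 2
        samples = np.sort(depth_map[mask].astype(np.float32))
        lo = int(len(samples) * 0.15)            # drop the lowest 15 %
        hi = int(len(samples) * 0.85)            # drop the highest 15 %
        return float(samples[lo:hi].mean())      # mean of the middle 70 % = first depth value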
On the basis of any of the above embodiments, step S202: starting one or a group of pre-erected camera devices for image acquisition comprises the following steps:
step S202A: starting a pre-set camera device, wherein the camera device simultaneously collects visual image information and depth image information, and the camera device simultaneously collects the visual image information and the depth image information comprises: the method comprises the steps that firstly, visual image information is collected by the camera device and processed to generate corresponding thermodynamic diagram data, the data of a pixel position in the thermodynamic diagram data is the probability of the depth information of the pixel position, and the depth image information corresponding to the visual image information is determined according to the probability; or
Step S202B: starting a group of pre-erected camera devices, wherein the group of camera devices comprise RGB cameras and depth cameras, the RGB cameras are used for collecting visual image information, and the depth cameras are used for collecting depth image information; or
Step S202C: starting a group of pre-erected camera devices, wherein the group of camera devices includes a binocular depth camera, and the binocular depth camera simultaneously collects visual image information and depth image information.
In an alternative solution, fig. 9 is a flowchart of another user gesture recognition method according to an embodiment of the present application, where a scheme of the method includes a visual information collection unit (a lens combination of an RGB camera and a 3D structured light camera or a binocular camera), a projection control unit (a projector), and a calculation analysis unit (a calculation board), as shown in fig. 9, and the method includes:
S1: initialize the projector and project the current operation interface information on the corresponding plane. The RGB camera and the depth camera start working, and the picture information stream is continuously input into the system.
S2: the hand of the user enters the visual field range of the RGB camera, the RGB camera continuously acquires hand pictures, 21 key bone joint points of the hand are analyzed through system processing, and five key bone joint points of fingertips are tracked.
S3: when the user clicks on the interactive interface, the depth difference between the fingertip and the desktop reaches a set threshold (about 6-10 mm). The depth camera continuously collects depth information within its range, and the system judges whether a click event has occurred by comparing the depth value at the coordinate of the fingertip key bone node with the depth value of the desktop.
S4: the computing board analyzes that a certain position is a click event of the user, acquires click information of previous frames from storage, and further analyzes the user action.
S5: and the computing board further calculates according to the current user state and action to obtain the method called by the current user, and simultaneously transmits the change required to be made by the projector to the projector.
S6: the projector updates the projection content according to the information transmitted by the computing board.
In an alternative, the flow of the user gesture recognition method may further include:
S1: initialize the projector and project the current operation interface information on the corresponding plane. The RGB camera and the depth camera start working, and the picture information stream is continuously input into the system.
The first step is as follows: initialize the projector, focus, perform trapezoidal (keystone) correction, and perform coincidence and calibration checks of the picture signals until the projection is clear, then display the loading operation interface.
The second step is that: the RGB camera and the depth camera are initialized through calls to OpenCV and OpenNI respectively, and continuously input images into the system (a sketch of such initialization is given after the third step).
The third step: the projector acquires the current user setting from the computing board and projects a correct user operation interface.
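A minimal sketch of the camera initialization mentioned in the second step, assuming the depth camera is OpenNI2-compatible and accessed through OpenCV's OpenNI backend; device indices and backend availability depend on the actual hardware and OpenCV build:

    import cv2

    def open_cameras(rgb_index=0):
        """Open the RGB camera and an OpenNI2-compatible depth camera."""
        rgb_cap = cv2.VideoCapture(rgb_index)              # colour stream via OpenCV
        depth_cap = cv2.VideoCapture(cv2.CAP_OPENNI2)      # depth stream via the OpenNI2 backend
        if not (rgb_cap.isOpened() and depth_cap.isOpened()):
            raise RuntimeError("camera initialization failed")
        return rgb_cap, depth_cap

    def read_frames(rgb_cap, depth_cap):
        """Grab one colour frame and one depth frame (depth map in millimetres)."""
        ok_rgb, rgb = rgb_cap.read()
        depth_cap.grab()
        ok_depth, depth = depth_cap.retrieve(flag=cv2.CAP_OPENNI_DEPTH_MAP)
        if not (ok_rgb and ok_depth):
            raise RuntimeError("frame acquisition failed")
        return rgb, depth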
S2: the hand of the user enters the visual field range of the RGB camera, the RGB camera continuously acquires hand pictures, 21 key bone joint points of the hand are analyzed through system processing, and five key bone joint points of fingertips are tracked.
The first step is as follows: the RGB camera acquires a hand picture, and the palm area in the picture is detected and separated by a palm detection algorithm.
The second step is that: the system analyzes 21 key bone nodes of the palm detected in the first step through a hand bone joint point calibration algorithm, and stores coordinates of the 21 nodes.
The third step: when the user's hand moves, the system tracks the five fingertip key nodes and updates the node coordinate queue in real time according to the changes of the nodes (see the sketch after this step).
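The application describes its own hand detection and convolutional bone-node calibration algorithms; purely as a stand-in to illustrate the 21-node / five-fingertip bookkeeping and the coordinate queue, a sketch using the off-the-shelf MediaPipe Hands model could look like this (the queue length of 30 frames is an assumption):

    import cv2
    import mediapipe as mp
    from collections import deque

    FINGERTIP_IDS = (4, 8, 12, 16, 20)        # thumb, index, middle, ring, little fingertips
    hands = mp.solutions.hands.Hands(max_num_hands=1)
    tip_history = {i: deque(maxlen=30) for i in FINGERTIP_IDS}   # per-fingertip coordinate queue

    def update_fingertips(bgr_frame):
        """Locate 21 hand landmarks and push the 5 fingertip pixels into their queues."""
        h, w = bgr_frame.shape[:2]
        result = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
        if not result.multi_hand_landmarks:
            return None                                          # no hand in the field of view
        landmarks = result.multi_hand_landmarks[0].landmark      # 21 normalized key nodes
        tips = {}
        for i in FINGERTIP_IDS:
            x, y = int(landmarks[i].x * w), int(landmarks[i].y * h)
            tip_history[i].append((x, y))                        # real-time coordinate queue
            tips[i] = (x, y)
        return tips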
S3: when the user clicks on the interactive interface, the depth difference between the fingertip and the desktop reaches a set threshold (about 5-8 mm). When the RGB + depth camera method is adopted, the depth camera continuously collects depth information within its range, and the system judges whether a click event has occurred by comparing the depth value at the coordinate of the fingertip key bone node with the depth value of the desktop.
The first step is as follows: when the user presses the position to be interacted with using a finger or another occluding object, the depth difference from the interaction surface reaches a threshold value, usually 6-10 mm.
The second step is that: the system tracks the user's fingertip; when the difference between the depth at the fingertip coordinate and the depth of the desktop reaches the threshold preset by the system, the click is confirmed as a valid click event and the current click coordinate is stored, as sketched below.
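A minimal sketch of the click test in the second step, assuming a depth map in millimetres and using 8 mm (inside the 6-10 mm band above) as the preset threshold:

    click_log = []   # stored click coordinates, used later for multi-frame action analysis

    def check_click(depth_map, fingertip_xy, surface_depth_mm, threshold_mm=8):
        """Confirm a valid click when the fingertip depth is within threshold_mm of the surface."""
        x, y = fingertip_xy
        fingertip_depth_mm = float(depth_map[y, x])
        if abs(surface_depth_mm - fingertip_depth_mm) <= threshold_mm:
            click_log.append(fingertip_xy)       # store the current click coordinate
            return True
        return False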
S4: the computing board analyzes that a certain position is a click event of the user, acquires click information of previous frames from storage, and further analyzes the user action.
The first step is as follows: after the click event is judged, click information of the previous frames is acquired from the storage in order to analyze the specific action of the user, and the information is also used as source data for the next analysis. FIG. 8 shows the system determining the click event and making a corresponding response.
S5: and the computing board further calculates according to the current user state and action to obtain the method called by the current user, and simultaneously transmits the change required to be made by the projector to the projector.
The first step is as follows: the computing board analyzes the finger action of the user by using the multi-frame state information and further obtains the trajectory information of the user; the specific flow of the multi-frame state processing is shown in the basic expansion flow and its supplementary flow.
The second step is that: acquire the projection content of the current projector, judge the function related to the pressing position by combining the trajectory information, and identify the function at the click position.
The third step: after learning the function involved, if the function is an event call, the call is made and the call information is recorded; if it is a pure label addition, the label information is recorded.
The fourth step: the computing board transmits the call information or the label information generated by the user action to the projector, as sketched below.
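The split between event calls and pure label additions in the third and fourth steps can be sketched as follows; the UI-element rectangles, the element list and the two record types are hypothetical placeholders for whatever the projected interface actually contains:

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class UiElement:
        x: int                                    # projected bounding box, in pixels
        y: int
        w: int
        h: int
        on_press: Optional[Callable] = None       # event call bound to this element, if any

        def contains(self, px, py):
            return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def resolve_press(elements, press_xy, trajectory):
        """Map a press position to call information or label information for the projector."""
        px, py = press_xy
        for element in elements:
            if element.contains(px, py) and element.on_press is not None:
                element.on_press()                                   # make the call
                return {"type": "call", "element": element}          # call information
        return {"type": "label", "points": list(trajectory)}         # label (mark) information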
S6: the projector updates the projection content according to the information transmitted by the computing board.
The first step is as follows: the projector acquires the information of the computing board in real time, and compares the acquired information with the configuration of the information in the computing board to obtain the type of the information.
The second step is that: if it is label information, the corresponding content is directly drawn on the projection content.
The third step: if it is call information, the projection interface that needs to be updated is acquired by using the storage function of the computing board and is displayed.
In one alternative, the default implementation hardware of the system is an RGB camera in combination with a 3D structured light depth camera. The RGB camera is responsible for shooting color images and providing hand photos for node analysis. The 3D structured light depth camera aims to realize the acquisition of depth information and further judge the depth difference between a user fingertip and an interactive surface.
In one alternative, a binocular depth camera is used to take on the work of both the RGB camera and the 3D structured light camera, i.e., to handle the acquisition of color images and depth images at the same time. The advantage is that there is no minimum height requirement above the desktop (the 3D structured light camera must be at least about 40 cm from the desktop), and color and depth data can be acquired simultaneously.
The detailed process comprises the following steps:
S1: initialize the binocular camera, acquire the hand posture information of the current user in real time, and perform hand delineation by using the computing board.
The first step is as follows: shoot the current scene with the binocular camera to obtain the left and right pictures of the current scene, and perform preprocessing and slight correction (rectification) of the images.
The second step is that: after the processed pictures are obtained, the positions of all bone joint points of the hand are predicted and tracked by further utilizing a hand detection algorithm and a hand bone node calibration algorithm based on a convolutional neural network, so that the current hand posture of the user is obtained, stored, and kept for use in the next step.
S2: when the computing board analyzes the user action in a certain frame as a press on the plane, it acquires the user action information of the previous frames from storage and further analyzes the user action.
The first step is as follows: calculate the depth of the scene objects in real time by using the two pictures acquired by the binocular camera; if the difference between the depth of a hand bone joint point and that of the projection plane is judged to be less than 5 mm, the user action is judged to be a press on the plane (a sketch of the depth calculation follows the second step).
The second step is that: after the pressing event is judged, in order to analyze the specific action of the user, the action information of the user in the previous frames is obtained from storage, and this information also serves as the source data for the next analysis.
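A sketch of the real-time depth computation from the binocular pair mentioned in the first step, using OpenCV's semi-global block matching on a rectified left/right pair; the focal length and baseline values are hypothetical calibration numbers, not values from this application:

    import cv2
    import numpy as np

    def stereo_depth_mm(left_gray, right_gray, focal_px=700.0, baseline_mm=60.0):
        """Compute a depth map in millimetres from a rectified grayscale stereo pair."""
        matcher = cv2.StereoSGBM_create(minDisparity=0,
                                        numDisparities=96,           # must be a multiple of 16
                                        blockSize=7)
        disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
        disparity[disparity <= 0] = np.nan                           # mask invalid matches
        return focal_px * baseline_mm / disparity                    # depth = f * B / d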
S3: and the computing board further calculates according to the current user state and action to obtain the method called by the current user, and simultaneously transmits the change required to be made by the projector to the projector.
The first step is as follows: the computing board analyzes the specific hand motion of the user by using the multi-frame state information, and further obtains the track information of the plane pressing part of the user.
The second step is that: acquire the projection content of the current projector, obtain the pressing event position (generally a fingertip) judged by the computing board, judge the function related to that position by combining the trajectory information, and then identify the function at the current position.
The third step: after learning the function involved, if the function is an event call, the call is made and the call information is recorded; if it is a pure label addition, the label information is recorded.
The fourth step: the computing board transmits the call information or the label information generated by the user action to the projector.
The technical problem mainly solved by the present application is the aforementioned defects of infrared touch control: being limited by the plane, the equipment and the distance. Through the three-dimensional touch technology, a depth camera can be matched with an RGB camera (or only one binocular camera is used; the main method flow is based on the RGB camera and the depth camera, and the binocular camera scheme is an extension), joint point recognition and depth recognition can be combined, and the touch function that originally required dual devices in the infrared touch technology can be realized. Meanwhile, due to the superiority of depth recognition, performance better than infrared touch can be achieved in remote-control scenarios, which is more favorable for recognizing the touch operation of a user at a distance and avoids the influence of distance on touch recognition. After the touch behavior of the user is recognized, the method can be applied to projection touch interaction, user action recognition and other methods, realizing innovation in the field of touch recognition.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the user gesture recognition method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present application.
Example 2
According to the embodiment of the present application, there is also provided a user gesture recognition apparatus for implementing the user gesture recognition method, where the apparatus is implemented in a mobile terminal, a computer, a projector, and other devices in a software or hardware manner, and all of the devices may be implemented based on the computer terminal described in fig. 1.
As shown in fig. 10, the user gesture recognition apparatus 1000 includes:
the camera module 1002 comprises one or a group of pre-installed camera devices for image acquisition, wherein an interactive surface is arranged in a visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand;
a reading module 1004, configured to read image information acquired by the one or the group of cameras, where the image information at least includes visual image information and depth image information;
a first judging module 1006, configured to judge whether a user hand image is included in the visual field range according to the visual image information, and if yes, locate a skeleton node and a fingertip node of a user hand in the user hand image;
a second determining module 1008, configured to determine whether a gesture operation of the user is performed on the interactive surface according to the depth image information and the fingertip node, and if so, identify a gesture of the user according to the image information to perform a user-specified operation.
Here, it should be noted that the image capturing module 1002, the reading module 1004, the first determining module 1006, and the second determining module 1008 correspond to steps S202 to S208 in embodiment 1, and the four modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
The apparatus includes various corresponding functional modules for implementing the process steps in any one of the embodiments or optional manners in embodiment 1, which are not described in detail herein.
Example 3
Embodiments of the present application may provide a computing device, which may be any one of computer terminal devices in a computer terminal group. Optionally, in this embodiment, the computing device may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computing device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the above-mentioned computing device includes one or more processors, a memory, and a transmission device. The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the methods and apparatus of the embodiments of the present application. The processor executes various functional applications and data processing by executing software programs and modules stored in the memory, namely, the method is realized.
In some examples, the memory may further include memory located remotely from the processor, which may be connected to the computing device 120 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In this embodiment, when the processor in the above-mentioned computing device runs the stored program code, the following method steps may be executed: starting one or a group of pre-erected camera devices to acquire images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand; reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image; and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the operation specified by the user.
Further, in this embodiment, when the processor in the computing device runs the stored program code, any method step listed in embodiment 1 may be executed, which is not described in detail herein for reasons of brevity.
Example 4
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program codes executed by the user gesture recognition method.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: starting one or a group of pre-erected camera devices to acquire images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand; reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image; and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the operation specified by the user.
Further, in this embodiment, the storage medium is configured to store the program code for executing any one of the method steps listed in embodiment 1, which is not described in detail herein for brevity.
Example 5
According to an embodiment of the present application, there is also provided a user gesture recognition system, which can execute the user gesture recognition method provided in embodiment 1, and the system includes: one or a group of pre-erected camera devices used for acquiring images, wherein an interaction surface is arranged in the visual field range of one camera device or the at least partially overlapped visual field ranges of the group of camera devices, so that a user can perform gesture operation on the interaction surface by hand; a pre-erected projection device used for projecting a user interaction interface on the interaction surface, wherein at least one target object operable by the user is displayed on the user interaction interface; a memory storing instructions; and a processor executing the instructions to perform: reading image information acquired by the one or the group of camera devices, wherein the image information at least comprises visual image information and depth image information; judging whether a hand image of the user is contained in the visual field range according to the visual image information, and if so, locating skeleton nodes and fingertip nodes of the user's hand in the hand image; and judging whether the gesture operation of the user is directed at the interactive surface according to the depth image information and the fingertip nodes, and if so, recognizing the gesture of the user according to the image information to execute the user-specified operation.
Further, in this embodiment, the processor may execute the instructions to implement any one of the method steps listed in embodiment 1, which is not described in detail herein.

Claims (16)

1. A method for recognizing a user gesture, the method comprising:
starting one or a group of pre-erected camera devices to acquire images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand;
reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information;
judging whether a user hand image is contained in the visual field range or not according to the visual image information, if so, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image;
and judging whether the gesture operation of the user aims at the interactive surface or not according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the operation specified by the user.
2. The method of claim 1, further comprising:
and starting the projection device to project a user interaction interface on the interaction surface, wherein at least one target object which can be operated by a user is displayed on the user interaction interface.
3. The method of claim 1, wherein determining whether a user hand image is included in the field of view based on the visual image information comprises:
establishing a neural network model, wherein the neural network model comprises a feature extraction convolution layer, a candidate region extraction network, a candidate region pooling layer and a classifier;
inputting the visual image information from the feature extraction convolution layer to the neural network model;
the characteristic extraction convolution layer is used for processing picture frames in the input visual image information into a characteristic diagram;
the candidate region extraction network is used for setting a preset number of candidate regions on the feature map and judging the candidate regions to be positive samples or negative samples by utilizing the depth network, wherein the positive samples indicate that the candidate regions contain the hand images of the user, and the negative samples indicate that the candidate regions do not contain the hand images of the user;
the candidate region pooling layer is used for pooling the feature map judged as the positive sample into an image with a fixed size;
the classifier is used for classifying the pooled images so as to judge whether the visual image information input to the neural network model contains the user hand image according to the classification result.
4. The method of claim 3, wherein locating skeletal nodes and fingertip nodes of a user hand in the user hand image comprises:
predefining a hand area with a preset number of skeleton nodes;
marking each bone node of the preset number in the hand image;
predicting a thermodynamic diagram for each bone node according to the hand image, wherein data of a pixel position in the thermodynamic diagram is the probability that the pixel position belongs to a certain bone node;
determining the position of each bone node according to the probability;
and determining a skeleton node at the end of the hand as the fingertip node.
5. The method of claim 4, wherein predicting a thermodynamic diagram for each skeletal node from the hand image, wherein the data for a pixel location in the thermodynamic diagram is a probability that the pixel location belongs to a skeletal node, comprises:
processing the hand image by using an hourglass framework to obtain annotation data G corresponding to each pixel point and a predicted position P of a skeletal node k in the hand image;
generating a thermodynamic diagram H_k of the skeletal node k based on the following formula:
H_k(x) = exp(-||x - P||^2 / σ^2),
wherein the thermodynamic diagram is a probability map corresponding to the hand image, the data at each pixel position x in the probability map being the probability that the current pixel belongs to a certain skeletal node, and σ^2 is a control parameter of the local action range of the thermodynamic diagram, whose value is set to the square of the bandwidth parameter of a Gaussian kernel function;
and correcting the predicted position P based on the annotation data G and the thermodynamic diagram H_k to obtain a corrected predicted position L.
6. The method of claim 1, wherein determining whether the gesture operation of the user is directed to the interactive surface according to the depth image information and the fingertip node comprises:
determining a first depth value of the interaction surface from the depth image information;
determining a second depth value of the fingertip node according to the depth image information and the coordinate of the fingertip node;
when the difference value between the first depth value and the second depth value is smaller than a preset threshold value, determining that the gesture operation of the user aims at the interactive surface.
7. The method of claim 2, wherein if yes, recognizing a user gesture according to the image information to perform a user-specified operation comprises:
when the gesture operation of the user is determined to be directed to the interactive surface, determining a target object selected by the gesture operation of the user on the interactive surface, and reading multi-frame image information before and after the moment when the gesture operation of the user is performed on the interactive surface;
determining the complete gesture action of the user according to the front and back multiframe image information;
determining an operation instruction corresponding to the complete gesture action according to a corresponding relation between the pre-stored complete gesture action and the operation instruction;
and executing the operation instruction aiming at the target object selected by the user gesture operation so as to enable the projection device to change the user interaction interface.
8. The method of claim 7, wherein the complete gesture action comprises at least a first gesture action and a second gesture action, the first gesture action is used for marking a target object displayed on the user interaction interface, the second gesture action is used for updating the target object displayed on the user interaction interface, and wherein executing the operation instruction to control a projection device to change the target object displayed on the user interaction interface comprises:
when the complete gesture action of the user is determined to be the first gesture action, executing a first operation instruction corresponding to the first gesture action, and drawing a corresponding marker pattern at a target object designated by the user on the user interaction interface;
and when the complete gesture action of the user is determined to be the second gesture action, executing a second operation instruction corresponding to the second gesture action, reading user interaction interface data needing to be updated, and projecting the user interaction interface data needing to be updated.
9. The method of any of claims 6-8, wherein the interactive surface is a physical surface, and wherein determining a first depth value of the interactive surface from the depth image information comprises:
modeling the physical surface, and determining a first depth value of the physical surface according to the depth image information and the modeling information.
10. The method according to any of claims 6-8, wherein the interactive surface is a virtual surface, and wherein determining a first depth value of the interactive surface from the depth image information comprises:
extracting a rolling window from the depth image information;
calculating the depth mean value and the depth standard deviation of each pixel point according to the rolling window;
judging whether the depth standard deviation is larger than a preset threshold value or not;
if yes, returning to the step of extracting a rolling window from the depth image information;
if not, modeling is carried out on the virtual surface based on the depth average value, and the depth average value of each pixel point is determined as a first depth value of the virtual surface.
11. The method according to any of claims 6-8, wherein the interactive surface is a physical surface or a virtual surface, and wherein determining a first depth value of the interactive surface from the depth image information comprises:
locating a first position of a fingertip node of the user's hand on the interaction surface;
defining a first local area containing the first location;
determining the depth mean value of each pixel point in the first local area according to the depth image information;
and determining the depth mean value of the first local area according to the depth mean value of each pixel point, and taking the depth mean value as the first depth value.
12. The method of claim 1, wherein initiating a pre-established camera or group of cameras for image acquisition comprises:
starting a pre-set camera device, wherein the camera device simultaneously collects visual image information and depth image information, and the camera device simultaneously collects the visual image information and the depth image information comprises: the method comprises the steps that firstly, visual image information is collected by the camera device and processed to generate corresponding thermodynamic diagram data, the data of a pixel position in the thermodynamic diagram data is the probability of the depth information of the pixel position, and the depth image information corresponding to the visual image information is determined according to the probability; or
Starting a group of pre-erected camera devices, wherein the group of camera devices comprise RGB cameras and depth cameras, the RGB cameras are used for collecting visual image information, and the depth cameras are used for collecting depth image information; or
starting a group of pre-erected camera devices, wherein the group of camera devices includes a binocular depth camera, and the binocular depth camera simultaneously collects visual image information and depth image information.
13. A user gesture recognition apparatus, comprising:
the camera module comprises one or a group of pre-erected camera devices and is used for acquiring images, wherein an interactive surface is arranged in the visual field range of one camera device or at least partially overlapped visual field ranges of a group of camera devices, so that a user can perform gesture operation on the interactive surface by using a hand;
the reading module is used for reading image information acquired by the one or the group of camera devices, wherein the image information at least comprises visual image information and depth image information;
the first judgment module is used for judging whether the visual field range contains a user hand image or not according to the visual image information, and if yes, positioning skeleton nodes and fingertip nodes of a user hand in the user hand image;
and the second judging module is used for judging whether the gesture operation of the user aims at the interactive surface according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the specified operation of the user.
14. A user gesture recognition system, the system comprising:
one or a group of pre-erected camera devices used for acquiring images, wherein an interaction surface is arranged in a visual field range of one camera device or at least partially overlapped visual field ranges of the group of camera devices, so that a user can perform gesture operation on the interaction surface by using a hand;
the pre-erected projection device is used for projecting a user interaction interface on the interaction surface, and at least one target object which can be operated by a user is displayed on the user interaction interface;
a memory storing instructions;
a processor executing the instructions to perform:
reading image information acquired by the one or a group of camera devices, wherein the image information at least comprises visual image information and depth image information,
a first judging module, configured to judge whether the visual field range includes a user hand image according to the visual image information, if so, locate a skeleton node and a fingertip node of a user hand in the user hand image,
and the second judging module is used for judging whether the gesture operation of the user aims at the interactive surface according to the depth image information and the fingertip node, and if so, identifying the gesture of the user according to the image information to execute the specified operation of the user.
15. A storage medium, characterized in that the storage medium comprises a stored program, wherein the device on which the storage medium is located is controlled to perform the method according to any of claims 1-12 when the program is run.
16. A computing device comprising a processor, wherein the processor is configured to execute a program, wherein the program when executed performs the method of any of claims 1-12.
CN202210117079.0A 2022-02-07 2022-02-07 User gesture recognition method, device and system, storage medium and computing equipment Active CN114138121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210117079.0A CN114138121B (en) 2022-02-07 2022-02-07 User gesture recognition method, device and system, storage medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210117079.0A CN114138121B (en) 2022-02-07 2022-02-07 User gesture recognition method, device and system, storage medium and computing equipment

Publications (2)

Publication Number Publication Date
CN114138121A true CN114138121A (en) 2022-03-04
CN114138121B CN114138121B (en) 2022-04-22

Family

ID=80382267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210117079.0A Active CN114138121B (en) 2022-02-07 2022-02-07 User gesture recognition method, device and system, storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN114138121B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120307009A1 (en) * 2011-05-31 2012-12-06 Altek Corporation Method and apparatus for generating image with shallow depth of field
US20160357387A1 (en) * 2015-06-07 2016-12-08 Apple Inc. Devices and Methods for Capturing and Interacting with Enhanced Digital Images
CN110209273A (en) * 2019-05-23 2019-09-06 Oppo广东移动通信有限公司 Gesture identification method, interaction control method, device, medium and electronic equipment
CN111158467A (en) * 2019-12-12 2020-05-15 青岛小鸟看看科技有限公司 Gesture interaction method and terminal
CN111158489A (en) * 2019-12-31 2020-05-15 上海佑久健康科技有限公司 Camera-based gesture interaction method and system
CN111258411A (en) * 2020-05-06 2020-06-09 北京深光科技有限公司 User interaction method and device
CN111949134A (en) * 2020-08-28 2020-11-17 深圳Tcl数字技术有限公司 Human-computer interaction method, device and computer-readable storage medium
CN112651298A (en) * 2020-11-27 2021-04-13 深圳点猫科技有限公司 Point reading method, device, system and medium based on finger joint positioning
CN113238650A (en) * 2021-04-15 2021-08-10 青岛小鸟看看科技有限公司 Gesture recognition and control method and device and virtual reality equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115190288A (en) * 2022-06-23 2022-10-14 海信电子科技(深圳)有限公司 Method and device for synchronously acquiring images by multiple cameras
CN115190288B (en) * 2022-06-23 2023-04-25 海信电子科技(深圳)有限公司 Method and equipment for synchronously acquiring images by multiple cameras
CN115909406A (en) * 2022-11-30 2023-04-04 广东海洋大学 Gesture recognition method based on multi-class classification
CN116330305A (en) * 2023-05-30 2023-06-27 常州旭泰克***科技有限公司 Multi-mode man-machine interaction assembly method, system, equipment and medium thereof
CN116330305B (en) * 2023-05-30 2023-10-31 常州旭泰克***科技有限公司 Multi-mode man-machine interaction assembly method, system, equipment and medium thereof
CN117809380A (en) * 2024-02-29 2024-04-02 万有引力(宁波)电子科技有限公司 Gesture tracking method, gesture tracking device, gesture tracking apparatus, gesture tracking program product and readable storage medium
CN117809380B (en) * 2024-02-29 2024-05-14 万有引力(宁波)电子科技有限公司 Gesture tracking method, gesture tracking device, gesture tracking apparatus, gesture tracking program product and readable storage medium

Also Published As

Publication number Publication date
CN114138121B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN114138121B (en) User gesture recognition method, device and system, storage medium and computing equipment
US11650659B2 (en) User input processing with eye tracking
CN111062312B (en) Gesture recognition method, gesture control device, medium and terminal equipment
CN108960163B (en) Gesture recognition method, device, equipment and storage medium
EP2790089A1 (en) Portable device and method for providing non-contact interface
CN106774850B (en) Mobile terminal and interaction control method thereof
US20120326995A1 (en) Virtual touch panel system and interactive mode auto-switching method
CN105229582A (en) Based on the gestures detection of Proximity Sensor and imageing sensor
US20120163661A1 (en) Apparatus and method for recognizing multi-user interactions
CN111401318B (en) Action recognition method and device
CN112492201B (en) Photographing method and device and electronic equipment
CN112486394A (en) Information processing method and device, electronic equipment and readable storage medium
CN107797748B (en) Virtual keyboard input method and device and robot
US20160140762A1 (en) Image processing device and image processing method
CN106569716B (en) Single-hand control method and control system
CN114581535B (en) Method, device, storage medium and equipment for marking key points of user bones in image
CN108227923A (en) A kind of virtual touch-control system and method based on body-sensing technology
CN115061577B (en) Hand projection interaction method, system and storage medium
CN111782041A (en) Typing method and device, equipment and storage medium
CN116301551A (en) Touch identification method, touch identification device, electronic equipment and medium
CN114333056A (en) Gesture control method, system, equipment and storage medium
CN109725722B (en) Gesture control method and device for screen equipment
CN113282164A (en) Processing method and device
KR101386655B1 (en) 3d space touch system and method
Annabel et al. Design and Development of Multimodal Virtual Mouse

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant