CN115565241A - Gesture recognition object determination method and device - Google Patents

Gesture recognition object determination method and device

Info

Publication number
CN115565241A
Authority
CN
China
Prior art keywords
user
potential
image
target user
gesture recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111034365.2A
Other languages
Chinese (zh)
Inventor
黄允臻
王浩
李冬虎
冷继南
常胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/CN2022/078623 (published as WO2023273372A1)
Publication of CN115565241A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a gesture recognition object determination method and device, belonging to the field of computer vision. First, a device determines one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting that area, where a potential user satisfies the following condition: each frame of the multiple frames of first images includes a facial image of the potential user. The device then determines the hand motion of a potential user according to the region to be recognized corresponding to that user in the multiple frames of first images, where the region to be recognized corresponding to the potential user contains the potential user's hand image. Finally, a target user among the one or more potential users, whose hand motion matches a preset gesture, is determined as the gesture recognition object. In this way the gesture recognition object can be determined automatically from images captured by the camera, and air gesture operation by that object can then be realized; the scheme suits gesture recognition in various scenarios, especially multi-user scenarios, and is simple to implement.

Description

Gesture recognition object determination method and device
The present application claims priority to Chinese Patent Application No. 202110736357.6, entitled "Method, Apparatus and System for Gesture Recognition", filed on June 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a method and an apparatus for determining a gesture recognition object.
Background
In the field of computer vision, gesture recognition is a very important mode of human-computer interaction. Gesture recognition technology uses various sensors to model the shape, displacement, and other attributes of the hand (or arm) into an information sequence, which is then converted into corresponding instructions to control certain operations.
Because the gestures of multiple users may all be recognized in a multi-user scene, determining which of those users is the gesture recognition object is key to realizing accurate gesture control.
Disclosure of Invention
The application provides a gesture recognition object determining method and device.
In a first aspect, a gesture recognition object determination method is provided. The method may be applied to a general-purpose computing device and comprises the following steps. One or more potential users in a shooting area are determined according to multiple frames of first images obtained by a camera shooting that area, where a potential user satisfies the following condition: each frame of the multiple frames of first images includes a facial image of the potential user. The hand motion of the potential user is determined according to the region to be recognized corresponding to the potential user in the multiple frames of first images, where that region includes the potential user's hand image. A target user among the one or more potential users, whose hand motion matches a preset gesture, is determined as the gesture recognition object.
In the method and device of the application, a user whose facial image appears in every frame of the multiple frames of images captured by the camera and whose hand motion matches the preset gesture is determined as the gesture recognition object in the camera's shooting area. The gesture recognition object can be determined automatically from images captured by the camera, and gesture recognition can then be performed on that object to realize air gesture operation; the scheme suits gesture recognition in various scenarios, especially multi-user scenarios, and is simple to implement.
Optionally, after a target user among the one or more potential users is determined as the gesture recognition object, the method further includes: acquiring the region to be recognized corresponding to the target user in multiple frames of second images, where that region includes the target user's hand image and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and performing gesture recognition on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images. This can be understood as: only the region to be recognized corresponding to the target user is acquired in the multiple frames of second images, and gesture recognition is performed only on the target user.
In the method and device of the application, after the target user is determined as the gesture recognition object, only the target user undergoes gesture recognition for a period of time, and no gesture recognition is performed on other users; that is, recognition is locked onto one user's gestures for a period of time, which avoids the problem that mutual interference between users' gestures prevents accurate gesture control.
Optionally, the preset gesture includes an initial portion of the gesture to be recognized, and the implementation process of performing gesture recognition on the target user according to the region to be recognized corresponding to the target user in the multiple frames of second images includes: and judging whether the target user executes the gesture to be recognized or not according to the region to be recognized corresponding to the target user in the multi-frame first image and the region to be recognized corresponding to the target user in the multi-frame second image.
In the application, the initial portion of the gesture to be recognized serves as the preset gesture for determining the gesture recognition object. When a user wants to perform an air gesture operation, the user directly performs the gesture to be recognized in the shooting area of the camera; no separate wake-up gesture is needed to activate the device's gesture recognition function, and the gesture recognition object is determined without the user being aware of it, which simplifies operation and improves user experience.
Optionally, the implementation process of obtaining the to-be-identified region corresponding to the target user in the multiple frames of second images includes: and determining the position of the face image of the target user in the second image according to the stored face information of the target user. And determining the area to be recognized corresponding to the target user in the second image according to the position of the face image of the target user in the second image.
In the application, after the target user is determined as the gesture recognition object, the target user's face information can be stored so that the target user's hand motion is associated via that face information, realizing hand tracking of the target user and, further, gesture recognition of the target user.
Optionally, when the number of images not containing the target user's facial image, among the images captured of the shooting area after the target user became the gesture recognition object, exceeds a number threshold, or when the duration for which the target user has served as the gesture recognition object reaches a duration threshold, the target user stops being treated as the gesture recognition object.
In the application, at most one gesture recognition object is determined at any given moment, and the gesture recognition object may change over time. Setting conditions under which a target user stops being the gesture recognition object therefore meets the need for a flexible, changeable gesture recognition object in the application scenario.
Optionally, after determining one or more potential users in the shooting area, the method further includes: the face image position of the potential user in the first image is determined. And determining the region to be identified corresponding to the potential user in the first image according to the position of the face image of the potential user in the first image.
Optionally, the implementation process of determining a target user of the one or more potential users as a gesture recognition object includes: and when a plurality of potential users with hand motions matched with the preset gestures exist in the shooting area, taking the potential user closest to the camera as a target user.
Optionally, after determining one or more potential users in the shooting area, the method further includes: the distance of the potential user from the camera is acquired. When the distance from the potential user to the camera exceeds a distance threshold, a distance prompt is output, and the distance prompt is used for prompting the potential user to approach the camera.
If a potential user is far from the camera, that user's body occupies only a small part of the captured image and the hand details cannot be resolved, which may cause subsequent misjudgment of the user's hand motion.
Optionally, the process of acquiring the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the interocular distance of the potential user in the first image, and a preset user interocular distance.
For example, assume the focal length of the camera is f, the interocular distance of the user in an image (i.e., on the image plane) containing the user's frontal face is M, and the preset user interocular distance is K. Letting d denote the distance from the user to the camera, the similar-triangle principle gives M/f = K/d, from which the distance from the user to the camera is d = (K·f)/M.
In this method, the camera is not required to be a binocular camera or to integrate a depth sensor; even with a monocular camera, the distance from the user to the camera can be determined based on the similar-triangle principle, which is simple to compute and cheap to implement.
Optionally, determining an implementation process of a hand motion of the potential user according to the to-be-identified region corresponding to the potential user in the multi-frame first image, including: and respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain a plurality of groups of hand key point information of the potential users. And determining the hand actions of the potential users according to the plurality of groups of hand key point information of the potential users.
Optionally, the region to be recognized corresponding to the potential user further includes an elbow image of the potential user, and determining the hand motion of the potential user according to the regions to be recognized corresponding to the potential user in the multiple frames of first images includes: performing key point detection on those regions to obtain multiple groups of hand and elbow key point information of the potential user, and determining the hand motion of the potential user according to the multiple groups of hand and elbow key point information.
When a plurality of hand key points of a user are too compact or some hand key points are missing in a detection result, misjudgment or missed judgment of the hand action of the user may be caused.
Optionally, the face image is a front face image. That is, the potential user satisfies: each of the plurality of frames of first images includes a frontal image of the potential user.
Because a user usually faces the camera when performing an air gesture operation, the method and device can also exclude users in the shooting area who are not facing the camera and determine potential users only among the users who face it, which reduces the probability of misjudging the gesture recognition object.
In a second aspect, a gesture recognition object determination apparatus is provided. The apparatus comprises a plurality of functional modules that interact to implement the method of the first aspect and its embodiments described above. The functional modules can be implemented based on software, hardware or a combination of software and hardware, and the functional modules can be arbitrarily combined or divided based on specific implementations.
In a third aspect, a gesture recognition object determination device is provided, including: a processor and a memory;
the memory for storing a computer program, the computer program comprising program instructions;
the processor is configured to invoke the computer program to implement the method in the first aspect and the embodiments thereof.
In a fourth aspect, a computer-readable storage medium is provided, on which instructions are stored, which when executed by a processor, implement the method of the first aspect and its embodiments described above.
In a fifth aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of the first aspect and its embodiments.
In a sixth aspect, a chip is provided, which comprises programmable logic circuits and/or program instructions, and when the chip is run, the method of the first aspect and its embodiments is implemented.
Drawings
Fig. 1 is a schematic view of an application scenario involved in a gesture recognition object determination method provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a gesture recognition object determination method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a distance measurement principle provided in an embodiment of the present application;
FIG. 4 is a schematic view of an image provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a distribution of key points of a hand according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a gesture recognition object determination apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another gesture recognition object determination apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of another gesture recognition object determination apparatus provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of still another gesture recognition object determination apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of still another gesture recognition object determination apparatus according to an embodiment of the present application;
fig. 11 is a block diagram of a gesture recognition object determination device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.
With the development of computer vision technology, products such as mobile phones, electronic screens, and virtual reality (VR) devices have diversified, and demand for interaction between people and machines keeps growing. Because gestures can express rich information in a contact-free way, gesture recognition can be widely applied to products such as human-computer interaction systems, smartphones, and smart televisions. In particular, vision-based gesture recognition requires no extra sensors or markers worn on the hand, is highly convenient, and has broad application prospects in human-machine interaction. All gestures mentioned in the application are non-contact gestures, that is, air gestures.
At present, many scenarios involve a demand for air gesture operation of display devices. For example, in a conference room, a participant may perform operations such as page up, page down, page left, page right, and screen capture on the display screen of a conference terminal from a distance. As another example, in a home setting, a family member may perform air operations such as fast forward, rewind, volume up, volume down, and pause on the picture playing on a smart television. As yet another example, in a classroom, a teacher or student may perform air gesture operations such as scrolling the content shown on a display device up or down.
However, in these scenarios there are usually multiple users in front of the display device. When gesture recognition is based on images acquired by a camera, the gestures of several users may all be recognized: the display device cannot tell which user is actually performing the air gesture operation, and the users' gestures may interfere with one another, so the display device cannot perform gesture control accurately.
Based on this, the present application proposes a scheme for determining a gesture recognition object: face detection is performed on multiple frames of images captured by a camera to identify potential users in the shooting area, and the hand motions of those potential users are judged to determine the gesture recognition object among them. Specifically, a user whose facial image appears in every frame of the multiple frames of images captured by the camera and whose hand motion matches a preset gesture can be determined as the gesture recognition object in the shooting area. The gesture recognition object can thus be determined automatically from images captured by the camera, and gesture recognition can then be performed on that object to realize air gesture operation; the scheme suits gesture recognition in various scenarios, especially multi-user scenarios, and is simple to implement. In addition, while gesture recognition is performed on the gesture recognition object, the gestures of other users are not recognized, which avoids the problem that mutual interference between users' gestures prevents accurate gesture control.
Optionally, in consideration of user operating habits (a user usually faces the camera when performing an air gesture operation), the scheme of the application may further exclude users in the shooting area who are not facing the camera and determine the gesture recognition object only among users who face it. Specifically, a user whose hand motion matches the preset gesture and whose frontal face image appears in every frame of the multiple frames of images captured by the camera may be determined as the gesture recognition object in the shooting area. This reduces the probability of misjudging the gesture recognition object. To further improve accuracy, the operation manual of a display device supporting the gesture control function may state explicitly that the user needs to face the camera when performing an air gesture operation. Facing the camera, as used in the application, does not mean the face is exactly square to the camera; a deviation within a set range is allowed. When a face is exactly square to the camera, the line connecting the two eyes is parallel to the camera's imaging plane. Taking the face deflection angle in that pose as 0°, facing the camera means in the application that the deflection angle of the face relative to the camera lies within a certain range. That is, if a user's face deflection angle is within that range, the user is considered to be facing the camera. For example, a user whose face deflection angle lies in the range of -30° to 30° may be regarded as facing the camera; this range is only an example, and the application may set the face deflection angle range used to judge whether a user faces the camera according to actual requirements. The face of a user facing the camera is referred to herein as a frontal face.
The gesture recognition object determination method provided by the embodiment of the application can be applied to general computing equipment. The general purpose computing device may be a display device or a post-processing end connected to a display device. Wherein the display device supports gesture control functionality. The display device is internally provided with a camera, or the display device is connected with an external camera. The camera is used for shooting a shooting area to obtain an image. The display device or a post-processing end connected with the display device is used for determining a gesture recognition object in a shooting area according to an image shot by the camera and further performing gesture recognition on the gesture recognition object so as to respond to the gesture operation in the air. The deployment orientation of the camera generally coincides with the deployment orientation of the display device, and the shooting area of the camera generally includes the area toward which the display surface of the display device is facing. The post-processing end can be a server, a server cluster formed by a plurality of servers, a cloud computing platform and the like.
The gesture recognition object determination method provided by the embodiment of the application can be applied to various scenes. In a meeting room scenario, the display device may be a conference terminal such as a large screen or an electronic whiteboard. In a home or classroom setting, the display device may be a smart television, projection device, or AR device, among others.
For example, fig. 1 is a schematic view of an application scenario related to a gesture recognition object determination method provided in an embodiment of the present application. The application scenario is a conference room scenario. As shown in fig. 1, the application scenario includes a conference terminal, and a camera is built in the conference terminal. The conference terminal is installed on the wall surface. The shooting area of the camera comprises a conference table and a plurality of conference participants. During the period that the conference terminal starts the gesture control function, the camera continuously shoots the shooting area, and the conference terminal or a post-processing terminal (not shown in the figure) connected with the conference terminal processes the image shot by the camera to determine whether a gesture recognition object exists in the shooting area.
The method flow of the embodiments of the present application is explained below.
Fig. 2 is a schematic flowchart of a gesture recognition object determination method according to an embodiment of the present disclosure. As shown in fig. 2, the method includes:
step 201, determining one or more potential users in a shooting area according to a multi-frame first image obtained by shooting the shooting area by a camera.
A potential user satisfies: each frame of the multiple frames of first images includes a facial image of that user. That is, a user who has a facial image in every one of the multiple frames of first images is taken as a potential user in the shooting area. The number of frames of first images used to determine potential users is configured in advance; for example, the multiple frames of first images may be 3, 5, or 10 frames, and the embodiment of the present application does not limit the number of frames used.
Optionally, face detection is performed on each of the multiple frames of first images to obtain the facial images in each frame of first image. It is then judged which facial images in different first images belong to the same user. Finally, the users whose facial images appear in every frame of the multiple frames of first images are determined, yielding the potential users in the shooting area.
For example, face detection may be performed using a multi-task cascaded convolutional neural network (MTCNN). MTCNN comprises a cascade of three networks: a proposal network (P-Net), a refinement network (R-Net) and an output network (O-Net). The MTCNN-based process for detecting the face of the image comprises the following steps:
first, an image pyramid is built for an input original image. The image pyramid includes a plurality of images of different sizes obtained by scaling the original image. Because the original image may have face images with different sizes, the face images with different sizes in the original image can be detected under the uniform size by establishing the image pyramid, and the robustness of the network to the face images with different sizes is enhanced.
Secondly, the image pyramid is input into the three cascaded networks (P-Net, R-Net and O-Net), which complete coarse-to-fine detection of the face images and finally output a face detection result. P-Net regresses a number of detection boxes from the input image, maps the regressed boxes back into the original image, and removes some redundant boxes with a non-maximum suppression (NMS) algorithm to obtain a preliminary face detection result. R-Net further refines and filters the face detection result output by P-Net. O-Net further refines and filters the result output by R-Net and outputs the final face detection result.
Optionally, the face detection result obtained with MTCNN includes, for each detected face image, face detection box information and face key point information. The face detection box information may include the coordinates of the upper-left and lower-right corners of the face detection box, within which the face image lies. The face key point information may include the coordinates of several face key points, such as the left eye, right eye, nose, and left and right mouth corners.
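For illustration only, a detection call of this shape is available in the facenet-pytorch package's MTCNN implementation; the library choice, file name, and exact output layout are assumptions, since the patent names the network but no concrete library.

```python
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True)  # keep_all=True returns every detected face

img = Image.open("frame_0.jpg")  # hypothetical first image
# boxes: (N, 4) upper-left/lower-right corners [x1, y1, x2, y2]
# probs: (N,) detection confidences
# landmarks: (N, 5, 2) left eye, right eye, nose, left/right mouth corners
boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
```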
After the face detection box information in the multiple frames of first images is obtained, the intersection over union (IoU) of the face detection boxes in every two adjacent frames of first images can be computed, and the face images belonging to the same user in two adjacent first images are determined from the IoU value. The IoU here equals the ratio of the intersection area to the union area of the face detection boxes when the two adjacent first images are superimposed; its value ranges from 0 to 1. For example, let two adjacent frames of first images be image A and image B. When the IoU of a first face detection box in image A and a second face detection box in image B exceeds a preset threshold, the face image in the first box and the face image in the second box can be determined to belong to the same user. If the multiple frames of first images are acquired consecutively by the camera, the preset threshold can be larger, for example 0.8; if they are acquired every other frame, it can be smaller, for example 0.6. The specific value of the preset threshold is not limited in the embodiment of the application.
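A minimal sketch of this IoU computation over corner-format boxes might look as follows; the helper name is an illustration, and the thresholds echo the example values above.

```python
def iou(box_a, box_b):
    """Intersection over union of two face detection boxes, each given
    as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Boxes from adjacent consecutive frames: same user if IoU > 0.8.
same_user = iou((10, 20, 110, 140), (14, 22, 112, 141)) > 0.8
```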
Alternatively, after the facial images in the multiple frames of first images are obtained, which facial images in different first images belong to the same user can be judged by computing the similarity of the facial images.
In the embodiment of the application, the same user identification can be adopted to identify the face images belonging to the same user in different images, and different user identifications can be adopted to identify the face images belonging to different users in the same image. And if each frame of first images in the plurality of frames of first images comprises a face image identified by the same user identifier, determining the user represented by the user identifier as a potential user. The user id used here only needs to distinguish different users, and for example, numbers, characters or other identifiers may be used as the user id. According to the scheme, the user identity does not need to be recognized, but only different users can be distinguished, gesture recognition objects possibly existing in scenes do not need to be preset, and the method and the device can be flexibly applied to various multi-user scenes, especially multi-user scenes with various user groups, such as public meeting rooms.
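As an illustrative sketch of this per-frame bookkeeping, assuming user IDs have already been assigned across frames (for example by the IoU matching above) and that a set per frame is an adequate representation:

```python
def potential_users(frames_user_ids):
    """frames_user_ids: one collection of user IDs per first image.
    A potential user is any ID whose facial image appears in every
    frame of the multiple frames of first images."""
    ids = set(frames_user_ids[0])
    for frame_ids in frames_user_ids[1:]:
        ids &= set(frame_ids)
    return ids

# User 2 is the only ID present in all three first images.
print(potential_users([{1, 2}, {2, 3}, {2, 4}]))  # {2}
```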
Because a user usually faces the camera when performing an air gesture operation, the embodiment of the application may further exclude users in the shooting area who are not facing the camera and determine potential users only among the users who face it, which reduces the probability of misjudging the gesture recognition object.
Optionally, in the condition the potential user must satisfy (each frame of the multiple frames of first images includes a facial image of the potential user), the facial image is a frontal face image. That is, the potential user satisfies: each of the multiple frames of first images includes a frontal face image of the potential user. In other words, a user who has a frontal face image in every one of the multiple frames of first images is taken as a potential user in the shooting area.
In one implementation, the face image may be input into a classification model obtained by pre-training to obtain a classification result output by the classification model, where the classification result indicates whether the input face image belongs to a front face or a side face. The classification model can be obtained by training in a supervised learning mode based on a training sample set. The training sample set may include a large number of sample face images, and each sample face image is labeled with a label indicating whether the sample face image belongs to a front face or a side face.
For example, a lightweight deep neural network such as MobileNetV2 may be used to build the binary classification model; MobileNetV2 is often applied to classification tasks on mobile terminals such as phones. After a face image is input into MobileNetV2, it outputs one of two classification results, which may be represented by 0 and 1, where 0 indicates the input face image belongs to a side face and 1 indicates it belongs to a frontal face.
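A minimal sketch of such a binary front/side classifier follows, assuming a torchvision implementation of MobileNetV2 (the patent names the network but no framework; the 224x224 input size is also an assumption).

```python
import torch
import torch.nn as nn
from torchvision import models

# MobileNetV2 backbone with a two-class head: 0 = side face, 1 = front face.
model = models.mobilenet_v2(weights=None)
model.classifier[1] = nn.Linear(model.last_channel, 2)

# Training would use labelled sample face images (supervised learning);
# at inference, a face crop resized to 224x224 yields a front/side label.
face_crop = torch.randn(1, 3, 224, 224)  # placeholder for a real crop
is_frontal = model(face_crop).argmax(dim=1).item() == 1
```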
In another implementation manner, a face deflection angle range may be preset, and if the face deflection angle of the user is within the face deflection angle range, the user is considered to be facing the camera, that is, the face image of the user in the image is a front face image. After the face image is obtained, face pose estimation can be performed based on the face image to obtain a face deflection angle of a user to which the face image belongs. And if the face deflection angle of the user to which the face image belongs is within the preset face deflection angle range, judging the face image to be a front face image, otherwise, judging the face image to be a side face image.
If a potential user is far from the camera, that user's body occupies only a small part of the captured image and the hand details cannot be resolved, which may cause subsequent misjudgment of the user's hand motion. Therefore, in the embodiment of the application, after the potential users in the shooting area are determined, the distance from a potential user to the camera can be acquired. When that distance exceeds a distance threshold, a distance prompt is output, prompting the potential user to move closer to the camera. If the potential user wants to perform an air gesture operation, the user can approach the camera according to the prompt, which improves the accuracy of gesture recognition object determination and, in turn, the accuracy of recognizing that object's air gestures.
When the scheme of the application is executed by the display device, outputting the distance prompt may mean that the display device displays it. When the scheme is executed by a post-processing end connected to the display device, outputting the distance prompt may mean that the post-processing end sends it to the connected display device so that the display device displays it.
Optionally, the process of acquiring the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the interocular distance of the potential user in the first image, and a preset user interocular distance. Here, the interocular distance of the potential user in the first image may be measured in a first image containing the potential user's frontal face image. The preset user interocular distance is a fixed, preconfigured value; since the actual interocular distances of different users differ little, the average of the actual interocular distances of many users can be selected as the preset value.
For example, fig. 3 is a schematic diagram of the ranging principle provided in an embodiment of the present application. As shown in fig. 3, the focal length of the camera is f, the interocular distance of the user in an image (i.e., on the image plane) containing the user's frontal face is M, and the preset user interocular distance is K. Letting d denote the distance from the user to the camera, the similar-triangle principle gives M/f = K/d, from which the distance from the user to the camera is d = (K·f)/M.
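A worked sketch of this similar-triangle ranging; the 63 mm preset interocular distance is an assumed population average, not a value from the text.

```python
def distance_to_camera(focal_px, eye_dist_px, eye_dist_m=0.063):
    """M / f = K / d  =>  d = K * f / M, with the focal length f and the
    measured interocular distance M both expressed in pixels."""
    return eye_dist_m * focal_px / eye_dist_px

# Eyes 35 px apart in an image from a camera with f = 1400 px:
print(round(distance_to_camera(1400, 35), 2))  # 2.52 (metres)
```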
In the embodiment of the application, the camera is not required to be a binocular camera or to integrate a depth sensor; even with a monocular camera, the distance from the user to the camera can be determined based on the similar-triangle principle, which is simple to compute and cheap to implement.
Optionally, when the camera is a binocular camera, the distance from the potential user to the camera may also be calculated based on the principle of binocular range finding. Alternatively, when the camera is integrated with a depth sensor, the distance of the potential user to the camera may also be measured by the depth sensor. The depth sensor may be an ultrasonic radar, a millimeter wave radar, a laser radar, or a structured light sensor, which is not limited in this application. It should be understood that the depth sensor may be other distance measuring devices.
Step 202, obtaining areas to be identified corresponding to potential users in the multi-frame first images respectively, wherein the areas to be identified corresponding to the potential users comprise hand images of the potential users.
Optionally, each frame of the first image has a region to be identified corresponding to the potential user. Optionally, the area to be identified corresponding to the potential user further includes an image of an elbow of the potential user. The region to be identified in the image according to the embodiment of the present application is a region of interest (ROI) in the image, that is, a region that needs to be processed in the image.
In the process of determining the potential users in the shooting area in step 201, the face images of the potential users in the first images of multiple frames respectively can be obtained. Accordingly, the implementation process of step 202 may include: determining the position of a face image of a potential user in a first image, and determining a region to be recognized corresponding to the potential user in the first image according to the position of the face image of the potential user in the first image. The area to be recognized corresponding to the potential user may include a face image of the potential user in addition to the hand image of the potential user.
Optionally, the face imaging area of the potential user in the first image may be expanded, and the area to be recognized including the hand image and the face image is obtained by cutting. For example, fig. 4 is a schematic image diagram provided in an embodiment of the present application. As shown in fig. 4, the images include a human body image of the user a, a human body image of the user B, a human body image of the user C, and a human body image of the user D. The human body images of the user a and the user B include a front face image, and the human body images of the user C and the user D include a side face image. Assuming that the user A and the user B are potential users in the shooting area, a face imaging area A1 of the user A in the image can be expanded, and an area to be identified (an area A2) corresponding to the user A in the image is obtained by cutting; and expanding a face imaging area B1 of the user B in the image, and cutting to obtain an area to be recognized (an area B2) corresponding to the user B in the image.
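For illustration, a minimal Python sketch of such face-box expansion and cropping might be as follows; the expansion factors and the decision to anchor the ROI at the top of the face are assumptions, since the patent only requires that the region to be recognized contain the hand image.

```python
def roi_from_face(face_box, img_w, img_h, scale_w=3.0, scale_h=4.0):
    """Expand a face box (x1, y1, x2, y2) into a region to be recognized
    that also covers the hand (and optionally elbow) image, clamped to
    the image bounds."""
    x1, y1, x2, y2 = face_box
    w, h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2.0
    rx1 = max(0, int(cx - w * scale_w / 2))
    rx2 = min(img_w, int(cx + w * scale_w / 2))
    ry1 = max(0, int(y1))                    # keep the face at the ROI top
    ry2 = min(img_h, int(y1 + h * scale_h))
    return rx1, ry1, rx2, ry2

# Region A2 cropped around face area A1, in a 1920x1080 first image:
print(roi_from_face((300, 100, 400, 230), 1920, 1080))
```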
And 203, determining hand motions of the potential users according to the areas to be identified corresponding to the potential users in the multi-frame first image.
Optionally, the implementation process of step 203 may include: and respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first image to obtain a plurality of groups of hand key point information of the potential users. And determining the hand motion of the potential user according to the plurality of groups of hand key point information of the potential user.
Here, performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple groups of hand key point information means: key point detection is performed on the region to be recognized corresponding to the potential user in each frame of first image, and each frame yields one group of hand key point information of the potential user.
Optionally, a group of hand key point information includes the positions of a plurality of hand key points and the connection relationships between them. Each hand key point represents a specific part of the hand. For example, fig. 5 is a schematic distribution diagram of hand key points provided in an embodiment of the present application. As shown in fig. 5, the hand may include 21 hand key points: wrist (0), thumb carpometacarpal joint (1), thumb metacarpophalangeal joint (2), thumb interphalangeal joint (3), thumb fingertip (4), index finger metacarpophalangeal joint (5), index finger proximal interphalangeal joint (6), index finger distal interphalangeal joint (7), index fingertip (8), middle finger metacarpophalangeal joint (9), middle finger proximal interphalangeal joint (10), middle finger distal interphalangeal joint (11), middle fingertip (12), ring finger metacarpophalangeal joint (13), ring finger proximal interphalangeal joint (14), ring finger distal interphalangeal joint (15), ring fingertip (16), little finger metacarpophalangeal joint (17), little finger proximal interphalangeal joint (18), little finger distal interphalangeal joint (19), and little fingertip (20).
When the key point detection is performed on the to-be-identified area containing the hand image of the potential user, the 21 hand key points can be detected, or more or fewer hand key points can be detected.
For example, key point detection may be performed on the region to be recognized using a deep-neural-network-based key point detector, which may be implemented with heatmap techniques. The key point detector can detect key points of the region to be recognized in a bottom-up manner. Assuming the detection targets are the 21 hand key points, a heatmap with 21 channels may be generated, each channel being a probability map for one hand key point; the values in the probability map represent the probability of the key point at each position, and the closer a value is to 1, the higher that probability. A vector field map with 21×2 channels is generated at the same time, every 2 channels carrying the two-dimensional position information of one hand key point, from which the positions of the hand key points are obtained. Further, the key point detector connects the detected hand key points based on part affinity fields (PAFs), yielding the connection relationships between the hand key points.
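As a minimal sketch of decoding key point positions from such a heatmap output (the tensor layout and score threshold are assumptions, not values from the text):

```python
import numpy as np

def decode_keypoints(heatmaps, threshold=0.3):
    """heatmaps: array of shape (21, H, W), one probability map per hand
    key point. Returns an (x, y, score) triple per channel, or None for
    channels whose peak falls below the threshold (a missing key point)."""
    keypoints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        score = float(hm[y, x])
        keypoints.append((int(x), int(y), score) if score >= threshold else None)
    return keypoints

kps = decode_keypoints(np.random.rand(21, 64, 64))  # 21 decoded key points
```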
Optionally, after obtaining the plurality of sets of hand key point information of the potential user, morphological changes and/or displacements of the hand of the potential user may be determined according to the plurality of sets of hand key point information, so as to determine the hand motion of the potential user.
When several of a user's hand key points are packed too closely together, or some hand key points are missing from the detection result, the user's hand motion may be misjudged or missed. The moving direction of the user's elbow key point can therefore be used as an auxiliary cue for determining the moving direction of the hand, and hence the hand motion. Optionally, the region to be recognized corresponding to the potential user in the first image may include both the hand image and the elbow image of the potential user. Step 203 may then be implemented as follows: key point detection is performed on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple groups of hand and elbow key point information of the potential user, and the hand motion of the potential user is determined from that information; a sketch of such direction estimation follows.
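A minimal, assumption-laden Python sketch: the per-frame key point tracks, the pixel noise floor, and the direction labels are all illustrative rather than taken from the disclosure.

```python
def hand_motion(wrist_track, elbow_track=None, min_shift=30):
    """wrist_track / elbow_track: per-frame (x, y) pixel positions of the
    wrist and elbow key points across the first images. When the wrist
    gives no clear reading, the elbow track is used as the fallback cue.
    min_shift is an assumed noise floor in pixels."""
    def direction(track):
        dx = track[-1][0] - track[0][0]
        dy = track[-1][1] - track[0][1]
        if max(abs(dx), abs(dy)) < min_shift:
            return "still"
        if abs(dx) >= abs(dy):
            return "right" if dx > 0 else "left"
        return "down" if dy > 0 else "up"

    wrist_dir = direction(wrist_track)
    if wrist_dir == "still" and elbow_track is not None:
        return direction(elbow_track)
    return wrist_dir

print(hand_motion([(100, 300), (160, 302), (240, 298)]))  # "right"
```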
And 204, determining a target user in the one or more potential users as a gesture recognition object, wherein the hand motion of the target user is matched with a preset gesture.
Optionally, the preset gesture includes the initial portion of a gesture to be recognized. For example, if a complete gesture to be recognized takes 10 frames of images to judge, the motion in the first 3 of those frames may be selected as the preset gesture. Then, when a user wants to perform an air gesture operation, the user directly performs the gesture to be recognized in the shooting area of the camera; no separate wake-up gesture is needed to activate the device's gesture recognition function, and the gesture recognition object is determined without the user being aware of it, which simplifies operation and improves user experience.
The gesture to be recognized is a gesture which is configured in the display device in advance and can be converted into a control instruction. For example, in a conference scene, the gesture to be recognized, which is configured in advance on the conference terminal, may include a page-up gesture, a page-down gesture, a page-left gesture, a page-right gesture, and a screen capture gesture. When it is determined that the hand motion of a potential user moves from left to right, it may be determined that the hand motion of the potential user matches the start portion of the page left gesture, and the potential user may be determined to be the gesture recognition object.
Optionally, when a plurality of potential users with hand motions matching the preset gestures exist in the shooting area of the camera, the potential user closest to the camera is taken as the target user.
In the embodiment of the application, only one gesture recognition object can be determined at most at the same time, and the gesture recognition object may change along with the time. The closer the user is to the camera, the higher the probability that the user is determined to be a gesture recognition object.
Optionally, when the number of images not containing the target user's facial image, among the images captured of the shooting area after the target user became the gesture recognition object, exceeds a number threshold, or when the duration for which the target user has served as the gesture recognition object reaches a duration threshold, the target user stops being the gesture recognition object. For example, the number threshold may be 3 frames: when more than 3 captured frames lack the target user's facial image, the target user is no longer treated as the gesture recognition object and the gesture recognition object determination process starts again. The duration threshold is a preset aging time, for example 20 seconds: the gesture recognition object determined each time is valid for at most 20 seconds, after which it becomes invalid and must be determined anew, meeting the need for a flexible, changeable gesture recognition object in the application scenario.
Optionally, the condition for ending the target user's role as the gesture recognition object may also be that the target user makes no correct gesture to be recognized within a certain time after becoming the gesture recognition object (that time being shorter than the aging time), or that the target user is detected to have put the hand down, or to have kept the hand still (which excludes the case of a user maliciously occupying the role of gesture recognition object), and so on.
For example, assume potential users are determined using 3 frames of images, the facial image in the determination condition is a frontal face image, and the condition for ending a user's role as gesture recognition object is that the number of captured images not containing the user's frontal face image exceeds a number threshold. The gesture recognition object determination method may then proceed as follows. During the determination process, if 3 frames of images all contain a user's frontal face image and the user's hand motion, judged from those 3 frames, matches the preset gesture, that user is determined as the gesture recognition object, and gesture recognition is then performed on that user. Meanwhile, after the user becomes the gesture recognition object, subsequently captured images are checked in real time for the user's frontal face image; once the number of images lacking it reaches the threshold, the user is no longer the gesture recognition object and the determination process is executed again. The determination process may run only while there is no gesture recognition object in the shooting area; that is, after a gesture recognition object is determined, the display device or its connected post-processing end may stop executing the determination process until the most recently determined gesture recognition object becomes invalid.
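As an illustrative sketch of this lifecycle tracking (the class shape is an assumption; the thresholds mirror the 3-frame and 20-second examples above):

```python
import time

class GestureTarget:
    """Tracks whether the current target user remains a valid gesture
    recognition object, per the miss-count and aging-time conditions."""

    def __init__(self, user_id, miss_limit=3, max_age_s=20.0):
        self.user_id = user_id
        self.miss_limit = miss_limit
        self.max_age_s = max_age_s
        self.missed = 0
        self.locked_at = time.monotonic()

    def update(self, face_visible):
        """Call once per captured frame; returns False once the target
        user should stop being the gesture recognition object."""
        self.missed = 0 if face_visible else self.missed + 1
        expired = time.monotonic() - self.locked_at >= self.max_age_s
        return self.missed <= self.miss_limit and not expired
```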
And step 205, acquiring a to-be-identified area corresponding to the target user in the multi-frame second image, wherein the to-be-identified area corresponding to the target user comprises a hand image of the target user.
Here, acquiring the region to be recognized corresponding to the target user in the multiple frames of second images can be understood as acquiring only that region, not the regions to be recognized corresponding to users other than the target user. The multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; that is, the second images are captured chronologically after the first images. For example, in one case the second images follow the first images consecutively: the first N frames captured of the shooting area are the first images, and every frame captured after them is a second image. In another case, the capture of the second images may be discontinuous with that of the first images. In the present application, "first image" and "second image" merely distinguish shooting time: a first image is captured before the gesture recognition object is determined, and a second image after.
Optionally, after the target user is determined as the gesture recognition object in step 204, the face information of the target user may be further saved, so as to associate the hand motion of the target user with the face information of the target user, implement hand tracking on the target user, and further implement gesture recognition on the target user. The face information of the target user comprises the position and the movement trend of the face image of the target user in the multi-frame first image shot by the camera, or the face information of the target user comprises the face features of the target user. The implementation process of step 205 may include: and determining the position of the face image of the target user in the second image according to the stored face information of the target user. And determining a region to be identified corresponding to the target user in the second image according to the position of the face image of the target user in the second image.
Optionally, face detection may be performed on the second image to obtain the facial images in it. When the stored face information of the target user is the position and movement trend of the target user's facial image in the multiple frames of first images, the target user's facial image in the second image can be determined with a face tracking algorithm: for example, face detection is performed on the second image based on MTCNN, and the IoU of each face detection box in the second image with the target user's face detection box in the previous frame is computed to determine the target user's face detection box in the second image, giving the position of the target user's facial image in the second image. Alternatively, when the stored face information of the target user is the target user's facial features, after the one or more facial images in the second image are obtained, which of them belongs to the target user can be determined by computing facial similarity.
Optionally, the implementation manner of determining the to-be-recognized region corresponding to the target user in the second image according to the face image position of the target user in the second image may refer to the implementation manner of determining the to-be-recognized region corresponding to the potential user in the first image according to the face image position of the potential user in the first image in step 202, which is not described herein again in this embodiment of the present application.
Step 206: perform gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
While the target user serves as the gesture recognition object, gesture recognition can be performed on the target user continuously. Performing gesture recognition on the target user may consist of judging whether the target user's hand motion matches a preset gesture to be recognized.
Optionally, the implementation of step 206 includes: inputting the to-be-recognized regions corresponding to the target user in the multiple frames of second images, as an image sequence, into a gesture recognition model to obtain a gesture recognition result output by the model. The gesture recognition result may indicate a specific preset gesture to be recognized, meaning the target user performed that gesture during the shooting period of the multiple frames of second images; or it may indicate that no gesture to be recognized matched, meaning the target user performed no preset gesture during that period; or it may include a confidence level for each preset gesture to be recognized, in which case the display device, or the post-processing end connected to the display device, takes the gesture with the highest confidence that also exceeds a certain threshold as the gesture performed by the target user, and if no gesture's confidence exceeds the threshold, the target user is considered not to have performed any preset gesture to be recognized during the shooting period of the multiple frames of second images.
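The confidence-based variant above can be shown with a minimal sketch; the dict-based result format, the gesture names, and the 0.8 threshold are illustrative assumptions about how a gesture recognition model might report its confidences, not details from the disclosure.

```python
def select_gesture(confidences, threshold=0.8):
    """confidences: dict mapping gesture name -> model confidence.
    Return the matched gesture to be recognized, or None if no confidence
    clears the threshold (i.e., the target user performed no preset gesture)."""
    if not confidences:
        return None
    gesture, score = max(confidences.items(), key=lambda kv: kv[1])
    return gesture if score >= threshold else None

# Example: a "swipe_left" prediction that clears the threshold.
print(select_gesture({"swipe_left": 0.91, "pinch": 0.05}))  # -> swipe_left
print(select_gesture({"swipe_left": 0.40, "pinch": 0.35}))  # -> None
```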
Further, after gesture recognition is performed on the target user: if it is determined that the target user performed a gesture to be recognized during the shooting period of the multiple frames of second images, the control instruction corresponding to that gesture is responded to, realizing the air gesture operation, and gesture recognition continues on the target user until the condition for ending taking the target user as the gesture recognition object is met. If it is determined that the target user performed no gesture to be recognized during that period, gesture recognition may likewise continue on the target user until that end condition is met.
Optionally, if the preset gesture used for determining the gesture recognition object is the starting portion of a gesture to be recognized, whether the target user performed the gesture to be recognized that includes the preset gesture may be judged according to both the to-be-recognized regions corresponding to the target user in the multiple frames of first images and those in the multiple frames of second images. That is, after a user performs a gesture to be recognized in the camera's shooting area, the display device, or the post-processing end connected to the display device, can determine that user as the gesture recognition object based on the gesture and then respond to the corresponding control instruction.
In the embodiment of the application, while the target user serves as the gesture recognition object, the display device or the post-processing end connected to it performs gesture recognition only on the target user and not on other users; in other words, the gestures of a single user are locked onto and recognized for a period of time. This avoids the problem that gestures of different users interfere with one another and make accurate gesture control impossible.
The order of the steps of the gesture recognition object determination method provided by the embodiments of the application can be adjusted appropriately, and steps can be added or removed as required. Any variation readily conceived by a person skilled in the art within the technical scope disclosed in the present application falls within its protection scope and is not described further here.
In summary, in the gesture recognition object determination method provided by the embodiments of the present application, a user whose face image appears in every frame of multiple frames of images shot by the camera and whose hand motion matches the preset gesture is determined as the gesture recognition object within the camera's shooting area. The gesture recognition object can thus be determined automatically from the images shot by the camera, and gesture recognition can then be performed on that object to realize air gesture operation; the method suits gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement. In addition, while gesture recognition is performed on the gesture recognition object, the gestures of other users are not recognized, which avoids inaccurate gesture control caused by mutual interference between users' gestures. Optionally, users in the shooting area who are not facing the camera may be excluded, and the gesture recognition object determined only among the users facing the camera, reducing the probability of misjudging the gesture recognition object. Optionally, the starting portion of a gesture to be recognized serves as the preset gesture for judging the gesture recognition object: when a user wants to perform an air gesture operation, the user directly performs the gesture to be recognized in the camera's shooting area, with no need to perform a separate wake-up gesture to start the device's gesture recognition function. The gesture recognition object is thus determined without the user perceiving the process, which simplifies user operation and improves user experience.
Fig. 6 is a schematic structural diagram of a gesture recognition object determination apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes:
the first determining module 601 is configured to determine one or more potential users in a shooting area according to multiple frames of first images obtained by shooting the shooting area by a camera, where the potential users satisfy: each frame of the first images of the plurality of frames comprises a face image of the potential user.
The second determining module 602 is configured to determine a hand action of the potential user according to a to-be-identified region corresponding to the potential user in the multiple frames of first images, where the to-be-identified region corresponding to the potential user includes a hand image of the potential user.
A third determining module 603, configured to determine a target user of the one or more potential users as a gesture recognition object, where the hand motion of the target user matches a preset gesture.
Optionally, as shown in fig. 7, the apparatus 600 further includes: the first obtaining module 604 is configured to, after determining a target user of the one or more potential users as a gesture recognition object, obtain a to-be-recognized area corresponding to the target user in multiple frames of second images, where the to-be-recognized area corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by shooting, by a camera, a shooting area after determining the target user as the gesture recognition object. And the gesture recognition module 605 is configured to perform gesture recognition on the target user according to the to-be-recognized region corresponding to the target user in the multiple frames of second images.
Optionally, the preset gesture includes the starting portion of a gesture to be recognized, and the gesture recognition module 605 is configured to: judge whether the target user performed the gesture to be recognized according to the to-be-recognized regions corresponding to the target user in the multiple frames of first images and the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
Optionally, the first obtaining module 604 is configured to: and determining the face image position of the target user in the second image according to the stored face information of the target user. And determining the area to be recognized corresponding to the target user in the second image according to the position of the face image of the target user in the second image.
Optionally, as shown in fig. 8, the apparatus 600 further includes: a fourth determining module 606, configured to end taking the target user as the gesture recognition object when, among the images the camera shoots of the shooting area after the target user becomes the gesture recognition object, the number of images that do not include the target user's face image exceeds a number threshold, or when the duration for which the target user has served as the gesture recognition object reaches a duration threshold.
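For illustration, the following is a minimal sketch of the two end conditions handled by the fourth determining module, assuming per-frame updates; the class name, counter variables, and threshold values are illustrative assumptions rather than values from the disclosure.

```python
import time

class RecognitionSession:
    """Tracks whether the target user should remain the gesture recognition object."""

    def __init__(self, count_threshold=30, duration_threshold_s=60.0):
        self.count_threshold = count_threshold
        self.duration_threshold_s = duration_threshold_s
        self.missing_face_frames = 0
        self.start_time = time.monotonic()

    def update(self, target_face_found: bool) -> bool:
        """Return True while the session continues, False once either
        end condition is met."""
        # Count consecutive frames in which the target user's face is absent.
        self.missing_face_frames = (
            0 if target_face_found else self.missing_face_frames + 1
        )
        if self.missing_face_frames > self.count_threshold:
            return False  # face-absence count exceeded the number threshold
        if time.monotonic() - self.start_time >= self.duration_threshold_s:
            return False  # session duration reached the duration threshold
        return True
```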
Optionally, as shown in fig. 9, the apparatus 600 further includes: a fifth determining module 607, configured to determine the face image positions of the potential users in the first image after determining one or more potential users in the shooting area. A sixth determining module 608, configured to determine, according to the position of the facial image of the potential user in the first image, a to-be-recognized region corresponding to the potential user in the first image.
Optionally, the third determining module 603 is configured to: when a plurality of potential users with hand motions matched with the preset gestures exist in the shooting area, the potential user closest to the camera is taken as a target user.
Optionally, as shown in fig. 10, the apparatus 600 further includes: a second obtaining module 609, configured to obtain the distance from the potential user to the camera; and an output module 610, configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, the prompt being used to prompt the potential user to move closer to the camera. If the gesture recognition object determination apparatus is a display device, the output module 610 is a display module; if the apparatus is a post-processing end, the output module 610 is a sending module.
Optionally, the second obtaining module 609 is configured to: determine the distance from the potential user to the camera according to the focal length of the camera, the binocular (interpupillary) distance of the potential user in the first image, and a preset user interpupillary distance.
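This is the standard pinhole-camera similar-triangles relation, sketched minimally below; it assumes the focal length is expressed in pixels, and the 63 mm average interpupillary distance is an illustrative assumption, since the disclosure does not specify a value.

```python
def distance_to_camera(focal_length_px, eye_distance_px,
                       real_eye_distance_m=0.063):
    """Estimate the user-to-camera distance in meters from the pixel
    distance between the user's eyes in the first image:
        distance = focal_length * real_eye_distance / pixel_eye_distance."""
    if eye_distance_px <= 0:
        raise ValueError("eye distance in pixels must be positive")
    return focal_length_px * real_eye_distance_m / eye_distance_px

# Example: eyes 40 px apart under a 1000 px focal length -> about 1.58 m.
print(distance_to_camera(1000.0, 40.0))
```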
Optionally, the second determining module 602 is configured to: and respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain a plurality of groups of hand key point information of the potential users. And determining the hand actions of the potential users according to the plurality of groups of hand key point information of the potential users.
Optionally, the region to be identified corresponding to the potential user further includes an image of an elbow of the potential user. A second determining module 602, configured to: and respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain multiple groups of key point information of the hands and the elbows of the potential users. And determining the hand actions of the potential users according to the multiple groups of hand and elbow key point information of the potential users.
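For illustration, a minimal sketch of turning per-frame keypoints into a hand motion follows; the keypoint detector itself (a hand/elbow pose network) is assumed to exist elsewhere, and the "raise" rule and its threshold are illustrative placeholders rather than gestures from the disclosure.

```python
import numpy as np

def hand_motion(keypoints_per_frame, rise_threshold_px=40.0):
    """Classify a hand motion across frames from hand keypoint centroids.

    keypoints_per_frame: list with one array of (x, y) hand keypoints per
    frame, extracted from the to-be-identified regions of the first images."""
    centroids = [np.mean(np.asarray(kps), axis=0) for kps in keypoints_per_frame]
    # Image y grows downward, so a positive delta means the hand moved up.
    dy = centroids[0][1] - centroids[-1][1]
    return "raise" if dy > rise_threshold_px else "other"
```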
Optionally, the face image is a front face image.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 11 is a block diagram of a gesture recognition object determination device according to an embodiment of the present application. The gesture recognition object determination device may be a general-purpose computing device; in a conference scenario, for example, it may be a conference terminal or a post-conference processing end. The conference terminal may be a large screen, an electronic whiteboard, or the like; the post-conference processing end may be a single server, a server cluster composed of multiple servers, a cloud computing platform, or the like. As shown in fig. 11, the gesture recognition object determination device 1100 includes: a processor 1101 and a memory 1102.
A memory 1102 for storing a computer program comprising program instructions;
a processor 1101 for invoking the computer program for implementing the method steps as shown in fig. 2 in the above method embodiment.
Optionally, the gesture recognition object determination device 1100 further comprises a communication bus 1103 and a communication interface 1104.
The processor 1101 includes one or more processing cores, and the processor 1101 executes various functional applications and data processing by running a computer program.
The memory 1102 may be used to store the computer program. Optionally, the memory may store an operating system and the application programs required by at least one function. The operating system may be a real-time operating system such as Real Time eXecutive (RTX), or an operating system such as LINUX, UNIX, WINDOWS, or OS X.
There may be multiple communication interfaces 1104, used for communicating with other storage devices or network devices, such as switches or routers. For example, in this embodiment of the present application, when the gesture recognition object determination device is a post-conference processing end, the communication interface of the post-conference processing end may be used to send the gesture recognition object determination result to the conference terminal.
The memory 1102 and the communication interface 1104 are connected to the processor 1101 by a communication bus 1103, respectively.
The present application further provides a computer-readable storage medium, which stores instructions that, when executed by a processor, implement the method steps shown in fig. 2 in the above method embodiment.
An embodiment of the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method steps shown in fig. 2 in the above method embodiment are implemented.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc.
In the embodiments of the present application, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The term "and/or" in this application is only one kind of association relationship describing the association object, and means that there may be three kinds of relationships, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is intended only to illustrate the alternative embodiments of the present application, and not to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (27)

1. A gesture recognition object determination method, the method comprising:
according to a plurality of frames of first images obtained by shooting a shooting area by a camera, determining one or more potential users in the shooting area, wherein the potential users satisfy the following conditions: each frame of the plurality of frames of first images comprises a face image of the potential user;
determining hand motions of the potential user according to a to-be-identified area corresponding to the potential user in the multi-frame first image, wherein the to-be-identified area corresponding to the potential user comprises a hand image of the potential user;
determining a target user of the one or more potential users as a gesture recognition object, wherein the hand motion of the target user is matched with a preset gesture.
2. The method of claim 1, wherein after determining a target user of the one or more potential users as a gesture recognition object, the method further comprises:
acquiring a to-be-recognized area corresponding to the target user in a plurality of frames of second images, wherein the to-be-recognized area corresponding to the target user comprises a hand image of the target user, and the plurality of frames of second images are obtained by shooting the shooting area by the camera after the target user is determined as a gesture recognition object;
and performing gesture recognition on the target user according to the to-be-recognized area corresponding to the target user in the multi-frame second image.
3. The method according to claim 2, wherein the preset gesture comprises a starting portion of a gesture to be recognized, and the performing gesture recognition on the target user according to the region to be recognized corresponding to the target user in the multiple frames of second images comprises:
and judging whether the target user executes the gesture to be recognized or not according to the region to be recognized corresponding to the target user in the multi-frame first image and the region to be recognized corresponding to the target user in the multi-frame second image.
4. The method according to claim 2 or 3, wherein the acquiring the to-be-identified region corresponding to the target user in the second images of the plurality of frames comprises:
determining the position of the face image of the target user in the second image according to the stored face information of the target user;
and determining a region to be identified corresponding to the target user in the second image according to the position of the face image of the target user in the second image.
5. The method of any of claims 1 to 4, further comprising:
when the number of images, which do not include the face image of the target user, in the images obtained by shooting the shooting area by the camera after the target user is used as the gesture recognition object exceeds a number threshold, or when the duration of the target user as the gesture recognition object reaches a duration threshold, finishing taking the target user as the gesture recognition object.
6. The method of any of claims 1 to 5, wherein after determining one or more potential users within the capture area, the method further comprises:
determining a facial image position of the potential user in the first image;
and determining a region to be identified corresponding to the potential user in the first image according to the position of the face image of the potential user in the first image.
7. The method of any one of claims 1 to 6, wherein determining a target user of the one or more potential users as a gesture recognition object comprises:
when a plurality of potential users with hand motions matched with the preset gestures exist in the shooting area, taking the potential user closest to the camera as the target user.
8. The method of any of claims 1 to 7, wherein after determining one or more potential users within the capture area, the method further comprises:
obtaining a distance of the potential user to the camera;
outputting a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, the distance prompt being used to prompt the potential user to approach the camera.
9. The method of claim 8, wherein the obtaining the distance of the potential user from the camera comprises:
and determining the distance from the potential user to the camera according to the focal length of the camera, the binocular distance of the potential user in the first image and the preset binocular distance of the potential user.
10. The method according to any one of claims 1 to 9, wherein the determining the hand motion of the potential user according to the area to be identified corresponding to the potential user in the plurality of frames of first images comprises:
respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain a plurality of groups of hand key point information of the potential users;
and determining the hand motions of the potential user according to the plurality of groups of hand key point information of the potential user.
11. The method according to any one of claims 1 to 9, wherein the region to be identified corresponding to the potential user further includes an elbow image of the potential user, and the determining the hand motion of the potential user according to the region to be identified corresponding to the potential user in the first image of the plurality of frames comprises:
respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain multiple groups of hand and elbow key point information of the potential users;
and determining the hand motions of the potential user according to the multiple groups of hand and elbow key point information of the potential user.
12. The method according to any one of claims 1 to 11, wherein the face image is a front face image.
13. A gesture recognition object determination apparatus, the apparatus comprising:
the first determining module is used for determining one or more potential users in a shooting area according to a plurality of frames of first images obtained by shooting the shooting area by a camera, wherein the potential users meet the following conditions: each frame of the plurality of frames of first images comprises a face image of the potential user;
a second determining module, configured to determine a hand motion of the potential user according to a to-be-identified region corresponding to the potential user in the multiple-frame first image, where the to-be-identified region corresponding to the potential user includes a hand image of the potential user;
and the third determination module is used for determining a target user in the one or more potential users as a gesture recognition object, and the hand motion of the target user is matched with a preset gesture.
14. The apparatus of claim 13, further comprising:
a first obtaining module, configured to obtain a to-be-recognized region corresponding to a target user in multiple frames of second images after the target user in the one or more potential users is determined as a gesture recognition object, where the to-be-recognized region corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by shooting the shooting region by the camera after the target user is determined as the gesture recognition object;
and the gesture recognition module is used for performing gesture recognition on the target user according to the to-be-recognized area corresponding to the target user in the multi-frame second image.
15. The apparatus of claim 14, wherein the preset gesture comprises a starting portion of a gesture to be recognized, and the gesture recognition module is configured to:
and judging whether the target user executes the gesture to be recognized or not according to the region to be recognized corresponding to the target user in the multi-frame first image and the region to be recognized corresponding to the target user in the multi-frame second image.
16. The apparatus of claim 14 or 15, wherein the first obtaining module is configured to:
determining the position of the face image of the target user in the second image according to the stored face information of the target user;
and determining a region to be identified corresponding to the target user in the second image according to the position of the face image of the target user in the second image.
17. The apparatus of any one of claims 13 to 16, further comprising:
a fourth determining module, configured to, when the number of images that are obtained by shooting the shooting area by the camera after the target user serves as a gesture recognition object and do not include the face image of the target user exceeds a number threshold, or when a duration of the target user serving as the gesture recognition object reaches a duration threshold, end taking the target user as the gesture recognition object.
18. The apparatus of any of claims 13 to 17, further comprising:
a fifth determining module, configured to determine, after determining one or more potential users in the shooting area, face image positions of the potential users in the first image;
a sixth determining module, configured to determine, according to a face image position of the potential user in the first image, a region to be identified corresponding to the potential user in the first image.
19. The apparatus of any of claims 13 to 18, wherein the third determining module is configured to:
when a plurality of potential users with hand motions matched with the preset gestures exist in the shooting area, taking the potential user closest to the camera as the target user.
20. The apparatus of any one of claims 13 to 19, further comprising:
a second acquisition module for acquiring a distance from the potential user to the camera;
an output module configured to output a distance prompt when a distance from the potential user to the camera exceeds a distance threshold, the distance prompt being used to prompt the potential user to approach the camera.
21. The apparatus of claim 20, wherein the second obtaining module is configured to:
and determining the distance from the potential user to the camera according to the focal length of the camera, the binocular distance of the potential user in the first image and the preset binocular distance of the potential user.
22. The apparatus of any of claims 13 to 21, wherein the second determining module is configured to:
respectively carrying out key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain a plurality of groups of hand key point information of the potential users;
and determining the hand motions of the potential user according to the plurality of groups of hand key point information of the potential user.
23. The apparatus according to any one of claims 13 to 21, wherein the region to be identified corresponding to the potential user further includes an image of an elbow of the potential user, and the second determining module is configured to:
respectively performing key point detection on the areas to be identified corresponding to the potential users in the multi-frame first images to obtain multiple groups of key point information of hands and elbows of the potential users;
and determining the hand motions of the potential user according to the multiple groups of hand and elbow key point information of the potential user.
24. The apparatus according to any one of claims 13 to 23, wherein the face image is a front face image.
25. A gesture recognition object determination device, characterized by comprising: a processor and a memory;
the memory for storing a computer program, the computer program comprising program instructions;
the processor is configured to invoke the computer program to implement the gesture recognition object determination method according to any one of claims 1 to 12.
26. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, implement a gesture recognition object determination method according to any one of claims 1 to 12.
27. A computer program product comprising a computer program which, when executed by a processor, carries out a gesture recognition object determination method according to any one of claims 1 to 12.
CN202111034365.2A 2021-06-30 2021-09-03 Gesture recognition object determination method and device Pending CN115565241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/078623 WO2023273372A1 (en) 2021-06-30 2022-03-01 Gesture recognition object determination method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021107363576 2021-06-30
CN202110736357 2021-06-30

Publications (1)

Publication Number Publication Date
CN115565241A true CN115565241A (en) 2023-01-03

Family

ID=84736668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111034365.2A Pending CN115565241A (en) 2021-06-30 2021-09-03 Gesture recognition object determination method and device

Country Status (1)

Country Link
CN (1) CN115565241A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116301363A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Space gesture recognition method, electronic equipment and storage medium
CN116301363B (en) * 2023-02-27 2024-02-27 荣耀终端有限公司 Space gesture recognition method, electronic equipment and storage medium
CN116704614A (en) * 2023-06-29 2023-09-05 北京百度网讯科技有限公司 Action recognition method, device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination