CN114898459A - Method for gesture recognition and related product - Google Patents

Method for gesture recognition and related product Download PDF

Info

Publication number
CN114898459A
Authority
CN
China
Prior art keywords
fingertip
gesture recognition
trigger action
detecting
hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210398206.9A
Other languages
Chinese (zh)
Inventor
李翌昕
范睿凯
伍更新
王博威
张广勇
林辉
段亦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Youdao Information Technology Beijing Co Ltd
Original Assignee
Netease Youdao Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Youdao Information Technology Beijing Co Ltd filed Critical Netease Youdao Information Technology Beijing Co Ltd
Priority to CN202210398206.9A priority Critical patent/CN114898459A/en
Publication of CN114898459A publication Critical patent/CN114898459A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present invention provide a method for gesture recognition and a related product, the method comprising: acquiring a video frame containing a hand in real time; detecting whether the hand in the video frame performs a predetermined trigger action; and performing gesture recognition in response to detecting the predetermined trigger action. According to the embodiments of the invention, gesture recognition is started only after the predetermined trigger action is detected, so that recognition can focus on the gestures made after the trigger condition is met, yielding high recognition efficiency and accuracy.

Description

Method for gesture recognition and related product
Technical Field
The embodiment of the invention relates to the technical field of computer vision and human-computer interaction, in particular to a method for gesture recognition, a device for gesture recognition and a computer-readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
A gesture, i.e. a posture of the hand, refers to the specific movements and positions a person makes with the hand and arm. Gestures are among the earliest means of human expression and are still widely used for interaction. In the fields of human-computer interaction and computer vision, gesture interaction technology allows equipment to be controlled through gestures. Gesture interaction can be applied to interaction scenarios such as automobiles, household appliances, mobile phones, AR/VR, games, marketing, education, and government affairs. For example, in an application scenario of smart learning, gesture interaction may be used for smart reading, i.e., determining the content to be read through gestures.
The core technology of gesture interaction is gesture recognition, which generally includes: firstly, shooting an image or a video through an optical system, and acquiring gesture information; and then recognizing the intention of the user according to the collected gesture information based on an image processing algorithm.
Specifically, the gesture recognition technology may include one or more of hand/finger/hand key point detection, hand segmentation, target tracking, sequence prediction, sequence logic determination, and the like. For example, chinese patent application publication No. CN111324201A discloses a reading method, device and system based on somatosensory interaction. The technical scheme of the patent document is used in the field of intelligent reading, and various interactive gestures are designed, wherein the interactive gestures comprise single pointing, double continuous clicking of a single finger, single finger circle, palm horizontal placement and the like.
The main drawback of this solution is that only the gestures of a single hand can be recognized; if another hand, even a stationary one, is present in the recognition range, gesture recognition may be interfered with and recognition accuracy affected. In addition, even when only a single hand is present, erroneous determinations are frequent. In some cases, for example when a long gesture (such as drawing a circle) goes wrong, the device cannot terminate the flow in time and may wait for a long period. Further, some prior art solutions rely on higher-quality images, which places higher requirements on the imaging device and increases the overall equipment cost.
Disclosure of Invention
Gesture recognition in the prior art has the above disadvantages. For this reason, a method for gesture recognition that addresses at least one of the above drawbacks is highly desirable. The invention also provides an apparatus and a computer-readable storage medium for gesture recognition.
In this context, embodiments of the present invention are intended to provide a method for gesture recognition and related products.
In a first aspect of embodiments of the present invention, there is provided a method for gesture recognition, comprising: acquiring a video frame containing a hand in real time; detecting whether a hand in the video frame contains a predetermined trigger action; and performing gesture recognition in response to detecting the predetermined trigger action.
In one embodiment, detecting whether a hand in the video frame contains a predetermined trigger action comprises: detecting fingertip information in the video frame; determining from the fingertip information whether a hand in the video frame includes a predetermined trigger action.
In another embodiment, detecting fingertip information in the video frame comprises: detecting a position of the fingertip; and detecting the motion track of the fingertip.
In yet another embodiment, detecting the motion trajectory of the fingertip comprises: estimating a predicted position of the fingertip in a next frame based on a historical position of the fingertip; and determining the track of the fingertip in time sequence based on the predicted position and the detected position of the fingertip in the next frame.
In yet another embodiment, estimating the predicted position of the fingertip in the next frame based on the historical position comprises: the predicted position is estimated using a second order difference method or an extrapolation method.
In one embodiment, determining the temporal trajectory of the fingertip based on the predicted position and the detected position of the fingertip in the next frame comprises: adding the detection position of the fingertip in the next frame into the track of the fingertip on the time sequence in response to the difference between the predicted position and the detection position of the fingertip in the next frame being within a set range; and in response to the difference between the predicted position and the detected position of the fingertip in the next frame being out of the set range, adding the predicted position of the fingertip in the next frame to the track of the fingertip in time sequence.
In another embodiment, detecting whether a hand in the video frame contains a predetermined trigger action comprises: and if the fingertip is detected to stay within a preset range in a preset number of video frames, determining that the hand in the video frames contains a preset trigger action.
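The "fingertip stays within a predetermined range for a preset number of video frames" condition above can be sketched as a small dwell check. This is a hedged illustration, not the patent's implementation: the frame count `n_frames` and the pixel `radius` are assumed values.

```python
def dwell_trigger(positions, n_frames=10, radius=15.0):
    """Return True when the last n_frames fingertip positions all stay
    within `radius` pixels of their centroid -- a minimal sketch of the
    'fingertip stays within a predetermined range' trigger described in
    the text. n_frames and radius are illustrative values, not taken
    from the patent."""
    if len(positions) < n_frames:
        return False
    recent = positions[-n_frames:]
    cx = sum(p[0] for p in recent) / n_frames
    cy = sum(p[1] for p in recent) / n_frames
    # Every recent point must lie within `radius` of the centroid.
    return all((p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2
               for p in recent)
```

A fingertip hovering in place for 10 frames triggers; a fingertip sweeping across the frame does not.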
In yet another embodiment, if it is detected that a plurality of fingertips stay within a predetermined range in a preset number of video frames, a predetermined trigger contained in a hand in the video frames is determined as a pointing trigger action.
In yet another embodiment, in response to detecting the predetermined trigger action, performing gesture recognition includes: determining that the predetermined trigger action contained in the hand in the video frame is a pointing trigger action, returning the current trigger state, and returning the current pointing coordinates.
In one embodiment, if a single fingertip is detected to stay within a predetermined range within a preset number of video frames, a predetermined trigger contained by a hand in the video frame is determined as a circling trigger action.
In another embodiment, in response to detecting the predetermined trigger action, performing gesture recognition comprises: in response to detecting a cancel circling condition, terminating current gesture recognition; and/or returning a gesture recognition result in response to detecting a complete circling condition.
In yet another embodiment, the un-circling condition comprises an active un-circling condition and a passive un-circling condition.
In yet another embodiment, the active un-circling condition comprises any one of: a multi-finger open gesture, a fist-making gesture, a swing gesture, and a move-out recognition zone gesture.
In one embodiment, the passive un-circling condition comprises: the circling motion is detected to stop.
In another embodiment, the circling completion condition includes any one of: the motion track of the fingertip passes through the starting point of the motion track; or the motion trail of the fingertip passes through any point of the motion trail.
In yet another embodiment, the movement trajectory of the fingertip passing the starting point of the movement trajectory includes: the end point of the motion track and the starting point of the motion track are within a preset distance.
In yet another embodiment, in response to detecting the complete circling condition, returning the gesture recognition result comprises: the current trigger state is returned and the current circling trajectory is returned.
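The circling-completion condition described in the embodiments above, i.e. the end point of the trajectory returning within a preset distance of the starting point, can be sketched as follows. The distance threshold `close_dist` is an illustrative value, not taken from the patent.

```python
def circling_complete(trajectory, close_dist=20.0):
    """Return True when the end of the fingertip trajectory comes back
    within `close_dist` pixels of its starting point, i.e. the 'end
    point and starting point within a preset distance' completion
    condition. close_dist is an assumed value."""
    if len(trajectory) < 3:
        # Too few points to form a closed circle.
        return False
    (x0, y0), (xn, yn) = trajectory[0], trajectory[-1]
    return (xn - x0) ** 2 + (yn - y0) ** 2 <= close_dist ** 2
```

The alternative completion condition (the trajectory passing through any earlier point of itself) would compare the latest point against every stored point rather than only the start.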
In a second aspect of embodiments of the present invention, there is provided an apparatus for gesture recognition, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the method for gesture recognition according to any one of the first aspect of embodiments of the present invention.
In a third aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform a method for gesture recognition according to any one of the first aspect of embodiments of the present invention.
According to the embodiments of the invention, gesture recognition is started only after the predetermined trigger action is detected; that is, gesture recognition begins once the trigger condition is met and is not performed before that. In other words, before the trigger condition is met, only the specific trigger action needs to be detected, and no other actions need to be judged. Recognition can therefore focus on the gestures made after the trigger condition, so recognition efficiency and accuracy are high.
Furthermore, based on the design of the cancel logic, the user can be more flexible in the circling process, for example, the user can use a corresponding gesture to terminate the circling process at any time, and the device can be timely perceived to prepare for a new round of detection; and the circling gesture recognition process can also be skipped in time when the algorithm fails so as to restart the detection.
Further, the various logical decisions for gesture interaction rely solely on the coordinates of the fingertips, and are therefore simple and efficient to apply. Fingertip coordinates can be obtained from a target detection model known in the prior art or developed in the future, so that only a common monocular camera is needed to meet the requirements of the invention.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the present invention;
FIG. 2 schematically illustrates a flow diagram of a method 200 for gesture recognition in accordance with one embodiment of the present invention;
FIG. 3 schematically illustrates a relationship diagram of a detection means and its corresponding fingertip information in accordance with one embodiment of the present invention;
FIG. 4 schematically shows a flow diagram of a method for gesture recognition according to one embodiment of the present invention; and
FIG. 5 schematically shows a schematic block diagram of an apparatus 1000 for gesture recognition according to an embodiment of the present invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the present invention. As shown in fig. 1, computing system 100 may include: a Central Processing Unit (CPU)101, a Random Access Memory (RAM)102, a Read Only Memory (ROM)103, a system bus 104, a hard disk controller 105, a keyboard controller 106, a serial interface controller 107, a parallel interface controller 108, a display controller 109, a hard disk 110, a keyboard 111, a serial external device 112, a parallel external device 113, and a display 114. Among these devices, coupled to the system bus 104 are a CPU 101, a RAM 102, a ROM 103, a hard disk controller 105, a keyboard controller 106, a serial controller 107, a parallel controller 108, and a display controller 109. The hard disk 110 is coupled to the hard disk controller 105, the keyboard 111 is coupled to the keyboard controller 106, the serial external device 112 is coupled to the serial interface controller 107, the parallel external device 113 is coupled to the parallel interface controller 108, and the display 114 is coupled to the display controller 109. It should be understood that the block diagram of the architecture depicted in FIG. 1 is for purposes of illustration only and is not intended to limit the scope of the present invention. In some cases, certain devices may be added or subtracted as the case may be.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects. Furthermore, in some embodiments, the invention may also take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Embodiments of the present invention will be described below with reference to flowchart illustrations of methods and block diagrams of apparatuses (or systems) of embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
According to the embodiment of the invention, a method for gesture recognition and a related product thereof are provided. Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventor found that misjudgment occurs frequently in the prior art because conventional gesture recognition schemes have no clear trigger condition, so all kinds of meaningless gestures are recognized, ultimately causing frequent misjudgment. If a clear trigger condition is added and recognition is performed only after it is detected, recognition efficiency and accuracy can be improved.
The inventor also found that if an error occurs while performing a long gesture (such as circling), the device cannot terminate the flow in time, which may cause a long wait and degrade interactivity. If a termination condition is added so that gesture recognition stops once it is detected, a new gesture recognition process can be started in time and interactivity is better.
The inventor also found that by tracking the hand performing the gesture interaction, that hand can be identified and interference from other hands eliminated, so that multiple hands can perform gesture interaction at the same time without limiting the user to a single hand.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
A method for gesture recognition according to an exemplary embodiment of the present invention is described below with reference to fig. 2. It should be noted that the embodiments of the present invention can be applied to any applicable scenario. For example, smart learning, assisted learning, smart reading, and the like; the method can be adopted in any application scene capable of using gesture interaction.
Fig. 2 schematically shows a flow chart of a method 200 for gesture recognition according to an embodiment of the present invention, comprising step S201, step S202 and step S203.
In step S201, a video frame including a hand is acquired in real time. Here, "acquiring in real time" means that the video frames need to be continuously identified and determined, for example, each video frame is processed or the video frames are processed at intervals. In addition, the acquired video frame can be an infrared image, an RGB-D image, and the like according to the adopted acquisition device or sensor. In this embodiment, since the cost of the apparatus for acquiring the RGB image is low and the requirement of the recognition method on the image quality is not high, the RGB image is preferably used.
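The choice in step S201 between processing every video frame and processing frames at intervals can be expressed as a simple stride-based sampler. This is a sketch under stated assumptions: the stride value is an illustrative choice, not specified in the text.

```python
class FrameSampler:
    """Decide which incoming video frames to run recognition on:
    stride=1 processes every frame, stride=2 every other frame, etc.
    The interval itself is an assumed parameter, not from the patent."""

    def __init__(self, stride=1):
        self.stride = max(1, stride)
        self.count = 0

    def should_process(self):
        # Process frame indices 0, stride, 2*stride, ...
        take = (self.count % self.stride) == 0
        self.count += 1
        return take
```

In a capture loop, each frame read from the camera would be passed through `should_process()` before being fed to the detector, trading recognition latency against compute cost.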
As an example, the capturing device may be a fixed camera, and the fixed camera may capture a fixed recognition area. The acquisition device may also be a movable camera that can shoot the identification area corresponding to the moving target.
It should be noted that, in step S201, the video frames containing hands carry hand information, but this does not mean that every video frame necessarily contains hand information; for example, in some cases the hand may quickly leave the recognition area, leaving some video frames without hand information.
In step S202, it is detected whether the hand in the video frame contains a predetermined trigger action. Wherein the predetermined trigger action is a designed gesture or a set of gestures. By way of example, the predetermined trigger action may be specified in a product description or in a product guide interface to assist the user in understanding the predetermined trigger action.
In step S203, in response to detecting the predetermined trigger action, gesture recognition is performed. The detection of a predetermined trigger action marks the start of gesture recognition: before the predetermined trigger action is detected, gesture recognition is not performed, even if other gestures are present. In other words, when the user makes a predetermined trigger action, it means that the user is about to start making a meaningful gesture.
Based on the above, gesture recognition is started only after the predetermined trigger action is detected; that is, gesture recognition is performed once the trigger condition is met and not before. In other words, before the trigger condition is met, only the specific trigger action needs to be detected, and no other actions need to be judged. Recognition can therefore focus on the gestures made after the trigger condition, so recognition efficiency and accuracy are high.
In one embodiment, the step S202 of detecting whether the hand in the video frame contains a predetermined trigger action may include: detecting fingertip information in the video frame; determining from the fingertip information whether a hand in the video frame includes a predetermined trigger action.
The hand information is of many kinds, and may include fingers, palms, joints, and the like, for example. In the following embodiments, a pointing gesture and a circling gesture are mainly described, and therefore, the tip of a finger is used as a recognition object.
Fig. 3 schematically shows the relationship between the detection means and the corresponding fingertip information according to an embodiment of the present invention. As shown in fig. 3, the fingertip information includes the position and category 302 of the fingertip and the motion trajectory 304 of the fingertip. By way of example, the position and category 302 of the fingertip may be obtained using a target detection model: a video frame is input to the model, which outputs the positions and corresponding categories of all targets in the frame. The position of a target may be represented by a horizontal rectangular box, and the target categories may comprise six classes: the hand and the five fingertips. With the target detection model, whether hands and fingertips appear in each video frame, as well as the positions of all hands and corresponding fingertips, can be obtained. The target detection model can be implemented using the prior art and is not described in detail here.
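The six-class detector output described above (one hand class plus five fingertip classes, each with a horizontal rectangular box) might be represented as below. The class names, box format, and confidence threshold are assumptions for illustration, not taken from the patent.

```python
from dataclasses import dataclass

# One "hand" class plus five fingertip classes, as described in the text;
# the specific class names are illustrative.
CLASSES = ("hand", "thumb_tip", "index_tip", "middle_tip",
           "ring_tip", "pinky_tip")


@dataclass
class Detection:
    cls: str     # one of CLASSES
    box: tuple   # (x1, y1, x2, y2) horizontal rectangular box
    score: float # detector confidence


def fingertip_centers(detections, min_score=0.5):
    """Reduce fingertip boxes to their center points for trajectory
    tracking; min_score is an assumed confidence threshold."""
    centers = []
    for d in detections:
        if d.cls != "hand" and d.score >= min_score:
            x1, y1, x2, y2 = d.box
            centers.append((d.cls, ((x1 + x2) / 2.0, (y1 + y2) / 2.0)))
    return centers
```

Downstream, only the center coordinates of the fingertip boxes are needed, which is consistent with the later statement that the interaction logic relies solely on fingertip coordinates.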
As shown in fig. 3, the position and type 302 of a fingertip are acquired by fingertip detection 301, and a motion trajectory 304 of the fingertip can be detected based on the fingertip detection 301. The motion track 304 of the fingertip is detected, which may also be referred to as fingertip tracking 303, and the motion track 304 of the fingertip may be a track of the fingertip in time sequence. Based on the method, when the equipment collects a plurality of hands, tracks of different hands can be distinguished; moreover, because the track of the fingertip on the time sequence has certain continuity, some possible fingertip detection errors can be eliminated on the basis of the track. The following is a detailed description.
Specifically, fingertip tracking 303 includes: estimating a predicted position of the fingertip in a next frame based on a historical position of the fingertip; and determining the track of the fingertip in time sequence based on the predicted position and the detected position of the fingertip in the next frame.
By way of example, the position of the fingertip can be represented by fingertip coordinates, the trajectory of the fingertip in time series by a fingertip coordinate sequence, and the predicted position of the fingertip by a fingertip coordinate estimate. Estimating the predicted position of the fingertip in the next frame based on the historical positions may comprise: estimating the predicted position using an algorithm such as a second-order difference method or an extrapolation method. In one embodiment, if the fingertip coordinate at time t is (x_t, y_t), the fingertip coordinate at time t-1 is (x_{t-1}, y_{t-1}), and the fingertip coordinate at time t-2 is (x_{t-2}, y_{t-2}), then the fingertip coordinate at time t+1 is estimated as (3x_t - 3x_{t-1} + x_{t-2}, 3y_t - 3y_{t-1} + y_{t-2}).
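The second-order difference estimate in this paragraph can be written directly as:

```python
def predict_next(p_t, p_t_minus_1, p_t_minus_2):
    """Second-order difference extrapolation of a fingertip coordinate:
    x_{t+1} = 3*x_t - 3*x_{t-1} + x_{t-2}, and likewise for y, as given
    in the text. Each argument is an (x, y) pair."""
    return (3 * p_t[0] - 3 * p_t_minus_1[0] + p_t_minus_2[0],
            3 * p_t[1] - 3 * p_t_minus_1[1] + p_t_minus_2[1])
```

For uniform motion this reduces to linear extrapolation: for points (0, 0), (1, 1), (2, 2) at times t-2, t-1, t, the estimate `predict_next((2, 2), (1, 1), (0, 0))` is (3, 3).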
Wherein determining the temporal trajectory of the fingertip based on the predicted position and the detected position of the fingertip in the next frame comprises: adding the detection position of the fingertip in the next frame into the track of the fingertip on the time sequence in response to the difference between the predicted position and the detection position of the fingertip in the next frame being within a set range; and in response to the difference between the predicted position and the detected position of the fingertip in the next frame being out of the set range, adding the predicted position of the fingertip in the next frame to the track of the fingertip in time sequence.
That is, to obtain the temporal fingertip trajectory, the detected position and the predicted position of the fingertip in the next frame are compared. If the difference between them is smaller than a certain threshold (the set range), the match is considered successful and the detected position (the fingertip coordinate) in the next frame is appended to the end of the trajectory; if the difference is larger than the threshold, the match is considered failed and the predicted position (the fingertip coordinate estimate) is appended instead. The threshold may be set according to the finger movement speed of a typical user. A match failure may be caused by a fingertip detection error or by the user moving the hand out of the recognition area, so this fingertip tracking method can eliminate some possible fingertip detection errors, giving the tracking a certain fault tolerance and preventing a single detection error from causing an overall failure. Moreover, fingertip tracking can distinguish the motion trajectories of different hands in a scene where multiple hands are present simultaneously, so the user is not limited to one hand during gesture interaction and may even perform different gesture interactions with both hands at the same time.
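A minimal sketch of this matching step, assuming a list-based trajectory and an illustrative pixel threshold (the patent leaves the exact value to the implementer):

```python
import math

def update_track(track, detected, predicted, threshold=40.0):
    """Append the detected fingertip position when it matches the
    prediction (Euclidean distance within `threshold`); otherwise append
    the predicted position instead. Returns True iff the match succeeded.
    The default threshold is illustrative; the patent sets it according
    to a typical user's finger movement speed."""
    if math.dist(detected, predicted) <= threshold:
        track.append(detected)
        return True
    track.append(predicted)
    return False
```

On a failed match the trajectory is extended with the prediction, so a single spurious detection cannot break the track.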
In addition, in one embodiment, the fingertip tracking method further includes a tracking-failure judgment. For example, if matching fails in several (for example, three) consecutive video frames, tracking is judged to have failed. At this point, the temporal trajectory of the fingertip can be discarded and detection restarted. A prompt may also be generated to inform the user that triggering may be attempted again.
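The tracking-failure judgment might be sketched as follows; the state layout and the three-frame limit are illustrative choices, not mandated by the patent:

```python
MAX_MISSES = 3  # consecutive failed matches before tracking is declared lost

def record_match(state, matched):
    """Count consecutive match failures; once MAX_MISSES is reached the
    trajectory is discarded so detection can restart, and True is
    returned so the caller can prompt the user to trigger again.
    `state` is a dict with a 'track' list and a 'misses' counter."""
    if matched:
        state['misses'] = 0
        return False
    state['misses'] += 1
    if state['misses'] >= MAX_MISSES:
        state['track'].clear()  # discard the trajectory, restart detection
        state['misses'] = 0
        return True
    return False
```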
In one embodiment, the step of detecting whether the hand in the video frame contains a predetermined trigger action in step S202 includes: if it is detected that a fingertip stays within a preset range in a preset number of video frames, determining that the hand in the video frames contains a predetermined trigger action.
By way of example, the predetermined trigger actions may include a pointing trigger action and a circling trigger action. Specifically, according to the results of fingertip detection 301 and fingertip tracking 303 shown in fig. 3, if multiple fingertips (two or more) are detected to stay within a predetermined range over a preset number (three or more) of video frames, the predetermined trigger action contained by the hand in the video frames is determined to be the pointing trigger action. If a single fingertip is detected to stay within the predetermined range over the preset number of video frames, the predetermined trigger action contained by the hand in the video frames is determined to be the circling trigger action.
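A possible sketch of this trigger classification, assuming fingertips are reported in a consistent order across frames (a simplification; a real system would associate fingertips across frames via the tracking described above), with illustrative values for the frame count and range:

```python
import math

def classify_trigger(recent_frames, hold_frames=3, radius=15.0):
    """Classify the predetermined trigger action from per-frame fingertip
    detections. `recent_frames` is a list with one entry per video frame,
    each entry a list of (x, y) fingertip positions. Returns 'pointing',
    'circling', or None."""
    if len(recent_frames) < hold_frames:
        return None
    window = recent_frames[-hold_frames:]
    counts = {len(frame) for frame in window}
    if len(counts) != 1:
        return None  # fingertip count changed mid-window
    n = counts.pop()
    if n == 0:
        return None
    # every fingertip must stay near its position in the first window frame
    for i in range(n):
        anchor = window[0][i]
        if any(math.dist(anchor, frame[i]) > radius for frame in window):
            return None
    return 'pointing' if n >= 2 else 'circling'
```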
If the predetermined trigger action contained by the hand in the video frames is the pointing trigger action, the user intends to perform a pointing gesture; if it is the circling trigger action, the user is about to perform a circling gesture. That is, different predetermined trigger actions reflect different intentions of the user. For example, in the present embodiment, the pointing trigger action is two fingers (which may be the index finger and the middle finger) staying within a predetermined range (a small range) for a period of time (over consecutive video frames), and the circling trigger action is a single finger (which may be the index finger) staying within a predetermined range for a period of time. Using a two-finger gesture for the pointing trigger action improves the recognition success rate. On this basis, by virtue of the different gesture designs, the embodiment of the invention can recognize both gestures, pointing and circling.
The pointing trigger action and the circling trigger action described above may be presented in the product's user-guidance interface to teach the user. After a predetermined trigger action is detected, the pointing gesture and the circling gesture can each be further recognized, as described in detail below.
FIG. 4 schematically illustrates a flow diagram of a method for gesture recognition according to one embodiment of the present invention. Here, the detection of the pointing trigger action is denoted as pointing trigger S402, and the detection of the circling trigger action is denoted as circling trigger S403.
In step S401, the above-described fingertip detection and fingertip tracking are performed on each video frame to obtain the position of the fingertip (e.g., fingertip coordinates) and the motion trajectory of the fingertip (e.g., a fingertip coordinate sequence). If a corresponding predetermined trigger action is detected, the corresponding gesture recognition is performed.
In step S402, if the detected predetermined trigger action is the pointing trigger action, pointing gesture recognition is started; in step S403, if the detected predetermined trigger action is the circling trigger action, circling gesture recognition is started; in step S404, if no predetermined trigger action is detected, the process returns to step S401 to continue fingertip detection and fingertip tracking.
In step S4021, pointing gesture recognition is performed and a pointing result is returned, which includes returning the current trigger state and the current pointing coordinates. The trigger state is either the pointing trigger or the circling trigger; in step S4021 the returned trigger state is the pointing trigger, and the returned current pointing coordinates are the fingertip coordinates at the time of the pointing trigger (step S402).
In this embodiment, the fingertip coordinates at the time of the pointing trigger are returned directly in step S4021; in effect, the pointing trigger action serves not only as the trigger condition but also as the pointing gesture itself, which simplifies pointing gesture recognition. In other embodiments, the pointing trigger condition and the pointing gesture may be split into different motions; for example, some time after the pointing trigger, if the fingertip is detected to stay still again, the fingertip coordinates at that moment are used as the pointing result.
In step S403, if the detected predetermined trigger action is the circling trigger action, circling gesture recognition is started. As described above, the circling trigger condition is that a single fingertip stays within a predetermined range over a preset number of video frames. In one embodiment, in step S403, the current fingertip coordinates may be recorded as the trigger position, i.e., the start point of the fingertip's motion trajectory and thus of the circling trajectory.
It was noted above that when an error occurs during a long gesture (e.g., circling), interactivity degrades if the device fails to terminate the flow in time. Therefore, the embodiment of the invention adds a cancel-circling condition for terminating the current gesture recognition: once the cancel-circling condition is detected, gesture recognition is terminated, and a new, valid gesture recognition process can be started promptly, improving interactivity. On this basis, the circling gesture recognition of the present embodiment includes: terminating the current gesture recognition in response to detecting the cancel-circling condition; and/or returning a gesture recognition result in response to detecting the circling completion condition.
As shown in fig. 4, the circling gesture recognition may include steps S4301 to S4304. In step S4301, fingertip detection and fingertip tracking continue to be performed on the current frame, and further determination is performed according to the obtained fingertip information.
In step S4302, if the cancel-circling condition is detected, the process jumps directly back to step S401 to detect fingertip information anew and wait for a trigger condition to be generated again.
For the user, the cancel-circling condition is also a predetermined gesture; for the device, it is a judgment logic (cancel logic) in the circling gesture recognition process that decides whether the current circling process should be cancelled. When the cancel-circling condition is detected, the circling gesture recognition flow is exited, the fingertip information of the next video frame is processed, and the device waits for the trigger condition to be generated again. If the cancel-circling condition is not detected, step S4303 is executed to judge whether the circling process is complete.
Specifically, the cancel-circling conditions include an active cancel-circling condition and a passive cancel-circling condition. The active cancel-circling condition includes any of the following: a multi-finger open gesture, a fist gesture, a waving gesture (side-to-side or up-and-down), and moving the hand out of the recognition area. The passive cancel-circling condition is that the circling motion is detected to have stopped, i.e., fingertip tracking fails or no finger can be detected for a period of time.
The cancel-circling condition embodies the cancel logic; adding it lets the user autonomously interrupt the circling process the moment they realize it has gone wrong, making the product more flexible to use. The passive cancel-circling condition terminates circling gesture recognition promptly in the event of an algorithm error or other special cases. In addition, after the passive cancel-circling condition is detected, the user can be given a prompt (e.g., by sound or display) so that an erroneous result does not cause confusion, improving the user experience.
In step S4303, if the circling completion condition is not detected, indicating that the user is still circling, the process returns to step S4301 to continue detecting fingertip information. If the circling completion condition is detected, step S4304 is executed to return a circling result.
For the user, the circling completion condition is likewise a predetermined gesture. For the device, it is the other judgment logic (completion logic) in the circling gesture recognition process, whose purpose is to decide whether the current circling process is complete: if so, a circling result must be returned; if not, detection continues until it is. Specifically, when the circling completion condition is detected, step S4304 is executed to return the circling result. When it is not detected, the process jumps to step S4301 to continue processing the fingertip information of the next video frame, and steps S4302 to S4303 are executed again to repeatedly evaluate the cancel logic and the completion logic until the circling process is either cancelled or completed.
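The cancel-then-complete loop of steps S4301 to S4304 can be sketched as follows, with the two judgment logics abstracted as caller-supplied predicates (the function and parameter names are illustrative):

```python
def circling_recognition(frame_positions, cancelled, completed):
    """Per-frame loop of steps S4301-S4304: append the new fingertip
    position, test the cancel logic first, then the completion logic.
    `cancelled` and `completed` are predicates over the trajectory.
    Returns ('cancelled', None), ('circling', trajectory), or
    ('pending', trajectory) if the input ends before either fires."""
    trajectory = []
    for pos in frame_positions:
        trajectory.append(pos)
        if cancelled(trajectory):          # S4302: jump back to S401
            return ('cancelled', None)
        if completed(trajectory):          # S4303 -> S4304: return result
            return ('circling', list(trajectory))
    return ('pending', trajectory)
```

As fig. 4 shows, the cancel logic is evaluated before the completion logic here; the order could equally be swapped.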
The circling completion condition comprises condition (1) and condition (2). Condition (1): the motion trajectory of the fingertip passes back through its own start point. Condition (2): the motion trajectory of the fingertip passes through some other point of the trajectory.
Specifically, condition (1) indicates that the user's circled trajectory has returned to the vicinity of the start point of the fingertip's motion trajectory, and the corresponding circled area is the area enclosed by the entire fingertip coordinate sequence. The start point of the motion trajectory is the trigger position from step S403. That is, condition (1) indicates that the user has drawn a fairly standard circle, i.e., an "O"-shaped circle. Condition (2) indicates that the user's circled trajectory self-intersects, and the circled area is the closed region formed by the self-intersecting part of the fingertip coordinate sequence. That is, condition (2) allows the user a more flexible circling manner, such as a "6"-shaped or "9"-shaped circle.
The specific judgment of condition (1) may be made by comparing the end of the motion trajectory with its start point. For example, the fingertip positions in the most recent several (e.g., three) video frames may be examined to determine whether they all lie within a preset distance of the trajectory's start point. The preset distance may be set according to the actual situation, for example, by a preset threshold: if the Euclidean distance between the current fingertip position and the start point of the motion trajectory is smaller than the threshold, the current fingertip is considered to be near the start point.
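A sketch of this condition (1) check, with illustrative values for the frame count and distance threshold:

```python
import math

def near_start(trajectory, k=3, max_dist=20.0):
    """Condition (1): the fingertip positions in the most recent `k`
    frames all lie within `max_dist` (Euclidean distance) of the
    trajectory's start point."""
    if len(trajectory) < k + 1:
        return False
    start = trajectory[0]
    return all(math.dist(start, p) <= max_dist for p in trajectory[-k:])
```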
The specific judgment of condition (2) can be made by testing whether the straight line segments approximating the fingertip's motion trajectory (i.e., the circling trajectory) intersect. For example, adjacent coordinate points in the fingertip coordinate sequence corresponding to the trajectory are first connected into a group of straight line segments, which serves as an approximation of the circling trajectory. The most recent segment is then tested for intersection against each earlier segment in turn. If there is no intersection, the circling trajectory has not self-intersected; if it intersects some segment, the trajectory has self-intersected, and the fingertip coordinate sequence between the two intersecting segments is returned as the final circling trajectory. In other embodiments, other existing technical means may be used to judge condition (1) and condition (2); the invention is not limited in this respect.
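A sketch of this condition (2) check using the standard orientation-based segment intersection test; only the newest segment is tested against earlier, non-adjacent segments, as described above (function names are illustrative):

```python
def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _segments_intersect(p1, p2, p3, p4):
    """Proper intersection of segments p1p2 and p3p4 via orientation
    tests (collinear touching cases are ignored here for brevity)."""
    d1, d2 = _cross(p3, p4, p1), _cross(p3, p4, p2)
    d3, d4 = _cross(p1, p2, p3), _cross(p1, p2, p4)
    return (d1 > 0) != (d2 > 0) and (d3 > 0) != (d4 > 0)

def closed_part(trajectory):
    """Condition (2): test the newest segment against every earlier,
    non-adjacent segment; on the first hit, return the coordinate
    subsequence between the two intersecting segments as the final
    circling trajectory, else None."""
    if len(trajectory) < 4:
        return None
    a, b = trajectory[-2], trajectory[-1]      # newest segment
    for i in range(len(trajectory) - 3):       # skip the adjacent segment
        if _segments_intersect(trajectory[i], trajectory[i + 1], a, b):
            return trajectory[i:]
    return None
```

A "9"-shaped path whose tail crosses its opening segment is detected, while an open square outline is not.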
In step S4304, the circling result is returned, including the current trigger state and the current circling trajectory. The current trigger state is the circling trigger, and the current circling trajectory may be the trajectory corresponding to condition (1) or the trajectory corresponding to condition (2).
The results returned by step S4021 and step S4304 may each be a coordinate or a coordinate sequence and may be used by subsequent functions. For example, in an assisted-learning scenario, the returned coordinates or coordinate sequence may be matched against exercises, collecting the exercises in a workbook that the user pointed at or circled; or matched against paragraphs of text, with speech synthesis reading out the paragraph the user pointed at or circled.
As shown in fig. 4, the cancel logic is evaluated first and then the completion logic; in another embodiment, the completion logic may be evaluated first and then the cancel logic.
In summary, in the method for gesture recognition according to the embodiment of the present invention, the various judgment logics depend only on fingertip coordinates, so the method is simple and efficient to apply. The fingertip coordinates can be obtained from any existing or future target detection model, so an ordinary monocular camera suffices for the requirements of the invention.
Exemplary device
Having described the method of an exemplary embodiment of the present invention, related products of the method for gesture recognition according to exemplary embodiments of the present invention are next described with reference to fig. 5.
Fig. 5 schematically shows a schematic block diagram of an apparatus 1000 for gesture recognition according to an embodiment of the present invention. As shown in fig. 5, an apparatus 1000 for gesture recognition may include a processor 1001 and a memory 1002. Wherein the memory 1002 stores computer instructions for performing a method for gesture recognition according to an embodiment of the present invention. The computer instructions, when executed by the processor 1001, cause the apparatus 1000 to perform the method described above with respect to fig. 2 or 4. For example, in some embodiments, the apparatus for gesture recognition 1000 may perform a method for gesture recognition to recognize a gesture of a user.
In addition, in practical applications, the apparatus 1000 for gesture recognition may further include a capture device for capturing video frames. To present results, the apparatus 1000 for gesture recognition may further comprise a display or the like through which the apparatus or its modules interact with the user. Embodiments of the present invention do not limit the configurations the apparatus 1000 for gesture recognition may have.
It should be noted that although several steps of the method for gesture recognition are mentioned in the detailed description above, this division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the steps described above may be embodied in a single step; conversely, the features and functions of one step described above may be further divided among a plurality of steps.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Use of the verbs "comprise", "comprise" and their conjugations in this application does not exclude the presence of elements or steps other than those stated in this application. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects does not imply that features in those aspects cannot be combined to advantage; that division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims, whose scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (10)

1. A method for gesture recognition, comprising:
acquiring a video frame containing a hand in real time;
detecting whether a hand in the video frame contains a predetermined trigger action; and
performing gesture recognition in response to detecting the predetermined trigger action.
2. The method of claim 1, wherein detecting whether a hand in the video frame contains a predetermined trigger action comprises:
detecting fingertip information in the video frame;
determining from the fingertip information whether a hand in the video frame includes a predetermined trigger action.
3. The method of claim 2, wherein detecting fingertip information in the video frame comprises:
detecting a position of the fingertip; and
detecting the motion track of the fingertip.
4. The method of any of claims 2 to 3, wherein detecting whether a hand in the video frame contains a predetermined trigger action comprises:
if it is detected that a fingertip stays within a preset range in a preset number of video frames, determining that the hand in the video frames contains a predetermined trigger action.
5. The method of claim 4, wherein if it is detected that a plurality of fingertips stay within a predetermined range in a preset number of video frames, determining a predetermined trigger contained by a hand in the video frames as a pointing trigger action.
6. The method of claim 5, wherein in response to detecting the predetermined trigger action, performing gesture recognition comprises:
determining that the predetermined trigger action contained by the hand in the video frame is a pointing trigger action, and returning the current trigger state and the current pointing coordinates.
7. The method of claim 4, wherein if a single fingertip is detected to stay within a predetermined range within a preset number of video frames, determining a predetermined trigger contained by a hand in the video frames as a circling trigger action.
8. The method of claim 7, wherein in response to detecting the predetermined trigger action, performing gesture recognition comprises:
in response to detecting a cancel circling condition, terminating current gesture recognition; and/or
returning a gesture recognition result in response to detecting the circling completion condition.
9. An apparatus for gesture recognition, comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the method for gesture recognition according to any one of claims 1 to 8.
10. A computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to carry out a method for gesture recognition according to any one of claims 1 to 8.
CN202210398206.9A 2022-04-13 2022-04-13 Method for gesture recognition and related product Pending CN114898459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210398206.9A CN114898459A (en) 2022-04-13 2022-04-13 Method for gesture recognition and related product

Publications (1)

Publication Number Publication Date
CN114898459A true CN114898459A (en) 2022-08-12

Family

ID=82717735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210398206.9A Pending CN114898459A (en) 2022-04-13 2022-04-13 Method for gesture recognition and related product

Country Status (1)

Country Link
CN (1) CN114898459A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102713794A (en) * 2009-11-24 2012-10-03 奈克斯特控股公司 Methods and apparatus for gesture recognition mode control
CN103389799A (en) * 2013-07-24 2013-11-13 清华大学深圳研究生院 Method for tracking motion trail of fingertip
CN105975934A (en) * 2016-05-05 2016-09-28 中国人民解放军63908部队 Dynamic gesture identification method and system for augmented reality auxiliary maintenance
CN107256083A (en) * 2017-05-18 2017-10-17 河海大学常州校区 Many finger method for real time tracking based on KINECT
CN109344793A (en) * 2018-10-19 2019-02-15 北京百度网讯科技有限公司 Aerial hand-written method, apparatus, equipment and computer readable storage medium for identification
CN112462941A (en) * 2020-11-27 2021-03-09 深圳点猫科技有限公司 Teaching interaction method, device, system and medium based on gesture recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination