CN111913585A - Gesture recognition method, device, equipment and storage medium


Info

Publication number
CN111913585A
CN111913585A (application CN202010997902.2A)
Authority
CN
China
Prior art keywords
video data
hand
control instruction
target
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010997902.2A
Other languages
Chinese (zh)
Inventor
李文栋 (Li Wendong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010997902.2A, published as CN111913585A
Legal status: Withdrawn

Classifications

    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a gesture recognition method, apparatus, device and storage medium, and relates to the fields of computer vision and artificial intelligence. The specific implementation scheme is as follows: acquiring video data containing hand motions; obtaining a target movement trajectory based on the hand motion presented by the video data; obtaining text information corresponding to the hand motion presented by the video data based on the target movement trajectory, and determining a control instruction corresponding to the text information, wherein the control instruction can instruct a target device to perform a corresponding operation. In this way, hand motions can be effectively recognized, in particular mid-air handwriting motions, and corresponding control instructions generated, which enriches the modes and usage scenarios of gesture recognition and improves the user experience.

Description

Gesture recognition method, device, equipment and storage medium
Technical Field
The application relates to the field of computers, in particular to the fields of computer vision and artificial intelligence. The present application is also applicable to the field of autonomous driving.
Background
Existing human-computer interaction scenarios include multiple interaction modes, such as function keys and knobs, touch screens, voice recognition, and gesture recognition, which enrich application scenarios and improve the user experience. However, most existing gesture-recognition interaction modes must be combined with a touch screen, which limits their usage scenarios; for example, in an in-vehicle interaction scenario, implementing the interaction function through a touch screen inevitably increases safety hazards during driving.
Disclosure of Invention
The application provides a gesture recognition method, a gesture recognition device, gesture recognition equipment and a storage medium.
According to an aspect of the present application, there is provided a gesture recognition method including:
acquiring video data containing hand movements;
obtaining a target movement trajectory based on the hand motion presented by the video data;
obtaining text information corresponding to the hand motion presented by the video data based on the target movement trajectory, and determining a control instruction corresponding to the text information, wherein the control instruction can instruct a target device to perform a corresponding operation.
According to another aspect of the present application, there is provided a gesture recognition apparatus including:
a data acquisition unit, configured to acquire video data containing hand motions;
a trajectory determining unit, configured to obtain a target movement trajectory based on the hand motion presented by the video data;
a text information determining unit, configured to obtain text information corresponding to the hand motion presented by the video data based on the target movement trajectory;
and an instruction determining unit, configured to determine a control instruction corresponding to the text information, wherein the control instruction can instruct the target device to perform a corresponding operation.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method described above.
According to this scheme, hand motions, for example mid-air handwriting motions, are effectively recognized and corresponding control instructions are generated, which enriches the modes and usage scenarios of gesture recognition and lays a foundation for simplifying user operation and improving the user experience.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flowchart of a first implementation of a gesture recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a second implementation of a gesture recognition method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a gesture recognition apparatus according to an embodiment of the present application;
FIG. 4 is a block diagram of an electronic device for implementing a gesture recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details to aid understanding, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present application provides a gesture recognition method, applied to a gesture recognition apparatus. Specifically, fig. 1 is a schematic flowchart of an implementation of the gesture recognition method according to an embodiment of the present application; as shown in fig. 1, the method includes:
step S101: video data including hand movements is acquired.
Step S102: and obtaining a target movement track based on the hand motion presented by the video data.
Step S103: obtaining text information corresponding to the hand motion presented by the video data based on the target moving track, and determining a control instruction corresponding to the text information, wherein the control instruction can instruct a target device to perform corresponding operation.
Therefore, the hand action can be effectively recognized, for example, the spaced handwriting action is recognized, the corresponding text information is obtained, and then the control instruction corresponding to the text information is obtained, so that the gesture recognition mode is enriched, the gesture recognition use scene is enriched, and a foundation is laid for simplifying the user operation and improving the user experience.
Here, in practical applications, a mapping relationship between text information and control instructions may be preset; after the text information is determined, the corresponding control instruction can be looked up in the preset mapping table, thereby completing the control operation.
In practical applications, the text information mainly consists of characters, for example a single character or a character string, where a character may be a symbol, a letter, or a word; the control instruction corresponding to the text information is then determined based on the semantics that the text information conveys.
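To make the flow concrete, below is a minimal Python sketch of steps S101 to S103 together with the mapping lookup described above. The two model stubs and the INSTRUCTION_MAP entries are hypothetical placeholders; the patent does not disclose concrete models or table contents.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]

# Hypothetical stand-ins for the two trained networks; the patent does not
# disclose their architectures or interfaces.
def extract_trajectory(frames: List[object]) -> List[Point]:
    """Stand-in for the trajectory network (step S102)."""
    return [(0.0, 0.0), (1.0, 1.0)]  # dummy two-point track

def trajectory_to_text(trajectory: List[Point]) -> str:
    """Stand-in for the handwriting recognition model (step S103)."""
    return "play songs"  # fixed dummy result

# Assumed text-to-instruction mapping table; real entries are product-specific.
INSTRUCTION_MAP = {
    "play songs": "CMD_MUSIC_PLAY_RANDOM",
    "temperature up": "CMD_AC_TEMP_UP",
}

def recognize_gesture(frames: List[object]) -> Optional[str]:
    trajectory = extract_trajectory(frames)  # S102: target movement trajectory
    text = trajectory_to_text(trajectory)    # S103: text information
    return INSTRUCTION_MAP.get(text)         # mapping table lookup; None if no match

print(recognize_gesture([]))  # -> CMD_MUSIC_PLAY_RANDOM
```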
Here, the scheme of the application can be applied to in-vehicle interaction scenarios. For example, while a vehicle is being driven, the driver does not need to operate a touch screen or physical keys; the in-vehicle system, such as an on-board device, can be controlled through mid-air hand motions. Compared with existing touch or key operation, this enriches human-computer interaction modes, improves the user experience, and lays a foundation for meeting the needs of different users. Moreover, because the scheme can recognize mid-air handwriting and thus provide a mid-air handwriting input function, it can effectively avoid safety hazards during driving in the in-vehicle interaction scenario, laying a foundation for improving driving safety.
In a specific example of the scheme of the application, considering users' habits in real scenarios, such as writing in the air with a finger, and in order to improve recognition accuracy and avoid invalid recognition, only the movement trajectory of the finger may be determined and taken as the target movement trajectory. Specifically, step S102 may include: obtaining the target movement trajectory based on finger movement features in the hand motion presented by the video data. This yields an effective target movement trajectory and lays a foundation for subsequently improving the accuracy of text recognition.
In a specific example of the scheme of the application, a pre-trained model can be used for recognition to further improve accuracy; this improves recognition efficiency on the one hand and the accuracy of the recognition result on the other. Specifically, step S102 may include: inputting a sequence of video frames from the video data, containing at least the hand motion, into a preset neural network model to obtain the target movement trajectory. The preset neural network model is trained on sample videos annotated with movement trajectories; each sample video contains hand motions, and the annotated trajectory matches the hand motion of the sample video.
In practical applications, considering model processing efficiency, the acquired video data can be preprocessed to remove video frames that do not contain hand motion, yielding a sequence of frames that do; only this sequence is then input into the preset neural network model for trajectory recognition, which improves the model's processing efficiency.
Of course, in an actual scenario, considering the image processing and computing capability of the device, all of the video data may instead be input into the preset neural network model to identify the movement trajectory. The choice can be made based on the device's actual processing capability, which the present application does not limit.
Likewise, considering users' habits in real scenarios, such as writing in the air with a finger, only the video frame sequence containing finger movement features may be recognized to obtain the target movement trajectory, which improves recognition efficiency and lays a foundation for subsequently improving the accuracy of the recognition result.
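The preprocessing and inference just described might look as follows. Here detect_hand is a hypothetical detector callable (a real system could plug in a detection network), and the model is assumed to expose a Keras-style predict method returning fingertip coordinates; neither interface is specified by the patent.

```python
from typing import Callable, List
import numpy as np

def filter_hand_frames(frames: List[np.ndarray],
                       detect_hand: Callable[[np.ndarray], bool]) -> List[np.ndarray]:
    """Drop frames without hand motion so the model only processes relevant input."""
    return [f for f in frames if detect_hand(f)]

def predict_trajectory(model, frames: List[np.ndarray]) -> np.ndarray:
    """Run the preset neural network on a frame sequence of shape (T, H, W, C)
    and return a (T, 2) array of fingertip coordinates forming the trajectory."""
    batch = np.stack(frames)[np.newaxis, ...]  # add a batch dimension: (1, T, H, W, C)
    return model.predict(batch)[0]             # assumed output shape: (1, T, 2)
```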
In a specific example of the present application, when performing text recognition, a text recognition model may be used to recognize the target movement trajectory. Specifically, in step S103, obtaining text information corresponding to the hand motion presented by the video data based on the target movement trajectory includes: inputting the target movement trajectory into a preset recognition model to obtain probability features (such as probability values) indicating that the trajectory represents preset characters; and determining the character information represented by the trajectory based on those probability features, so as to obtain the text information corresponding to the hand motion presented by the video data. Here, the preset recognition model is trained on mapping relationships between movement trajectories and characters.
That is, in this example, a preset recognition model (for example, a text recognition model) yields the probability that the target movement trajectory corresponds to a certain character or character string, from which the text information of the trajectory is determined; for example, the text content of the character or character string whose probability exceeds a preset threshold is taken as the text information of the target movement trajectory. This improves recognition efficiency as well as the accuracy of the recognition result.
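A sketch of this threshold rule is below; the candidate inventory and the threshold value are assumptions for illustration, not values from the patent.

```python
from typing import Optional
import numpy as np

CANDIDATES = ["play songs", "pause", "next song", "volume up"]  # assumed inventory
THRESHOLD = 0.6                                                 # assumed preset threshold

def decode_text(probabilities: np.ndarray) -> Optional[str]:
    """Return the candidate with the highest probability if it clears the
    preset threshold; otherwise no text information is produced."""
    best = int(np.argmax(probabilities))
    return CANDIDATES[best] if probabilities[best] > THRESHOLD else None

print(decode_text(np.array([0.05, 0.10, 0.80, 0.05])))  # -> next song
```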
In a specific example of the present application, the gesture recognition apparatus implementing the gesture recognition method can also control the state of the image acquisition device; for example, as shown in fig. 2, the method further includes:
Step S001: detecting a start instruction. In practical applications, the start instruction may be an instruction generated from a voice input, or a trigger generated by another user operation; the present application does not limit this.
Step S002: in response to the start instruction, triggering the image acquisition device indicated by the start instruction to capture images of hand motions in its acquisition area, obtaining video data.
This further improves the intelligence of gesture recognition and lays a foundation for effectively recognizing hand motions. Moreover, because the image acquisition device is started by a start instruction, the scheme can be configured flexibly according to the requirements of the actual scenario, laying a foundation for meeting the needs of users in different scenarios and enriching the user experience.
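As a sketch of steps S001 and S002, the snippet below waits for a start signal and then records a short clip from the indicated camera using OpenCV. The queue-based signal channel is an assumed mechanism; a real system might receive the instruction from a voice recognizer or a button handler instead.

```python
import queue
import cv2

start_signals: "queue.Queue[int]" = queue.Queue()  # assumed channel carrying a camera index

def capture_on_start(max_frames: int = 120):
    """Block until a start instruction arrives (S001), then capture frames
    of the acquisition area from the indicated camera (S002)."""
    camera_index = start_signals.get()    # S001: a start instruction is detected
    cap = cv2.VideoCapture(camera_index)  # S002: open the indicated device
    frames = []
    try:
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
    finally:
        cap.release()
    return frames
```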
In a specific example of the scheme of the application, the control instruction may be an instruction in an in-vehicle environment. In that case, based on the control instruction, the vehicle indicated by the instruction may be operated accordingly, for example controlling the air conditioning to adjust the temperature; or an on-board device in the indicated vehicle may be operated accordingly, for example controlling a smart speaker to play music. This simplifies user operation, meets users' human-computer interaction needs, and further improves the user experience.
In this way, hand motions can be effectively recognized, for example mid-air handwriting motions, to obtain the corresponding text information and, from it, the corresponding control instruction. This enriches the modes and usage scenarios of gesture recognition and lays a foundation for simplifying user operation and improving the user experience.
The following provides further detail with reference to a specific example. Specifically, the present application combines image-based gesture recognition with handwriting-recognition input: a camera collects gesture images of the user, characters or character strings written by the user in mid-air are recognized, and the recognized text is converted into a corresponding software function. This implements a mid-air handwriting recognition input function analogous to voice recognition, greatly expands in-vehicle image-recognition interaction, avoids safety hazards during driving, and lays a foundation for improving driving safety.
The present example specifically contains several key conditions and steps:
In an in-vehicle environment, camera hardware suitable for image acquisition and a matching software environment are configured; for example, the apparatus implementing the scheme of the present application is integrated into the vehicle's central control unit, so that the central control unit can provide the mid-air handwriting recognition input function. On this basis, after the camera is started (for example, it starts image acquisition by itself once the vehicle is started, is triggered by a user operation, or is started by the user's voice), the user writes characters in mid-air through gestures, for example 'play songs'. The camera captures a video of the handwriting gesture, and a first convolutional neural network, implementing image gesture recognition, recognizes the trajectory drawn by the user's finger. The recognized trajectory is then converted into the corresponding text by a second convolutional neural network, which implements handwriting input recognition. Finally, the recognized text is mapped to the corresponding software function operation, for example a control instruction that starts the music player, which then plays songs at random, realizing human-computer interaction based on mid-air gestures.
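The final hand-off from recognized text to a software function could be as simple as the dispatch table sketched below; the MusicPlayer class and the handler entries are illustrative assumptions, not part of the patent.

```python
import random

class MusicPlayer:
    """Illustrative stand-in for the in-vehicle music player."""
    def play_random(self) -> None:
        print(f"playing track #{random.randint(1, 100)}")

player = MusicPlayer()

# Recognized text -> software function operation (assumed entries).
HANDLERS = {
    "play songs": player.play_random,
}

def dispatch(text: str) -> None:
    handler = HANDLERS.get(text)
    if handler is not None:
        handler()  # e.g. start the music player and play a random song

dispatch("play songs")  # triggered by the mid-air handwriting "play songs"
```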
By combining image gesture recognition, handwriting input recognition, and semantic analysis, the scheme greatly expands the uses of gesture recognition. It is particularly suitable for scenarios in which a driver needs to interact with the in-vehicle system, and it improves interaction safety at the same time.
In addition, during the recognition process of this scheme, the user does not need to memorize special gestures, which improves convenience and simplifies user operation.
The present application further provides a gesture recognition apparatus, as shown in fig. 3, including:
a data acquisition unit 301 configured to acquire video data including a hand motion;
a trajectory determining unit 302, configured to obtain a target movement trajectory based on the hand motion presented by the video data;
a text information determining unit 303, configured to obtain text information corresponding to the hand motion presented by the video data based on the target movement trajectory;
an instruction determining unit 304, configured to determine a control instruction corresponding to the text information, where the control instruction is capable of instructing a target device to perform a corresponding operation.
In a specific example of the scheme of the application, the trajectory determining unit is further configured to obtain the target movement trajectory based on finger movement features in the hand motion presented by the video data.
In a specific example of the application, the trajectory determining unit is further configured to input a sequence of video frames from the video data, containing at least the hand motion, into a preset neural network model to obtain the target movement trajectory; the preset neural network model is trained on sample videos annotated with movement trajectories, where each sample video contains hand motions and the annotated trajectory matches the hand motion of the sample video.
In a specific example of the scheme of the present application, the text information determining unit includes:
a model subunit, configured to input the target movement trajectory into a preset recognition model to obtain probability features indicating that the trajectory represents preset characters;
and a character processing subunit, configured to determine the character information represented by the target movement trajectory based on those probability features, so as to obtain the text information corresponding to the hand motion presented by the video data.
In a specific example of the scheme of the present application, the apparatus further includes:
a starting unit, configured to detect a start instruction;
and an image acquisition unit, configured to respond to the start instruction by triggering the image acquisition device indicated by the start instruction to capture images of hand motions in its acquisition area, obtaining video data.
In a specific example of the scheme of the present application, the apparatus further includes:
a control unit, configured to perform, based on the control instruction, a corresponding operation on the vehicle indicated by the control instruction; or to perform, based on the control instruction, a corresponding operation on an on-board device in the vehicle indicated by the control instruction.
In this way, hand motions can be effectively recognized, for example mid-air handwriting motions, to obtain the corresponding text information and, from it, the corresponding control instruction. This enriches the modes and usage scenarios of gesture recognition and lays a foundation for simplifying user operation and improving the user experience.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other ways as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the gesture recognition methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the gesture recognition method provided herein.
The memory 402, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the gesture recognition method in the embodiment of the present application (for example, the data acquisition unit 301, the trajectory determination unit 302, the text information determination unit 303, and the instruction determination unit 304 shown in fig. 3). The processor 401 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 402, that is, implements the gesture recognition method in the above method embodiment.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device by the gesture recognition method, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the gesture recognition method electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the gesture recognition method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the gesture recognition method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system, which overcomes the drawbacks of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services.
According to the technical scheme of the embodiments of the application, hand motions can be effectively recognized, for example mid-air handwriting motions, to obtain the corresponding text information and the control instruction corresponding to that text information. This enriches the modes and usage scenarios of gesture recognition and lays a foundation for simplifying user operation and improving the user experience.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited in this regard as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A gesture recognition method, comprising:
acquiring video data containing hand movements;
obtaining a target movement trajectory based on the hand motion presented by the video data;
obtaining text information corresponding to the hand motion presented by the video data based on the target movement trajectory, and determining a control instruction corresponding to the text information, wherein the control instruction can instruct a target device to perform a corresponding operation.
2. The method of claim 1, wherein the obtaining a target movement trajectory based on the hand motion presented by the video data comprises:
obtaining the target movement trajectory based on finger movement features in the hand motion presented by the video data.
3. The method of claim 1 or 2, wherein the obtaining a target movement trajectory based on the hand motion presented by the video data comprises:
inputting a sequence of video frames from the video data, containing at least the hand motion, into a preset neural network model to obtain the target movement trajectory; wherein the preset neural network model is trained on sample videos annotated with movement trajectories, each sample video contains hand motions, and the annotated trajectory matches the hand motion of the sample video.
4. The method of claim 1, wherein the obtaining text information corresponding to the hand motion presented by the video data based on the target movement trajectory comprises:
inputting the target movement trajectory into a preset recognition model to obtain probability features indicating that the trajectory represents preset characters;
and determining the character information represented by the target movement trajectory based on those probability features, so as to obtain the text information corresponding to the hand motion presented by the video data.
5. The method of claim 1, further comprising:
detecting a start instruction;
and in response to the start instruction, triggering the image acquisition device indicated by the start instruction to capture images of hand motions in its acquisition area, obtaining video data.
6. The method of claim 1 or 5, further comprising:
based on the control instruction, performing a corresponding operation on the vehicle indicated by the control instruction; or,
performing a corresponding operation on an on-board device in the vehicle indicated by the control instruction based on the control instruction.
7. A gesture recognition apparatus comprising:
a data acquisition unit, configured to acquire video data containing hand motions;
a trajectory determining unit, configured to obtain a target movement trajectory based on the hand motion presented by the video data;
a text information determining unit, configured to obtain text information corresponding to the hand motion presented by the video data based on the target movement trajectory;
and an instruction determining unit, configured to determine a control instruction corresponding to the text information, wherein the control instruction can instruct the target device to perform a corresponding operation.
8. The apparatus according to claim 7, wherein the trajectory determining unit is further configured to obtain the target movement trajectory based on finger movement features in the hand motion presented by the video data.
9. The apparatus according to claim 7 or 8, wherein the trajectory determining unit is further configured to input a sequence of video frames from the video data, containing at least the hand motion, into a preset neural network model to obtain the target movement trajectory; wherein the preset neural network model is trained on sample videos annotated with movement trajectories, each sample video contains hand motions, and the annotated trajectory matches the hand motion of the sample video.
10. The apparatus of claim 7, wherein the text information determining unit comprises:
a model subunit, configured to input the target movement trajectory into a preset recognition model to obtain probability features indicating that the trajectory represents preset characters;
and a character processing subunit, configured to determine the character information represented by the target movement trajectory based on those probability features, so as to obtain the text information corresponding to the hand motion presented by the video data.
11. The apparatus of claim 7, further comprising:
a starting unit, configured to detect a start instruction;
and an image acquisition unit, configured to respond to the start instruction by triggering the image acquisition device indicated by the start instruction to capture images of hand motions in its acquisition area, obtaining video data.
12. The apparatus of claim 7 or 11, further comprising:
a control unit, configured to perform, based on the control instruction, a corresponding operation on the vehicle indicated by the control instruction; or to perform, based on the control instruction, a corresponding operation on an on-board device in the vehicle indicated by the control instruction.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010997902.2A 2020-09-21 2020-09-21 Gesture recognition method, device, equipment and storage medium Withdrawn CN111913585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010997902.2A CN111913585A (en) 2020-09-21 2020-09-21 Gesture recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010997902.2A CN111913585A (en) 2020-09-21 2020-09-21 Gesture recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111913585A true CN111913585A (en) 2020-11-10

Family

ID=73265348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010997902.2A Withdrawn CN111913585A (en) 2020-09-21 2020-09-21 Gesture recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111913585A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112274920A (en) * 2020-11-24 2021-01-29 智博云信息科技(广州)有限公司 Virtual reality gesture control method, platform, server and readable storage medium
CN112527110A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Non-contact interaction method and device, electronic equipment and medium
CN113038216A (en) * 2021-03-10 2021-06-25 深圳创维-Rgb电子有限公司 Instruction obtaining method, television, server and storage medium
CN113204283A (en) * 2021-04-30 2021-08-03 Oppo广东移动通信有限公司 Text input method, text input device, storage medium and electronic equipment
CN113325950A (en) * 2021-05-27 2021-08-31 百度在线网络技术(北京)有限公司 Function control method, device, equipment and storage medium
CN113810536A (en) * 2021-08-02 2021-12-17 惠州Tcl移动通信有限公司 Method, device and terminal for displaying information based on motion trajectory of human body in video

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092343A (en) * 2013-01-06 2013-05-08 深圳创维数字技术股份有限公司 Control method based on camera and mobile terminal
CN104216514A (en) * 2014-07-08 2014-12-17 深圳市华宝电子科技有限公司 Method and device for controlling vehicle-mounted device, and vehicle
CN105579319A (en) * 2013-03-12 2016-05-11 罗伯特·博世有限公司 System and method for identifying handwriting gestures in an in-vehicle information system
CN106295599A (en) * 2016-08-18 2017-01-04 乐视控股(北京)有限公司 The control method of vehicle and device
CN108170266A (en) * 2017-12-25 2018-06-15 珠海市君天电子科技有限公司 Smart machine control method, device and equipment
CN109033954A (en) * 2018-06-15 2018-12-18 西安科技大学 A kind of aerial hand-written discrimination system and method based on machine vision
CN109032356A (en) * 2018-07-27 2018-12-18 深圳绿米联创科技有限公司 Sign language control method, apparatus and system
CN111367415A (en) * 2020-03-17 2020-07-03 北京明略软件***有限公司 Equipment control method and device, computer equipment and medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092343A (en) * 2013-01-06 2013-05-08 深圳创维数字技术股份有限公司 Control method based on camera and mobile terminal
CN105579319A (en) * 2013-03-12 2016-05-11 罗伯特·博世有限公司 System and method for identifying handwriting gestures in an in-vehicle information system
CN104216514A (en) * 2014-07-08 2014-12-17 深圳市华宝电子科技有限公司 Method and device for controlling vehicle-mounted device, and vehicle
CN106295599A (en) * 2016-08-18 2017-01-04 乐视控股(北京)有限公司 The control method of vehicle and device
CN108170266A (en) * 2017-12-25 2018-06-15 珠海市君天电子科技有限公司 Smart machine control method, device and equipment
CN109033954A (en) * 2018-06-15 2018-12-18 西安科技大学 A kind of aerial hand-written discrimination system and method based on machine vision
CN109032356A (en) * 2018-07-27 2018-12-18 深圳绿米联创科技有限公司 Sign language control method, apparatus and system
CN111367415A (en) * 2020-03-17 2020-07-03 北京明略软件***有限公司 Equipment control method and device, computer equipment and medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112274920A (en) * 2020-11-24 2021-01-29 智博云信息科技(广州)有限公司 Virtual reality gesture control method, platform, server and readable storage medium
CN112274920B (en) * 2020-11-24 2022-05-31 亓乐(北京)文化科技有限公司 Virtual reality gesture control method, platform, server and readable storage medium
CN112527110A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Non-contact interaction method and device, electronic equipment and medium
CN113038216A (en) * 2021-03-10 2021-06-25 深圳创维-Rgb电子有限公司 Instruction obtaining method, television, server and storage medium
CN113204283A (en) * 2021-04-30 2021-08-03 Oppo广东移动通信有限公司 Text input method, text input device, storage medium and electronic equipment
CN113325950A (en) * 2021-05-27 2021-08-31 百度在线网络技术(北京)有限公司 Function control method, device, equipment and storage medium
CN113325950B (en) * 2021-05-27 2023-08-25 百度在线网络技术(北京)有限公司 Function control method, device, equipment and storage medium
CN113810536A (en) * 2021-08-02 2021-12-17 惠州Tcl移动通信有限公司 Method, device and terminal for displaying information based on motion trajectory of human body in video
CN113810536B (en) * 2021-08-02 2023-12-12 惠州Tcl移动通信有限公司 Information display method, device and terminal based on human limb action track in video

Similar Documents

Publication Publication Date Title
CN111913585A (en) Gesture recognition method, device, equipment and storage medium
JP7078808B2 (en) Real-time handwriting recognition management
CN112131988B (en) Method, apparatus, device and computer storage medium for determining virtual character lip shape
CN112507735B (en) Training method and device of machine translation model and electronic equipment
CN111931591A (en) Method and device for constructing key point learning model, electronic equipment and readable storage medium
JP7281521B2 (en) Voice control method and voice control device, electronic device and storage medium
CN111968631B (en) Interaction method, device, equipment and storage medium of intelligent equipment
CN112099645A (en) Input image generation method and device, electronic equipment and storage medium
CN111225236A (en) Method and device for generating video cover, electronic equipment and computer-readable storage medium
CN112825013A (en) Control method and device of terminal equipment
CN111966212A (en) Multi-mode-based interaction method and device, storage medium and smart screen device
CN112383805A (en) Method for realizing man-machine interaction at television end based on human hand key points
CN112269867A (en) Method, device, equipment and storage medium for pushing information
CN111708477B (en) Key identification method, device, equipment and storage medium
JP2022020574A (en) Information processing method and apparatus in user dialogue, electronic device, and storage media
CN112036315A (en) Character recognition method, character recognition device, electronic equipment and storage medium
JP2022028667A (en) Method for updating user image recognition model, device, electronic apparatus, computer-readable recording medium, and computer program
EP3654205A1 (en) Systems and methods for generating haptic effects based on visual characteristics
CN111443853B (en) Digital human control method and device
CN111027195B (en) Simulation scene generation method, device and equipment
CN111638787A (en) Method and device for displaying information
CN111736799A (en) Voice interaction method, device, equipment and medium based on man-machine interaction
US20220328076A1 (en) Method and apparatus of playing video, electronic device, and storage medium
CN116167426A (en) Training method of face key point positioning model and face key point positioning method
CN112788390B (en) Control method, device, equipment and storage medium based on man-machine interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211015

Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd.

Address before: 2/F, Baidu Building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20201110