CN116092110A - Gesture semantic recognition method, electronic device, storage medium and program product

Gesture semantic recognition method, electronic device, storage medium and program product

Info

Publication number
CN116092110A
CN116092110A
Authority
CN
China
Prior art keywords
hand
gesture
semantic recognition
image frame
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111269531.7A
Other languages
Chinese (zh)
Inventor
邵笑飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jigan Technology Co ltd
Original Assignee
Beijing Jigan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jigan Technology Co ltd filed Critical Beijing Jigan Technology Co ltd
Priority to CN202111269531.7A priority Critical patent/CN116092110A/en
Publication of CN116092110A publication Critical patent/CN116092110A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture semantic recognition method, an electronic device, a storage medium and a program product. The method comprises: performing hand position detection on the image frames in an image frame sequence to be recognized to obtain target image frames, where a target image frame is an image frame containing a hand-lifting action; and performing gesture semantic recognition on the target image frames through a semantic recognition neural network to obtain a gesture semantic recognition result. In the method of the embodiments of the invention, spurious gestures are filtered out by a filtering strategy before the neural network judges the gesture, and only gestures made during a genuine hand-lifting action are input into the semantic recognition neural network. The network therefore recognizes only valid target images, which shortens gesture semantic recognition time, reduces false recognitions, and effectively improves the accuracy of the gesture semantic recognition result.

Description

Gesture semantic recognition method, electronic device, storage medium and program product
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a gesture semantic recognition method, an electronic device, a storage medium, and a program product.
Background
With the continuous development of artificial intelligence, the combination of artificial intelligence technology with electronic products has become widespread, giving rise to intelligent control devices such as smart televisions, motion-sensing game consoles and smart tablets. Existing intelligent control devices capture images through a camera and issue controls according to the gesture information in those images; in this control process, how to accurately recognize user gestures and reduce false recognitions is an important research topic.
In the prior art, a common gesture recognition approach is to directly predict the user's gesture with a trained neural network model and to infer, from the different gestures, the intent each gesture represents, so that the intelligent control device can be controlled accordingly. In practical application scenarios, however, the user sometimes makes other actions while not issuing any control gesture, and the hand pose in some of these states can be quite close to a target semantic, which easily causes the neural network to misjudge. For example, a hand that is merely in the middle of being raised or lowered, or a hand raised to hold an object, may be misjudged as a deliberate hand-lifting state, making the network's gesture judgment inaccurate and causing the intelligent control device to execute a wrong instruction.
Disclosure of Invention
Accordingly, the present invention is directed to a gesture semantic recognition method, an electronic device, a storage medium, and a program product, so as to reduce the time of gesture semantic recognition, reduce the occurrence of false recognition, and improve the accuracy of gesture semantic recognition results.
In a first aspect, an embodiment of the present invention provides a gesture semantic recognition method, where the method includes: performing hand position detection on the image frames in the image frame sequence to be identified to obtain target image frames; the target image frame is an image frame comprising a hand lifting action; and carrying out gesture semantic recognition on the target image frame through a semantic recognition neural network to obtain a gesture semantic recognition result.
Further, the step of detecting the hand position of the image frame in the image frame sequence to be identified to obtain the target image frame includes: the following operations are respectively carried out on the image frames in the image frame sequence to be identified: acquiring position information of a hand association point of a target object in an image frame; the hand association points comprise palm key points and elbow key points corresponding to the palm key points; judging whether the target object performs a hand lifting action or not according to the position information of the hand association points; if so, the image frame is determined to be the target image frame.
Further, the step of determining whether the target object performs the hand lifting action according to the position information of the hand association point includes: subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference; and if the hand position difference is greater than zero, determining that the target object performs the hand lifting action.
Further, the palm keypoints comprise a left palm keypoint and a right palm keypoint, and the elbow keypoint comprises a left elbow keypoint and a right elbow keypoint; the step of subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain the hand position difference comprises the following steps: subtracting the position of the right elbow key point from the position of the right palm key point to obtain a right hand position difference; subtracting the position of the left elbow key point from the position of the left palm key point to obtain a left hand position difference; if the hand position difference is greater than zero, determining that the target object performs the hand lifting action, including: and if the right hand position difference or the left hand position difference is larger than zero, determining that the target object performs the hand lifting action.
Further, the step of determining whether the target object performs the hand lifting action according to the position information of the hand association point includes: acquiring face position information of a target object in an image frame; and judging whether the target object performs the hand lifting action or not according to the position information of the hand association points and the face position information.
Further, the position information of the hand association point further includes a hand position frame, and the face position information is a face position frame; judging whether the target object performs a hand lifting action or not according to the position information of the hand association points and the face position information, wherein the step comprises the following steps: subtracting the position of the upper frame of the hand position frame from the position of the lower frame of the face position frame to obtain a face position difference; subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference; and judging whether the target object performs the hand lifting action or not according to the hand position difference and the face position difference.
Further, the step of determining whether the target object performs the hand lifting action according to the hand position difference and the face position difference includes: if the face position difference is less than the face position difference threshold, and/or if the hand position difference is greater than zero, determining that the target object performs a hand-lifting motion.
Further, the step of performing gesture semantic recognition on the target image frame through the semantic recognition neural network to obtain a gesture semantic recognition result includes: determining a hand sub-image containing a hand of the target object from the target image frame; and inputting the hand sub-images into a semantic recognition neural network according to the time sequence to obtain a gesture semantic recognition result.
Further, the target object is an object matched with an operator prestored in an image acquisition device, and the image acquisition device is a device for acquiring an image frame to be identified.
Further, the pre-stored operator is the operator closest to the image capturing apparatus.
In a second aspect, an embodiment of the present invention further provides an electronic device, including a processor and a memory, where the memory stores computer executable instructions that can be executed by the processor, where the processor executes the computer executable instructions to implement the gesture semantic recognition method of the first aspect.
In a third aspect, embodiments of the present invention further provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the gesture semantic recognition method of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program, and when the computer program is executed by a processor, implements the gesture semantic recognition method of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
The gesture semantic recognition method, electronic device, storage medium and program product provided by the embodiments of the invention perform hand position detection on the image frames in an image frame sequence to be recognized to obtain target image frames, where a target image frame is an image frame containing a hand-lifting action, and then perform gesture semantic recognition on the target image frames through a semantic recognition neural network to obtain a gesture semantic recognition result. Before the neural network judges the gesture, spurious gestures are filtered out by a filtering strategy and only gestures made during a genuine hand-lifting action are input into the semantic recognition neural network, so the network recognizes only valid target images. This shortens gesture semantic recognition time, reduces false recognitions, and effectively improves the accuracy of the gesture semantic recognition result.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part will be obvious from the description, or may be learned by practice of the techniques of the disclosure.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention or of the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a gesture semantic recognition method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a human body structure according to an embodiment of the present invention;
FIG. 4 is a flowchart of a scenario application of a gesture semantic recognition method according to an embodiment of the present invention;
FIG. 5 is a flowchart of another gesture semantic recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic illustration of another human body structure according to an embodiment of the present invention;
FIG. 7 is a flowchart of a dynamic gesture semantic recognition method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a gesture semantic recognition apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the protection scope of the present invention.
In recent years, artificial-intelligence-based research in computer vision, deep learning, machine learning, image processing and image recognition has advanced significantly. Artificial intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques and application systems for simulating and extending human intelligence. AI is a comprehensive discipline involving many technical fields, such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning and neural networks. Computer vision is an important branch of AI that specifically studies how to make machines 'see' the world; computer vision technologies typically include face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric feature recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, text recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, robot navigation and positioning, and so on. With the research and progress of AI technology, its applications have spread across many fields, such as security, city management, traffic management, building management, park management, face-based access control, face attendance, logistics management, warehouse management, robotics, intelligent marketing, computational photography, mobile-phone imaging, cloud services, smart homes, wearable devices, unmanned and autonomous driving, intelligent healthcare, face payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile Internet, live streaming, beauty and makeup applications, medical cosmetology, intelligent temperature measurement, and so on.
Given that the existing approach of directly feeding gesture images into a neural network model to predict the user's gesture easily causes the intelligent control device to execute wrong instructions, the gesture semantic recognition method, apparatus and electronic device provided by the embodiments of the present invention can shorten gesture semantic recognition time, reduce false recognitions, and improve the accuracy of gesture semantic recognition results.
Referring to fig. 1, a schematic diagram of an electronic system 100 is shown. The electronic system can be used for realizing the gesture semantic recognition method and device of the embodiment of the invention.
As shown in fig. 1, an electronic system 100 includes one or more processing devices 102, one or more storage devices 104, an input device 106, an output device 108, and one or more image capture devices 110, interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and configuration of the electronic system 100 shown in fig. 1 are exemplary only and not limiting, as the electronic system may have other components and configurations as desired.
The processing device 102 may be a server, a smart terminal, or a device that includes a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, may process data from other components in the electronic system 100, and may also control other components in the electronic system 100 to perform gesture-semantic recognition functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processing device 102 may execute these program instructions to implement the client functions and/or other desired functions in the embodiments of the invention described below. Various applications and data, such as data used and/or generated by the applications, may also be stored on the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image capture device 110 may acquire an image frame to be processed and store the image frame to be processed in the storage 104 for use by other components.
Illustratively, the components used to implement the gesture semantic recognition method, apparatus and electronic system according to the embodiments of the present invention may be arranged together or distributed, for example by integrating the processing device 102, storage device 104, input device 106 and output device 108 in one unit while placing the image capture device 110 at a designated location where images can be captured. When the devices of the electronic system are arranged together, the electronic system may be implemented as an intelligent terminal such as a camera, a smartphone, a tablet computer or a vehicle-mounted terminal.
Fig. 2 is a flowchart of a gesture semantic recognition method according to an embodiment of the present invention, as shown in fig. 2, the method includes the following steps:
s202: performing hand position detection on the image frames in the image frame sequence to be identified to obtain target image frames; the target image frame is an image frame comprising a hand lifting action;
the gesture semantic recognition method provided by the embodiment of the invention can be applied to electronic equipment with an intelligent control function, such as an intelligent television, an intelligent tablet computer and the like. Taking the electronic equipment as an intelligent television as an example, the intelligent television acquires an image frame sequence with gesture actions made by an operator through a camera device, for example, the intelligent television can acquire an image frame sequence formed by a plurality of images within 10 seconds each time, and semantic recognition is carried out according to the image frame sequence.
The gesture may be static: for example, a 'V' gesture may indicate that the operator wishes to return to the television's home page, an 'O' gesture that the operator confirms the current selection, or a 'hiss' gesture that the operator wishes to mute the television. The gesture may also be dynamic: sliding the open palm to the left may indicate that the operator wishes to go back to the previous interface, and sliding it to the right that the operator wishes to enter the next interface. It should be noted that when the gesture is dynamic, gesture semantic recognition must be performed on a video frame sequence within a preset time period, or on a sequence composed of a preset number of frames. The specific dynamic gesture semantic recognition method is described in detail later.
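For illustration, a minimal sketch of how such a gesture vocabulary might be mapped to device commands follows; the gesture labels and command names here are assumptions for the sketch, not prescribed by the embodiments.
```python
from typing import Optional

# Illustrative mapping from recognized gesture semantics to TV commands.
# All labels and command names here are assumptions, not part of the patent.
STATIC_GESTURES = {
    "V": "GO_HOME",      # return to the television home page
    "O": "CONFIRM",      # confirm the current selection
    "HISS": "MUTE",      # mute the television
}

DYNAMIC_GESTURES = {
    "PALM_SLIDE_LEFT": "BACK",       # back to the previous interface
    "PALM_SLIDE_RIGHT": "FORWARD",   # enter the next interface
}

def dispatch(gesture: str) -> Optional[str]:
    """Map a gesture semantic label to a control command, if one is defined."""
    return STATIC_GESTURES.get(gesture) or DYNAMIC_GESTURES.get(gesture)
```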
As noted above, in some of the frames of a video sequence the gesture made by the operator does not carry any semantics intended to control the smart television: the operator may be briefly picking something up, or the hand may pass through incidental poses while being raised or lowered. The embodiments of the present invention therefore perform hand position detection on each image frame in the sequence, and only when the detection result indicates that the action in a frame was made in a genuine hand-lifted state is that frame determined to be a target image frame.
S204: and carrying out gesture semantic recognition on the target image frame through a semantic recognition neural network to obtain a gesture semantic recognition result.
The identified target image frames are input into the semantic recognition neural network to obtain a gesture semantic recognition result, which may be, for example, a 'V' gesture, an 'O' gesture, or a dynamic right-slide gesture. The semantic recognition neural network may be a neural network model trained on sample images containing various gestures; the embodiments of the present invention do not limit its specific structure.
The gesture semantic recognition method provided by the embodiments of the present invention obtains target image frames by performing hand position detection on the image frames of the sequence to be recognized, and then performs gesture semantic recognition on the target image frames through a semantic recognition neural network to obtain a gesture semantic recognition result. In this way, gestures that the target object makes while not in a genuine hand-lifted state are filtered out of the image frame sequence, so the semantic recognition neural network recognizes only valid target images, which shortens gesture semantic recognition time, reduces false recognitions, and effectively improves the accuracy of the recognition result.
In some possible embodiments, the step of performing the hand position detection on the image frames in the image frame sequence to be identified to obtain the target image frame may specifically be performing the following operations on the image frames in the image frame sequence to be identified:
(1) Acquiring position information of a hand association point of a target object in an image frame; the hand association points comprise palm key points and elbow key points corresponding to the palm key points;
the target object is an operator making a gesture in the image frame, and for ease of understanding and presentation, embodiments of the present invention employ the operator to represent the target object.
The position information of the hand association points can be obtained through prediction by a neural network or through a general image recognition algorithm. The palm key point may be the center point of the joint between the palm and the forearm, and the elbow key point may be the center point of the joint between the upper arm and the forearm. It is understood that the palm key points include a left-hand palm key point and a right-hand palm key point, and the elbow key points include a left-hand elbow key point and a right-hand elbow key point.
(2) Judging whether the target object performs a hand lifting action or not according to the position information of the hand association points;
(3) If so, the image frame is determined to be the target image frame.
Specifically, the position of the elbow key point corresponding to a palm key point may be subtracted from the position of that palm key point to obtain a hand position difference. If the hand position difference is greater than zero, it is determined that the target object performs a hand-lifting action; otherwise, if the hand position difference is less than or equal to zero, the target object is not lifting a hand and the image frame is not a target image frame.
The hand position difference may be computed from the position information of the left-side or the right-side hand association points. For example, the position of the right-hand elbow key point may be subtracted from the position of the right-hand palm key point to obtain the right-hand position difference; if the right-hand position difference is greater than zero, the right hand is performing a lifting action and the image frame is a target image frame. Correspondingly, the position of the left-hand elbow key point may be subtracted from the position of the left-hand palm key point to obtain the left-hand position difference; if the left-hand position difference is greater than zero, the left hand is performing a lifting action and the image frame is a target image frame. Of course, the position information of the left-hand and right-hand association points may also be evaluated at the same time: if either the left-hand or the right-hand position difference is greater than zero, the corresponding side is performing a lifting action and the image frame is a target image frame. A minimal code sketch of this test follows.
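The sketch below assumes keypoint positions are given as vertical coordinates in a y-up convention (larger value = higher in the frame); in raw image coordinates, where y grows downward, the comparisons would flip.
```python
from typing import NamedTuple

class HandKeypoints(NamedTuple):
    """Vertical coordinates of the hand association points (y-up convention assumed)."""
    left_palm_y: float    # e.g. key point 2 in Fig. 3
    left_elbow_y: float   # e.g. key point 3 in Fig. 3
    right_palm_y: float   # e.g. key point 5 in Fig. 3
    right_elbow_y: float  # e.g. key point 6 in Fig. 3

def is_hand_lifted(kp: HandKeypoints) -> bool:
    """A frame is a target frame if either palm is above its corresponding elbow."""
    right_diff = kp.right_palm_y - kp.right_elbow_y  # right-hand position difference
    left_diff = kp.left_palm_y - kp.left_elbow_y     # left-hand position difference
    return right_diff > 0 or left_diff > 0
```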
Fig. 3 is a schematic diagram of a human body with the hand association points marked. Points 1-7 are the association points: 1 denotes the junction of the head and neck, 2 the operator's left-hand palm key point, 3 the left-hand elbow key point, and 4 a left forearm key point; correspondingly, 5 denotes the right-hand palm key point, 6 the right-hand elbow key point, and 7 a right forearm key point. The key points in Fig. 3 may be extracted by an image recognition neural network or by a scale-invariant feature transform (SIFT) algorithm; the embodiments of the present invention do not limit the method used to extract the feature points corresponding to the human skeleton.
The following describes, with reference to Fig. 3, how an embodiment of the present invention determines target image frames through the hand association points. Fig. 4 is a flowchart of a scenario application of the gesture semantic recognition method, described using the example of an operator who actually lifts the right hand and makes a 'V' gesture. The method includes the following steps:
S402: The camera of the smart television acquires 3 frames of images within the 1 second before the current moment.
S404: For the 1st frame, the relative positional relationships between the hand association points are judged.
Specifically, the relative positional relationship between key points 5 and 6, and that between key points 2 and 3, are judged respectively.
S406: It is determined that the operator raises neither the left hand nor the right hand in the 1st frame.
The judgment finds that the position of key point 5 is lower than that of key point 6, and the position of key point 2 is lower than that of key point 3, indicating that the operator raised neither the left hand nor the right hand in the 1st frame.
S408: The 2nd frame is judged next; the relative positional relationships between the hand association points are judged.
Likewise, the relative positional relationship between key points 5 and 6, and that between key points 2 and 3, are judged.
S410: It is determined that the operator lifts the right hand in the 2nd frame.
The judgment finds that key point 5 is higher than key point 6, i.e. subtracting the position of key point 6 from the position of key point 5 gives a result greater than zero, indicating that the operator lifts the right hand in the 2nd frame.
S412: The 3rd frame is judged next; the relative positional relationships between the hand association points are judged.
S414: It is determined that the operator lifts the right hand in the 3rd frame.
S416: The 2nd and 3rd frames are input into the semantic recognition neural network, which recognizes a 'V' gesture; according to the semantics of the 'V' gesture, the smart television issues a return-to-home command and displays the home page.
Through the above judgments, the operator lifts the right hand in the 2nd and 3rd frames, so these two frames are the target images, and they are input into the semantic recognition neural network to obtain the recognition result. Because the 1st frame, which contains no lifting action, is never input, the recognition speed and efficiency of the neural network are improved, interference from the 1st frame on the recognition result is avoided, and the recognition result is more accurate. A minimal sketch of this filter-then-recognize loop follows.
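The sketch reuses the is_hand_lifted test above; detect_keypoints and semantic_net stand in for the keypoint extractor and the semantic recognition neural network, whose interfaces are assumed here.
```python
def recognize_gestures(frames, detect_keypoints, semantic_net):
    """Filter a frame sequence to genuine hand-lift frames, then recognize.

    detect_keypoints: frame -> HandKeypoints     (assumed interface)
    semantic_net: list of frames -> gesture label (assumed interface)
    """
    # Keep only frames in which the operator genuinely lifts a hand.
    target_frames = [f for f in frames if is_hand_lifted(detect_keypoints(f))]
    if not target_frames:
        return None  # nothing valid to recognize in this window
    # Only valid target frames reach the network, e.g. frames 2 and 3 above.
    return semantic_net(target_frames)
```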
It should be noted that in a practical application scenario an operator may lift the left and right hands at the same time. In that case, the corresponding image frames are still target image frames and are input into the semantic recognition neural network. If the result shows that both hands make the same gesture, the left-hand and right-hand gestures are treated as one gesture and the corresponding semantics are determined from it; if the two hands make different gestures, the semantics may be determined according to a rule preset on the smart television, for example based on the right-hand gesture, as sketched below.
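A sketch of this two-hand resolution rule; the right-hand priority is just the example rule mentioned above, not a fixed choice.
```python
from typing import Optional

def resolve_two_hands(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Combine per-hand gesture labels into a single semantic result."""
    if left == right:
        return left            # identical gestures on both hands count once
    if left and right:
        return right           # preset rule assumed here: right hand wins
    return left or right       # only one hand made a recognizable gesture
```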
In a practical application scenario, objects in the background may sometimes be mistaken for elbow joints: when an operator sits in a chair, a point on the chair may be mistaken for one of the operator's skeletal points, and when several people face the camera, another person's skeletal points may be mistaken for the operator's. To eliminate the influence of such images on the gesture semantic recognition process and result, and to obtain a more accurate recognition result, an embodiment of the present invention provides another gesture semantic recognition method. Referring to the flowchart shown in Fig. 5, the method includes the following steps:
S502: acquiring position information of a hand association point of a target object in an image frame; the hand association points comprise palm key points and elbow key points corresponding to the palm key points;
s504: acquiring face position information of a target object in an image frame;
s506: judging whether the target object performs a hand lifting action or not according to the position information and the face position information of the hand association points;
in this step, the position of the elbow key point corresponding to the palm key point may be subtracted from the position of the palm key point to obtain the hand position difference. The judgment of the hand position difference may refer to the above-mentioned hand position difference judgment method provided by the embodiment of the present invention, and will not be described herein.
Since the lifting motion is generally considered to be an effective lifting motion when the highest point of the hand is higher than the lowest point of the face, the position information of the hand-related point in the embodiment further includes a hand position frame, and the face position information may be a face position frame, based on which the position of the lower frame of the face position frame may be subtracted from the position of the upper frame of the hand position frame to obtain a face position difference.
After the hand position difference and the face position difference are obtained, whether the target object performs the lifting operation is judged according to the hand position difference and the face position difference.
Specifically, the left face position difference obtained by subtracting the upper frame of the left hand position frame from the lower frame of the face position frame may be determined to be a hands-up motion if the left face position difference is smaller than zero, and may be considered to be a hands-up motion if the left face position difference is larger than zero but smaller than a smaller threshold. Accordingly, the same method can be used to determine whether the right hand is a lift motion. Conversely, if the left face position difference is less than or equal to zero and the right face position difference is also less than or equal to zero, this indicates that the operator is not lifting the left hand nor lifting the right hand in the image frame, i.e., the target object is not performing a lift motion.
S508: if yes, determining the image frame as a target image frame;
through the judgment, if the target object makes the hand lifting action, the image frame is the target image frame, otherwise, the image frame is not the target image frame and does not participate in the gesture semantic recognition process.
S510: and carrying out gesture semantic recognition on the target image frame through a semantic recognition neural network to obtain a gesture semantic recognition result.
Fig. 6 is a schematic diagram of a human body with both hand association point position information and face position information, additionally marking a hand frame and a face frame: frames 8, 9 and 10 in Fig. 6. Frame 8 represents the operator's face position and consists of an upper border (up), a lower border (down), and left and right borders (left, right). Frame 9 in Fig. 6 represents the operator's left-hand position, and frame 10 the right-hand position.
In some possible embodiments, the step of determining whether the target object performs the hand lifting motion according to the hand position difference and the face position difference may be determining that the target object performs the hand lifting motion if the face position difference is less than a face position difference threshold and if the hand position difference is greater than zero.
Specifically, the hand position difference may be judged first and then the face position difference, or both may be judged at the same time. For example, referring to Fig. 6, it is first judged whether the left palm is higher than the left elbow or the right palm is higher than the right elbow; if both palms are lower than their corresponding elbows, the operator is not lifting a hand. Otherwise, if one palm is higher than its corresponding elbow, the relationship between that side's hand frame and the face frame is judged further. Taking Fig. 6 as an example, if the right palm is higher than the right elbow, the relative position of the upper border of the right-hand frame and the lower border of the face frame is judged; if the difference between them is smaller than the face position difference threshold, the operator is determined to be making a right-hand lifting action, and otherwise the operator is determined to make no lifting action.
It should be noted that, to improve recognition accuracy, a fault-tolerance margin, namely the face position difference threshold mentioned above, may be set; the strictness of the face position difference judgment can be controlled by adjusting this threshold.
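A minimal sketch of the combined check, written in image coordinates (y grows downward, so 'higher' means a smaller y value); the default threshold value is an arbitrary assumption, and the 'or' variant mentioned below is obtained by swapping the and for an or.
```python
def is_lift_with_face_check(palm_y: float, elbow_y: float,
                            hand_box_top_y: float, face_box_bottom_y: float,
                            face_diff_threshold: float = 10.0) -> bool:
    """Hand-lift test combining the palm/elbow and hand-box/face-box criteria.

    Image coordinates assumed: y grows downward. face_diff_threshold is the
    fault-tolerance margin discussed above (its value here is an assumption).
    """
    hand_diff = elbow_y - palm_y                    # > 0 means palm above elbow
    face_diff = hand_box_top_y - face_box_bottom_y  # < 0 means hand top above face bottom
    return hand_diff > 0 and face_diff < face_diff_threshold
```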
In other possible embodiments, the step of determining whether the target object performs the lifting motion according to the hand position difference and the face position difference may further include determining that the target object performs the lifting motion if the face position difference is less than a face position difference threshold or if the hand position difference is greater than zero.
To improve the recognition efficiency of the neural network, after the target image frames are determined, a hand sub-image containing the target object's hand may be determined from each target image frame, and the hand sub-images are input into the semantic recognition neural network in time order to obtain the gesture semantic recognition result.
To determine the hand sub-image from a target image frame, feature extraction may be performed with a neural network, or other image feature extraction algorithms may be used.
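A sketch of the cropping-and-ordering step; the (x1, y1, x2, y2) hand-box format and the frame representation are assumptions.
```python
import numpy as np

def crop_hand(frame: np.ndarray, hand_box: tuple) -> np.ndarray:
    """Cut the hand sub-image out of a target frame; box is (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = hand_box
    return frame[y1:y2, x1:x2]

def hand_crops_in_time_order(timestamped_frames, hand_boxes):
    """timestamped_frames: list of (timestamp, frame); hand_boxes: parallel list."""
    order = sorted(range(len(timestamped_frames)),
                   key=lambda i: timestamped_frames[i][0])
    return [crop_hand(timestamped_frames[i][1], hand_boxes[i]) for i in order]
```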
It will be appreciated that the gesture made by the operator may be static, such as a 'V' gesture, or dynamic, such as a right-swipe gesture. The following describes how to perform semantic recognition on a dynamic gesture. Fig. 7 is a flowchart of a dynamic gesture semantic recognition method according to an embodiment of the present invention; the method includes the following steps:
S702: selecting a time segment;
For example, suppose dynamic semantic control requires a time slice longer than 0.3 s. A frame-count threshold is set according to the actual frame rate: at a frame rate of 20, i.e. 20 frames acquired per second, the time slice corresponds to 20 × 0.3 = 6 frames. This means that, at 20 frames per second, a gesture that appears continuously for more than 6 frames enters the dynamic semantic judgment logic, as computed in the helper below.
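The frame-count arithmetic as a small helper; rounding up is an assumption here (the patent's example divides evenly).
```python
import math

def frames_for_time_slice(frame_rate_hz: float, slice_seconds: float) -> int:
    """Number of frames a time slice spans at a given frame rate, rounded up."""
    return math.ceil(frame_rate_hz * slice_seconds)

assert frames_for_time_slice(20, 0.3) == 6  # the example above: 20 fps x 0.3 s
```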
S704: a start gesture and an end gesture are selected.
For example, the first appearance of an 'OK' gesture means start and the second appearance means end. The image frames between the start and the end enter the dynamic semantic judgment logic.
S706: if a static gesture is recognized between the start gesture and the end gesture and continuously appears in more than six frames, whether the operator makes an action representing the semantic meaning of a dynamic gesture is determined according to the spatial position relation of the static gesture in the data frame sequence where the static gesture appears.
For example, suppose a static gesture of five naturally extended fingers is recognized between the start gesture and the end gesture. If this static gesture is detected in six or more consecutive frames, the abscissas x1, x2, ..., x6 of its skeletal point (corresponding to point 5 or point 2 in Fig. 3 or Fig. 6) are judged further; if the abscissas show the hand sliding continuously to the right, for example x6 >= x5 >= ... >= x1, the operator is determined to be making a right-slide dynamic gesture.
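A sketch of this monotonicity test over the palm abscissas of the consecutive frames between the start and end gestures.
```python
def is_right_slide(palm_xs: list, min_frames: int = 6) -> bool:
    """True if the palm x-coordinates are non-decreasing over >= min_frames frames,
    i.e. x1 <= x2 <= ... <= xN, which is read as a continuous right slide."""
    if len(palm_xs) < min_frames:
        return False
    return all(later >= earlier for earlier, later in zip(palm_xs, palm_xs[1:]))
```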
Since several people making gestures may be identified while the device is under semantic control, in order to obtain a valid gesture the embodiment of the present invention determines, as the target object, the object that matches an operator pre-stored in the image acquisition device, where the image acquisition device is the device that acquires the image frames to be recognized.
In some possible embodiments, the pre-stored operator may be the operator closest to the image acquisition device.
In other possible embodiments, the pre-stored operator may be a pre-stored face image, and when a plurality of operators are present, the operator having the highest matching degree with the pre-stored face image is determined as the target object.
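A sketch of both target-selection strategies; the candidate structure and the face_similarity function are assumptions, and face-box area is used as a rough proxy for distance to the camera.
```python
def pick_target(candidates, stored_face=None, face_similarity=None):
    """Pick the target object among detected people.

    candidates: list of dicts with 'face_box' = (x1, y1, x2, y2) and 'face_crop'.
    With a stored face and a similarity function, pick the best match;
    otherwise pick the person closest to the camera (largest face box).
    """
    if stored_face is not None and face_similarity is not None:
        return max(candidates,
                   key=lambda c: face_similarity(c["face_crop"], stored_face))
    def face_area(c):
        x1, y1, x2, y2 = c["face_box"]
        return (x2 - x1) * (y2 - y1)
    return max(candidates, key=face_area)  # nearest operator, by face size proxy
```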
Based on the above method embodiment, the embodiment of the present invention further provides a gesture semantic recognition apparatus, as shown in fig. 8, where the apparatus includes:
the hand position detection module 802 is configured to detect a hand position of an image frame in the image frame sequence to be identified, so as to obtain a target image frame; the target image frame is an image frame comprising a hand lifting action;
the recognition module 804 is configured to perform gesture semantic recognition on the target image frame through the semantic recognition neural network, so as to obtain a gesture semantic recognition result.
The gesture semantic recognition apparatus provided by the embodiments of the present invention obtains target image frames by performing hand position detection on the image frames of the sequence to be recognized, and performs gesture semantic recognition on the target image frames through a semantic recognition neural network to obtain a gesture semantic recognition result. The apparatus filters out of the image frame sequence the gestures made by the target object while not in a genuine hand-lifted state and inputs only the gestures made during a genuine hand-lifting action into the semantic recognition neural network, so the network recognizes only valid target images, which shortens gesture semantic recognition time, reduces false recognitions, and effectively improves the accuracy of the gesture semantic recognition result.
The process of detecting the hand position of the image frame in the image frame sequence to be identified to obtain the target image frame includes: the following operations are respectively carried out on the image frames in the image frame sequence to be identified: acquiring position information of a hand association point of a target object in an image frame; the hand association points comprise palm key points and elbow key points corresponding to the palm key points; judging whether the target object performs a hand lifting action or not according to the position information of the hand association points; if so, the image frame is determined to be the target image frame.
The above-mentioned process for judging whether the target object performs the hand lifting action according to the position information of the hand association point includes: subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference; and if the hand position difference is greater than zero, determining that the target object performs the hand lifting action.
The palm key points comprise a left palm key point and a right palm key point, and the elbow key points comprise a left elbow key point and a right elbow key point; the above-mentioned process of subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain the hand position difference includes: subtracting the position of the right elbow key point from the position of the right palm key point to obtain a right hand position difference; subtracting the position of the left elbow key point from the position of the left palm key point to obtain a left hand position difference; if the hand position difference is greater than zero, determining that the target object performs the hand lifting action, including: and if the right hand position difference or the left hand position difference is larger than zero, determining that the target object performs the hand lifting action.
The above-mentioned process for judging whether the target object performs the hand lifting action according to the position information of the hand association point includes: acquiring face position information of a target object in an image frame; and judging whether the target object performs the hand lifting action or not according to the position information of the hand association points and the face position information.
The position information of the hand association points further comprises a hand position frame, and the face position information is the face position frame; the above-mentioned process for judging whether the target object performs the hand lifting action according to the position information and the face position information of the hand association points includes: subtracting the position of the upper frame of the hand position frame from the position of the lower frame of the face position frame to obtain a face position difference; subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference; and judging whether the target object performs the hand lifting action or not according to the hand position difference and the face position difference.
The above-mentioned process of judging whether the target object carries out the hand lifting action according to the hand position difference and the face position difference includes: if the face position difference is less than the face position difference threshold, and/or if the hand position difference is greater than zero, determining that the target object performs a hand-lifting motion.
The process for performing gesture semantic recognition on the target image frame through the semantic recognition neural network to obtain a gesture semantic recognition result comprises the following steps: determining a hand sub-image containing a hand of the target object from the target image frame; and inputting the hand sub-images into a semantic recognition neural network according to the time sequence to obtain a gesture semantic recognition result.
The target object is an object matched with an operator prestored in an image acquisition device, and the image acquisition device is a device for acquiring an image frame to be identified.
The pre-stored operator is the operator closest to the image acquisition device.
The gesture semantic recognition apparatus provided by the embodiments of the present invention has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, for anything not mentioned in this apparatus embodiment, reference may be made to the corresponding content of the foregoing gesture semantic recognition method embodiments.
An embodiment of the present invention further provides an electronic device. Fig. 9 is a schematic structural diagram of this electronic device, which includes a processor 901 and a memory 902; the memory 902 stores computer-executable instructions executable by the processor 901, and the processor 901 executes them to implement the gesture semantic recognition method above.
In the embodiment shown in fig. 9, the electronic device further comprises a bus 903 and a communication interface 904, wherein the processor 901, the communication interface 904 and the memory 902 are connected by the bus 903.
The memory 902 may include high-speed random access memory (RAM) and may further include non-volatile memory, such as at least one disk memory. The communication connection between this system network element and at least one other network element is realized through at least one communication interface 904 (wired or wireless), which may use the Internet, a wide area network, a local network, a metropolitan area network, etc. The bus 903 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bidirectional arrow is shown in Fig. 9, but this does not mean there is only one bus or one type of bus.
The processor 901 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits in hardware or by software-form instructions in the processor 901. The processor 901 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may reside in storage media mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor 901 reads the information in the memory and completes, in combination with its hardware, the steps of the gesture semantic recognition method of the foregoing embodiments.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the gesture semantic recognition method above; for the implementation, see the foregoing embodiments, which are not repeated here.
The computer program product of the gesture semantic recognition method, electronic device, storage medium and program product provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions in the program code can be used to execute the methods described in the foregoing method embodiments, and for the specific implementation, reference may be made to those embodiments.
Unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present invention.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile, processor-executable computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above examples are only specific embodiments of the present invention, and are not intended to limit the scope of the present invention, but it should be understood by those skilled in the art that the present invention is not limited thereto, and that the present invention is described in detail with reference to the foregoing examples: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method of gesture semantic recognition, the method comprising:
performing hand position detection on the image frames in the image frame sequence to be identified to obtain target image frames; the target image frame is an image frame comprising a hand lifting action;
and carrying out gesture semantic recognition on the target image frame through a semantic recognition neural network to obtain a gesture semantic recognition result.
2. The method of claim 1, wherein the step of performing hand position detection on the image frames in the sequence of image frames to be identified to obtain the target image frame comprises:
performing the following operations on each image frame in the image frame sequence to be identified:
acquiring position information of hand association points of a target object in the image frame; the hand association points comprise palm key points and elbow key points corresponding to the palm key points;
judging whether the target object performs a hand lifting action or not according to the position information of the hand association points;
and if so, determining the image frame as the target image frame.
3. The method according to claim 2, wherein the step of judging whether the target object performs a hand lifting action or not according to the position information of the hand association points comprises:
subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference;
and if the hand position difference is greater than zero, determining that the target object performs a hand lifting action.
4. The method of claim 3, wherein the palm key points comprise a left palm key point and a right palm key point, and the elbow key points comprise a left elbow key point and a right elbow key point;
the step of subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference comprises:
subtracting the position of the right elbow key point from the position of the right palm key point to obtain a right hand position difference;
subtracting the position of the left elbow key point from the position of the left palm key point to obtain a left hand position difference;
and the step of determining that the target object performs a hand lifting action if the hand position difference is greater than zero comprises:
determining that the target object performs the hand lifting action if the right hand position difference or the left hand position difference is greater than zero.
5. The method according to any one of claims 2-4, wherein the step of judging whether the target object performs a hand lifting action or not according to the position information of the hand association points comprises:
acquiring face position information of the target object in the image frame;
and judging whether the target object performs a hand lifting action or not according to the position information of the hand association points and the face position information.
6. The method of claim 5, wherein the position information of the hand association points further comprises a hand position frame, and the face position information is a face position frame;
the step of judging whether the target object performs a hand lifting action or not according to the position information of the hand association points and the face position information comprises:
subtracting the position of the upper edge of the hand position frame from the position of the lower edge of the face position frame to obtain a face position difference;
subtracting the position of the elbow key point corresponding to the palm key point from the position of the palm key point to obtain a hand position difference;
and judging whether the target object performs a hand lifting action or not according to the hand position difference and the face position difference.
7. The method of claim 6, wherein the step of judging whether the target object performs a hand lifting action or not according to the hand position difference and the face position difference comprises:
determining that the target object performs a hand lifting action if the face position difference is smaller than a face position difference threshold and/or the hand position difference is greater than zero.
8. The method according to any one of claims 1-7, wherein the step of performing gesture semantic recognition on the target image frame through a semantic recognition neural network to obtain a gesture semantic recognition result comprises:
determining a hand sub-image containing the hand of the target object from the target image frame;
and inputting the hand sub-images into the semantic recognition neural network in time-sequence order to obtain the gesture semantic recognition result.
9. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 8.
10. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
11. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
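For readers who want the hand-lifting test of claims 2-4 in concrete form, the following Python sketch restates it under stated assumptions: the Keypoint class and all function names are invented for illustration, and because pixel coordinates grow downward, the sketch computes elbow minus palm so that a positive difference still means "palm above elbow", which is what the claims' "greater than zero" test expresses.

    from dataclasses import dataclass

    @dataclass
    class Keypoint:
        x: float
        y: float  # pixel coordinates: y increases downward

    def hand_position_diff(palm: Keypoint, elbow: Keypoint) -> float:
        # Claims 3-4: hand position difference. With y growing downward,
        # a raised palm sits above its elbow, so elbow.y - palm.y > 0
        # plays the role of the claims' "difference greater than zero".
        return elbow.y - palm.y

    def is_hand_lifted(left_palm: Keypoint, left_elbow: Keypoint,
                       right_palm: Keypoint, right_elbow: Keypoint) -> bool:
        # Claim 4: a hand lifting action is detected if either the left or
        # the right hand position difference is greater than zero.
        return (hand_position_diff(left_palm, left_elbow) > 0
                or hand_position_diff(right_palm, right_elbow) > 0)

Frames for which is_hand_lifted returns True are the target image frames of claim 2; all other frames are filtered out before the semantic recognition neural network runs.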
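Claims 5-7 refine this test with face position information. The sketch below follows the claim arithmetic literally; the Box layout, the threshold value FACE_DIFF_THRESHOLD and the coordinate conventions are assumptions, since the patent specifies none of them.

    from dataclasses import dataclass

    @dataclass
    class Box:
        left: float
        top: float
        right: float
        bottom: float

    # Hypothetical value; the patent does not state a threshold.
    FACE_DIFF_THRESHOLD = 40.0

    def face_position_diff(face_box: Box, hand_box: Box) -> float:
        # Claim 6: lower edge of the face position frame minus upper edge
        # of the hand position frame.
        return face_box.bottom - hand_box.top

    def is_hand_lifted_with_face(face_box: Box, hand_box: Box,
                                 hand_diff: float,
                                 threshold: float = FACE_DIFF_THRESHOLD) -> bool:
        # Claim 7: lifted if the face position difference is below the
        # threshold and/or the hand position difference is greater than zero.
        return (face_position_diff(face_box, hand_box) < threshold
                or hand_diff > 0)

Here hand_diff is the value produced by the claims 3-4 computation above; combining the two cues is what lets the filter reject gestures made while the hand is not actually raised.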
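Claim 8's final step, cropping a hand sub-image from each target image frame and feeding the crops to the semantic recognition neural network in time order, can likewise be sketched. The (left, top, right, bottom) box layout and the semantic_net calling convention are assumptions; the patent fixes neither.

    import numpy as np

    def crop_hand(frame: np.ndarray, box: tuple) -> np.ndarray:
        # Claim 8: determine the hand sub-image containing the hand of the
        # target object. `box` is assumed to be (left, top, right, bottom)
        # in pixel coordinates.
        left, top, right, bottom = (int(v) for v in box)
        return frame[top:bottom, left:right]

    def recognize_gesture(target_frames, hand_boxes, semantic_net):
        # Claim 8: input the hand sub-images into the semantic recognition
        # neural network in time-sequence order. `semantic_net` is a
        # stand-in; the patent does not specify its architecture or API.
        crops = [crop_hand(frame, box)
                 for frame, box in zip(target_frames, hand_boxes)]
        return semantic_net(crops)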
CN202111269531.7A 2021-10-29 2021-10-29 Gesture semantic recognition method, electronic device, storage medium and program product Pending CN116092110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111269531.7A CN116092110A (en) 2021-10-29 2021-10-29 Gesture semantic recognition method, electronic device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111269531.7A CN116092110A (en) 2021-10-29 2021-10-29 Gesture semantic recognition method, electronic device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN116092110A true CN116092110A (en) 2023-05-09

Family

ID=86210631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111269531.7A Pending CN116092110A (en) 2021-10-29 2021-10-29 Gesture semantic recognition method, electronic device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN116092110A (en)

Similar Documents

Publication Title
CN107358149B (en) Human body posture detection method and device
CN107784282B (en) Object attribute identification method, device and system
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
CN108197589B (en) Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
CN109657533A (en) Pedestrian recognition methods and Related product again
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
CN108009466B (en) Pedestrian detection method and device
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
EP2336949B1 (en) Apparatus and method for registering plurality of facial images for face recognition
CN109727275B (en) Object detection method, device, system and computer readable storage medium
CN111626163B (en) Human face living body detection method and device and computer equipment
CN106648078B (en) Multi-mode interaction method and system applied to intelligent robot
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN113632097B (en) Method, device, equipment and storage medium for predicting relevance between objects
CN113569598A (en) Image processing method and image processing apparatus
CN111104925A (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN113348465A (en) Method, device, equipment and storage medium for predicting relevance of object in image
CN113392741A (en) Video clip extraction method and device, electronic equipment and storage medium
CN113557546B (en) Method, device, equipment and storage medium for detecting associated objects in image
CN114359618A (en) Training method of neural network model, electronic equipment and computer program product
CN113591921A (en) Image recognition method and device, electronic equipment and storage medium
KR101961462B1 (en) Object recognition method and the device thereof
CN108875467B (en) Living body detection method, living body detection device and computer storage medium
CN115311723A (en) Living body detection method, living body detection device and computer-readable storage medium
CN116092110A (en) Gesture semantic recognition method, electronic device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination