CN114549809A - Gesture recognition method and related equipment


Info

Publication number
CN114549809A
CN114549809A
Authority
CN
China
Prior art keywords
image
gesture
target
gesture recognition
key
Prior art date
Legal status
Pending
Application number
CN202210166611.8A
Other languages
Chinese (zh)
Inventor
郝江伟 (Hao Jiangwei)
Current Assignee
Shenzhen TCL New Technology Co Ltd
Original Assignee
Shenzhen TCL New Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen TCL New Technology Co Ltd filed Critical Shenzhen TCL New Technology Co Ltd
Priority to CN202210166611.8A
Publication of CN114549809A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a gesture recognition method and related equipment. The method comprises: acquiring images to be gesture-recognized for a target object, wherein the images to be gesture-recognized comprise a visible light image and an infrared light image; performing dim-light detection on the visible light image to obtain a dim-light detection result; selecting a target image from the images to be gesture-recognized according to the dim-light detection result; performing body key part detection on the target image to obtain a body key region of the target image; and performing gesture recognition on the body key region to obtain a gesture recognition result of the target object. According to the embodiments of the application, the target image for gesture recognition can be selected according to the dim-light detection result and gesture recognition then performed on it, so that the detection effect in a dim-light environment is improved and the applicability is enhanced.

Description

Gesture recognition method and related equipment
Technical Field
The application relates to the technical field of computers, in particular to a gesture recognition method and related equipment.
Background
With the research and development of artificial intelligence technology, it has been applied in many fields, such as human-computer interaction. Gesture recognition is currently a key research task in the field of human-computer interaction and has wide application value. By recognizing the gesture of a human body in an image, the behavior of the human body can be judged, which can be widely applied to various smart home devices; in addition, human-computer interaction can be performed through gesture recognition, enabling the development of various human-computer interaction applications.
However, in the current related art, gesture recognition algorithms perform poorly in dim-light scenes and therefore place high demands on lighting, which limits the application of gesture recognition and lowers the applicability of the gesture recognition algorithms.
Disclosure of Invention
The embodiments of the application provide a gesture recognition method and related equipment, where the related equipment includes a gesture recognition apparatus, an electronic device, a computer-readable storage medium, and a computer program product; they can improve the detection effect in a dim-light environment and enhance applicability.
The embodiment of the application provides a gesture recognition method, which comprises the following steps:
acquiring images to be gesture-recognized for a target object, wherein the images to be gesture-recognized comprise a visible light image and an infrared light image;
performing dim-light detection on the visible light image to obtain a dim-light detection result;
selecting a target image from the images to be gesture-recognized according to the dim-light detection result;
performing body key part detection on the target image to obtain a body key region of the target image;
and performing gesture recognition on the body key region to obtain a gesture recognition result of the target object.
Correspondingly, the embodiment of the present application provides a gesture recognition apparatus, including:
an acquisition unit, configured to acquire images to be gesture-recognized for a target object, wherein the images to be gesture-recognized comprise a visible light image and an infrared light image;
a dim-light detection unit, configured to perform dim-light detection on the visible light image to obtain a dim-light detection result;
a selecting unit, configured to select a target image from the images to be gesture-recognized according to the dim-light detection result;
a body key part detection unit, configured to perform body key part detection on the target image to obtain a body key region of the target image;
and a gesture recognition unit, configured to perform gesture recognition on the body key region to obtain a gesture recognition result of the target object.
Optionally, in some embodiments of the present application, the selecting unit may include a first determining subunit and a second determining subunit, as follows:
the first determining subunit is configured to determine the visible light image among the images to be gesture-recognized as the target image when the dim-light detection result indicates that the confidence of a dim-light scene is smaller than a preset value;
and the second determining subunit is configured to determine the infrared light image among the images to be gesture-recognized as the target image when the dim-light detection result indicates that the confidence of the dim-light scene is not smaller than the preset value.
Optionally, in some embodiments of the present application, the gesture recognition apparatus further includes a preprocessing unit, as follows:
the preprocessing unit is configured to scale the images to be gesture-recognized to obtain scaled images to be gesture-recognized, and to normalize the pixel values of the pixel points in the scaled images to obtain normalized images to be gesture-recognized.
Optionally, in some embodiments of the present application, the body key part detection unit may include a first extraction subunit, a first upsampling subunit, and a detection subunit, as follows:
the first extraction subunit is configured to extract feature maps at multiple scales from the target image;
the first up-sampling subunit is configured to up-sample the feature maps at the multiple scales to obtain a target feature map of the target image;
and the detection subunit is configured to perform body key part detection on the target feature map of the target image according to a preset key part template image to obtain the body key region of the target image.
Optionally, in some embodiments of the present application, the gesture recognition unit may include an extension subunit, a second extraction subunit, a second upsampling subunit, and a recognition subunit, as follows:
the expansion subunit is configured to expand the body key region to obtain a key region expansion image;
the second extraction subunit is configured to perform multi-scale feature extraction on the key region expansion image to obtain feature maps at multiple scales corresponding to the key region expansion image;
the second up-sampling subunit is configured to up-sample the feature maps at the multiple scales to obtain a target feature map of the key region expansion image;
and the recognition subunit is configured to perform gesture recognition on the target feature map of the key region expansion image according to a preset gesture template image to obtain a target gesture region and a gesture category corresponding to the target object.
Optionally, in some embodiments of the application, the recognition subunit may be specifically configured to: perform gesture recognition on the target feature map of the key region expansion image by sliding a preset gesture template image over it, to obtain at least one candidate gesture region corresponding to the target feature map; determine a target gesture region from the candidate gesture regions according to the similarity between the preset gesture template image and each candidate gesture region; and determine the gesture category recognized from the target gesture region as the gesture category corresponding to the target object.
The electronic device provided by the embodiments of the application comprises a processor and a memory, the memory storing a plurality of instructions that the processor loads to execute the steps of the gesture recognition method provided by the embodiments of the application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the gesture recognition method provided by the embodiments of the present application.
In addition, an embodiment of the present application provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the gesture recognition method provided by the embodiments of the present application.
The embodiments of the application provide a gesture recognition method and related equipment that can acquire images to be gesture-recognized for a target object, where the images comprise a visible light image and an infrared light image; perform dim-light detection on the visible light image to obtain a dim-light detection result; select a target image from the images to be gesture-recognized according to the dim-light detection result; perform body key part detection on the target image to obtain a body key region of the target image; and perform gesture recognition on the body key region to obtain a gesture recognition result of the target object. According to the embodiments of the application, the target image for gesture recognition can be selected according to the dim-light detection result and gesture recognition then performed on it, so that the detection effect in a dim-light environment is improved and the applicability is enhanced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1a is a schematic view of a scene of a gesture recognition method provided in an embodiment of the present application;
FIG. 1b is a flowchart of a gesture recognition method provided by an embodiment of the present application;
FIG. 1c is another flowchart of a gesture recognition method provided in an embodiment of the present application;
FIG. 1d is another flowchart of a gesture recognition method provided by an embodiment of the present application;
FIG. 2 is another flow chart of a gesture recognition method provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of a gesture recognition apparatus provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The embodiment of the application provides a gesture recognition method and related equipment, and the related equipment can comprise a gesture recognition device, electronic equipment, a computer readable storage medium and a computer program product. The gesture recognition apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal or a server.
It is understood that the gesture recognition method of this embodiment may be executed on a terminal, on a server, or jointly by a terminal and a server. The above examples should not be construed as limiting the present application.
As shown in Fig. 1a, the gesture recognition method is performed jointly by the terminal and the server. The gesture recognition system provided by this embodiment of the application includes a terminal 10, a server 11, and the like; the terminal 10 and the server 11 are connected via a network, for example a wired or wireless network, and the gesture recognition apparatus may be integrated in the server.
The server 11 may be configured to: acquiring an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image; carrying out dark light detection on the visible light image to obtain a dark light detection result; selecting a target image from the image to be recognized by the gesture according to the dim light detection result; detecting key body parts of the target image to obtain a key body area of the target image; and performing gesture recognition on the body key area to obtain a gesture recognition result of the target object, and sending the gesture recognition result to the terminal 10. The server 11 may be a single server, or may be a server cluster or a cloud server composed of a plurality of servers.
The terminal 10 may be configured to: acquire images to be gesture-recognized for a target object, wherein the images comprise a visible light image and an infrared light image; send the images to be gesture-recognized to the server 11; and receive the gesture recognition result sent by the server 11. The terminal 10 may include a mobile phone, a smart television, a tablet computer, a notebook computer, a personal computer (PC), or the like. A client, which may be an application client or a browser client, may also be provided on the terminal 10.
The gesture recognition steps performed by the server 11 may alternatively be executed by the terminal 10.
The gesture recognition method provided by the embodiment of the application relates to a computer vision technology in the field of artificial intelligence.
Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and the like.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it refers to using cameras and computers, instead of human eyes, to identify, track, and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment will be described from the perspective of a gesture recognition apparatus, which may be specifically integrated in an electronic device, where the electronic device may be a server or a terminal, and the like.
As shown in fig. 1b, the specific flow of the gesture recognition method may be as follows:
101. Acquire images to be gesture-recognized for a target object, wherein the images to be gesture-recognized comprise a visible light image and an infrared light image.
The target object is the object whose gesture needs to be recognized; there may be one or more target objects. The images to be gesture-recognized may specifically be images containing gesture information.
The infrared (IR) image may be an infrared image acquired by an infrared sensor; the visible light image may be a natural-light color image acquired by a color sensor, specifically an RGB (Red-Green-Blue) image.
In some embodiments, the images to be gesture-recognized of the target object may be acquired by a camera, which may be a three-dimensional (3D) camera equipped with related hardware and software such as an infrared camera.
In some specific scenes, the gestures of human bodies in a captured video stream need to be recognized; by recognizing the human gestures in each frame of the video stream, human behavior can be judged, which can be widely applied to various smart home devices. In addition, human-computer interaction can be performed through human gesture recognition, enabling the development of various human-computer interaction applications. However, in the current related art, human gesture recognition algorithms detect poorly at long distances or in dim-light scenes and place high demands on lighting, which limits the application of human gesture recognition. To address these problems, the application combines infrared images to provide a real-time long-distance gesture recognition method that is not constrained by lighting and can quickly and accurately recognize the gestures of multiple persons on an embedded terminal or a server, thereby enabling all-weather, all-scene human-computer interaction.
102. Perform dim-light detection on the visible light image to obtain a dim-light detection result.
The dim-light detection may specifically analyze the brightness of the pixel points in the visible light image; for example, the average brightness of the pixel points in the visible light image may be calculated and used as the dim-light detection result.
Optionally, in this embodiment, before the step of "performing dim-light detection on the visible light image to obtain a dim-light detection result", the method may further include:
scaling the images to be gesture-recognized to obtain scaled images to be gesture-recognized;
and normalizing the pixel values of the pixel points in the scaled images to be gesture-recognized to obtain normalized images to be gesture-recognized.
Scaling the images to be gesture-recognized means reducing or enlarging their size; for example, in this embodiment the images to be gesture-recognized (which may specifically include the visible light image and the infrared light image) may be scaled to a width of 640 pixels and a height of 384 pixels.
Optionally, the pixel values of the pixel points in the images to be gesture-recognized may be normalized first, and the normalized images then scaled, for example to a width of 640 pixels and a height of 384 pixels, before body key part detection is performed on the scaled images.
The images to be gesture-recognized may be normalized using preset parameters, which may be set according to actual conditions; specifically, the preset parameters may limit the pixel values of the pixel points in the images to a certain range.
Specifically, the normalization of the pixel values of the pixel points in the image to be gesture-recognized can be expressed by the following formula (1):

x'_i = (x_i - min) / (max - min)        (1)

where min and max are the preset parameters of the above embodiment: min is the minimum pixel value in the image and max is the maximum pixel value in the image; specifically, in this embodiment min may take the value 0 and max the value 255. x_i represents the pixel value of each pixel point in the image to be gesture-recognized before normalization, and x'_i represents the pixel value of each pixel point in the normalized image to be gesture-recognized.
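As an illustrative sketch only (the patent specifies no code), the scaling and formula (1) normalization could be implemented as follows; the function name and the use of OpenCV/NumPy are assumptions:

```python
import cv2
import numpy as np

def preprocess(image, width=640, height=384, min_val=0.0, max_val=255.0):
    """Scale an image to the network input size, then min-max normalize.

    Implements x'_i = (x_i - min) / (max - min) from formula (1),
    with min = 0 and max = 255 as in this embodiment.
    """
    resized = cv2.resize(image, (width, height))
    normalized = (resized.astype(np.float32) - min_val) / (max_val - min_val)
    return normalized  # pixel values now lie in [0, 1]
```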
103. Select a target image from the images to be gesture-recognized according to the dim-light detection result.
Optionally, in this embodiment, the step of "selecting a target image from the images to be gesture-recognized according to the dim-light detection result" may include:
when the dim-light detection result indicates that the confidence of a dim-light scene is smaller than a preset value, determining the visible light image among the images to be gesture-recognized as the target image;
and when the dim-light detection result indicates that the confidence of the dim-light scene is not smaller than the preset value, determining the infrared light image among the images to be gesture-recognized as the target image.
The preset value may be set according to actual conditions, which is not limited in this embodiment. Specifically, the dim-light detection result may include the average brightness of the pixel points in the visible light image, which may be used to represent the confidence of a dim-light scene: the higher the average brightness, the lower the confidence of the dim-light scene, and conversely, the lower the average brightness, the higher the confidence.
In some embodiments, the target image may be a preprocessed image, for example one that has been scaled and normalized.
In this embodiment, if the dim-light detection result indicates that the scene of the target object is a dim-light scene, gesture recognition of the target object needs to be performed based on the infrared light image; if it indicates a non-dim-light scene, gesture recognition can be performed directly on the visible light image without resorting to the infrared light image.
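A minimal sketch of this selection logic, assuming the mean luminance of the visible light image is the dim-light cue; the threshold of 60 (on a 0-255 scale) is an illustrative value, not taken from the patent:

```python
import numpy as np

def select_target_image(visible, infrared, brightness_threshold=60.0):
    """Choose the image used for gesture recognition.

    A low mean brightness of the visible-light image corresponds to a high
    dim-light confidence, in which case the infrared image is selected.
    """
    gray = visible.mean(axis=2) if visible.ndim == 3 else visible
    mean_brightness = float(gray.mean())
    if mean_brightness < brightness_threshold:  # dim-light confidence high
        return infrared
    return visible  # dim-light confidence below the preset value
```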
104. Perform body key part detection on the target image to obtain the body key region of the target image.
Body key part detection may detect the specific position of a body key part of the target object in the target image, so that the body key region of the target image can be determined from the detected position. Specifically, the body key part may be the head and shoulders of the target object, or the upper-body region of the target object; this embodiment does not limit it, and it may be set according to actual conditions. It can be understood that the body key region contains the gesture region of the target object.
In some embodiments, a pre-trained neural network model may be used to perform body key region detection on the target image to obtain the coordinates of the body key regions of all persons in the target image. The neural network model may be MobileNetV2, Inception, EfficientNet, VGGNet (Visual Geometry Group Network), ResNet (Residual Network), DenseNet (Densely Connected Convolutional Network), and the like; it should be understood that the neural network model of this embodiment is not limited to the types listed above.
The input to the neural network model may be a target image with a width of 640 pixels and a height of 384 pixels.
Optionally, in this embodiment, the step of "performing body key part detection on the target image to obtain a body key region of the target image" may include:
extracting feature maps at multiple scales from the target image;
up-sampling the feature maps at the multiple scales to obtain a target feature map of the target image;
and performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain the body key region of the target image.
Extracting the feature maps of the target image at multiple scales may specifically be done with the convolutional part of a neural network model; for example, MobileNetV2 contains four residual group structures, from which feature maps of four sizes can be obtained.
In some embodiments, each feature map may be up-sampled by a factor of 2 using a pyramid feature fusion method and then spliced channel-wise with the corresponding feature map of the next lower group.
Specifically, the step of "up-sampling the feature maps at the multiple scales to obtain the target feature map of the target image" may include:
up-sampling the feature map at the target scale to obtain up-sampled fusion feature maps of the target image at multiple scales, wherein the up-sampling input at each scale is the fusion feature obtained by fusing the up-sampled feature map of the adjacent scale with the feature map at that scale;
and determining the target feature map of the target image from the up-sampled fusion feature maps at the respective scales.
A low-resolution feature map can be restored to high resolution by up-sampling; the essence of up-sampling is to enlarge the image by interpolation, and the interpolation method may be nearest-neighbor, bilinear, cubic convolution, and the like.
The target scale is the lowest of the multiple scales. The statement that the up-sampling input at each scale is the fusion feature obtained by fusing the up-sampled feature map of the adjacent scale with the feature map specifically means: the up-sampling input at each scale is the up-sampled fusion feature map of the adjacent scale, i.e., the fusion feature obtained by fusing the up-sampled feature map of the adjacent scale with the feature map at that scale is the up-sampled fusion feature map of that adjacent scale, where the up-sampled feature map at each scale is obtained by up-sampling the up-sampled fusion feature map of the adjacent scale.
Fusion here refers to feature fusion; fusing features at different scales can improve their representational capacity. Low-level features have higher resolution and contain more detail but, having passed through fewer convolutions, are noisier and less semantic; high-level features carry strong semantic information but have low resolution and lose much detail. Fusing multi-layer, i.e., multi-scale, features can therefore improve the representational power of the target feature map. There are various fusion methods: for example, the up-sampled feature map can be spliced with the feature map at the same scale, or the pixels of the up-sampled feature map and the corresponding feature map at the same scale can be added. It is understood that the fusion method is not limited to the above examples, and this embodiment does not limit it.
The step of determining the target feature map of the target image from the up-sampled fusion feature maps at the respective scales may include:
determining the up-sampled fusion feature map with the largest scale as the target feature map of the target image.
Optionally, in this embodiment, the step of performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain a body key region of the target image may include:
performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain at least one candidate body key region corresponding to the target feature map;
and determining the body key region of the target image from the candidate body key regions according to the similarity between each candidate body key region and the preset key part template image.
For example, the preset key part template image can identify the head-and-shoulder region of the target object; it may be regarded as a standard image containing head-and-shoulder information, i.e., an image corresponding to a specified head-and-shoulder region. In some embodiments, the preset key part template image may be scaled at different scales to obtain preset key part template images at multiple scales. The scaling scales may be set according to actual conditions, which this embodiment does not limit.
In some embodiments, candidate body key regions whose similarity to the preset key part template image is greater than a preset value may be selected as the body key regions of the target image, where the preset value may be set according to actual conditions; in other embodiments, the candidate body key regions may be ranked by similarity, for example from large to small, and the top N ranked candidates taken as the body key regions of the target image.
The step of performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain at least one candidate body key region corresponding to the target feature map may specifically include: sliding the preset key part template image over the target feature map of the target image, i.e., traversing the target feature map, to obtain at least one candidate body key region.
Optionally, in this embodiment, the data of the first channel of the feature map may be decoded, and non-maximum suppression applied to all candidate body key regions that satisfy the condition; the candidate regions remaining after non-maximum suppression are taken as the body key regions of the target image, and their corresponding coordinates are output. In addition, the horizontal and vertical coordinates of the body key regions can be restored to the size of the corresponding regions in the original target image according to the scale ratio between the original size of the target image and the 640 x 384 network input.
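The greedy non-maximum suppression and coordinate restoration just described might be sketched as follows; the IoU threshold is an assumed value and the helper names are hypothetical:

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.45):
    """Keep the highest-scoring box among mutually overlapping candidates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if box_iou(boxes[i], boxes[j]) < iou_threshold]
    return keep

def restore_coords(box, orig_w, orig_h, net_w=640, net_h=384):
    """Map a box from 640 x 384 network coordinates back to the original image."""
    sx, sy = orig_w / net_w, orig_h / net_h
    return (box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy)
```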
In this embodiment, the neural network model for detecting the body key parts may be provided to the gesture recognition device after being trained by other devices, or may be trained by the gesture recognition device itself.
If the gesture recognition device performs the training by itself, the gesture recognition method may further include:
acquiring training data, wherein the training data comprises a sample image and label information of the body key region in the sample image;
extracting feature maps at multiple scales from the sample image through a body key part detection model;
up-sampling the feature maps at the multiple scales to obtain a target feature map of the sample image;
performing body key part detection on the target feature map of the sample image according to a preset key part template image to obtain the actual body key region of the sample image;
and adjusting parameters of the body key part detection model according to the actual body key region of the sample image and the corresponding label information to obtain the trained body key part detection model.
The sample images may include visible light images and infrared light images containing human body key parts in multiple scenes; the label information of the body key region in a sample image may specifically refer to the annotations of the body key parts of all persons in the sample image, such as position annotations.
Specifically, before the step "extracting feature maps at multiple scales from the sample image through a body key part detection model", the sample image and its corresponding label information may be preprocessed, for example with data augmentation (e.g., Mosaic augmentation), random horizontal flipping, random rotation within a range of ±15 degrees, random scaling and cropping, and random enhancement of the image's color, brightness, saturation, and contrast.
Referring to Fig. 1c, during training the sample image and its body key region annotations are obtained first; body key part detection is performed on the sample image in a forward propagation pass to obtain the actual body key region of the sample image, and the loss value between the actual body key region and the expected body key region annotation is computed to check whether it satisfies the training stop condition, which may be set according to the actual situation: for example, training may stop when the loss value is smaller than a preset value. If the stop condition is not satisfied, the parameters of the body key part detection model are adjusted with a back-propagation algorithm and optimized according to the loss between the actual body key region and the expected annotation, until that loss is smaller than the preset value; the trained body key part detection model is then obtained and its parameters saved.
Specifically, in some embodiments, the feature map of the first channel may be extracted from the feature maps output by the body key part detection model; a first loss value between the first-channel feature map and the center map is computed, a second loss value between the actual body key region and the expected body key region annotation is obtained, and the parameters of the body key part detection model are optimized according to the first and second loss values.
The center map is a Gaussian response; if there are multiple persons (i.e., multiple target objects) in the sample image, the center map may be used to indicate the position of the object currently being processed by the body key part detection model.
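A sketch of how a Gaussian-response center map of the kind described could be built; the standard deviation is an assumed hyperparameter:

```python
import numpy as np

def gaussian_center_map(height, width, centers, sigma=4.0):
    """Build a center map with a 2-D Gaussian peak at each object center.

    centers: iterable of (cx, cy) pixel coordinates; overlapping peaks
    are merged with an element-wise maximum.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    heatmap = np.zeros((height, width), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap
```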
The objective function for the loss value corresponding to the bounding-box position (i.e., the second loss value) may use GIoU, as shown in formula (2):

GIoU = IoU - (A_c - U) / A_c        (2)

where IoU denotes the ratio of the intersection to the union of the annotated real box and the predicted box, A_c denotes the minimum closure area of the real box and the predicted box, U denotes the union of the real box and the predicted box, and GIoU denotes the loss term between the real box and the predicted box. The minimum closure area can be understood as the smallest box that contains both the predicted box and the real box. The real box corresponds to the expected body key region annotation (i.e., the expected body key region) of the above embodiment, and the predicted box corresponds to the actual body key region.
The objective function for the loss value may use binary cross entropy (BCE).
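Formula (2) under the definitions above could be computed as in this sketch; returning 1 - GIoU as the loss term is a common convention assumed here, not something the patent states:

```python
def giou_loss(pred, target):
    """GIoU loss for two (x1, y1, x2, y2) boxes: 1 - (IoU - (A_c - U) / A_c)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter            # U
    iou_val = inter / (union + 1e-9)
    # A_c: area of the minimum closure (smallest box containing both boxes)
    cx1, cy1 = min(pred[0], target[0]), min(pred[1], target[1])
    cx2, cy2 = max(pred[2], target[2]), max(pred[3], target[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou_val - (area_c - union) / (area_c + 1e-9)
    return 1.0 - giou
```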
105. Perform gesture recognition on the body key region to obtain the gesture recognition result of the target object.
In this embodiment, the body key region is recognized first, and gesture recognition is then performed within it; this allows the gesture of the target object to be detected better and avoids the situation in which, during long-distance shooting, the gesture region of the target object occupies too small a part of the captured target image to be detected.
In some embodiments, a pre-trained neural network model may be used to perform gesture recognition on the body key region. The neural network model may be MobileNetV2, Inception, EfficientNet, VGGNet (Visual Geometry Group Network), ResNet (Residual Network), DenseNet (Densely Connected Convolutional Network), and the like; it should be understood that the neural network model of this embodiment is not limited to the types listed above.
Optionally, in this embodiment, the step of "performing gesture recognition on the body key region to obtain a gesture recognition result of the target object" may include:
expanding the body key region to obtain a key region expansion image;
performing multi-scale feature extraction on the key region expansion image to obtain feature maps at multiple scales corresponding to the key region expansion image;
up-sampling the feature maps at the multiple scales to obtain a target feature map of the key region expansion image;
and performing gesture recognition on the target feature map of the key region expansion image according to a preset gesture template image to obtain a target gesture region and a gesture category corresponding to the target object.
Expanding the body key region may specifically mean extending it by a factor of 1 upward, downward, leftward, and rightward to obtain the key region expansion image.
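A minimal sketch of this expansion, with clipping to the image bounds added as an assumption (the patent does not spell out boundary handling):

```python
def expand_region(box, img_w, img_h, factor=1.0):
    """Extend a body key region by `factor` x its size in each direction.

    With factor = 1, a w x h box grows to 3w x 3h (one extra box width on
    the left and right, one extra box height above and below), clipped to
    the image bounds.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (max(0, x1 - factor * w), max(0, y1 - factor * h),
            min(img_w, x2 + factor * w), min(img_h, y2 + factor * h))
```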
Before multi-scale feature extraction is performed on the key region expansion image, the pixel values of its pixel points may first be normalized to obtain a normalized key region expansion image; the normalized image may then be scaled, for example to a width of 256 pixels and a height of 256 pixels, and gesture recognition performed on the scaled key region expansion image.
For example, MobileNetV2 contains four residual group structures, from which feature maps of four sizes can be obtained.
In some embodiments, each feature map may be up-sampled by a factor of 2 using a pyramid feature fusion method and then spliced channel-wise with the corresponding feature map of the next lower group.
Specifically, the step of "up-sampling the feature maps at the multiple scales to obtain the target feature map of the key region expansion image" may include:
up-sampling the feature map at the target scale to obtain up-sampled fusion feature maps of the key region expansion image at multiple scales, wherein the up-sampling input at each scale is the fusion feature obtained by fusing the up-sampled feature map of the adjacent scale with the feature map at that scale;
and determining the target feature map of the key region expansion image from the up-sampled fusion feature maps at the respective scales.
The target scale is the lowest of the multiple scales. As above, the up-sampling input at each scale is the up-sampled fusion feature map of the adjacent scale, i.e., the fusion feature obtained by fusing the up-sampled feature map of the adjacent scale with the feature map at that scale, and the up-sampled feature map at each scale is obtained by up-sampling the up-sampled fusion feature map of the adjacent scale.
There are various fusion methods: for example, the up-sampled feature map can be spliced with the feature map at the same scale, or the pixels of the up-sampled feature map and the corresponding feature map at the same scale can be added. It is understood that the fusion method is not limited to the above examples, and this embodiment does not limit it.
The step of determining the target feature map of the key region expansion image from the up-sampled fusion feature maps at the respective scales may include:
determining the up-sampled fusion feature map with the largest scale as the target feature map of the key region expansion image.
Optionally, in this embodiment, the step of performing gesture recognition on the target feature map of the key region expansion image according to a preset gesture template image to obtain a target gesture region and a gesture category corresponding to the target object may include:
performing gesture recognition on the target feature map of the key region expansion image by sliding a preset gesture template image to obtain at least one candidate gesture region corresponding to the target feature map;
determining a target gesture region from the candidate gesture regions according to the similarity between the preset gesture template image and each candidate gesture region;
and determining the gesture category recognized from the target gesture region as the gesture category corresponding to the target object.
The preset gesture template image can be used to identify a gesture region in the target feature map; it may be regarded as a standard image containing gesture information, i.e., an image corresponding to a specified gesture region. In some embodiments, the preset gesture template image may be scaled at different scales to obtain preset gesture template images at multiple scales. The scaling scales may be set according to actual conditions, which this embodiment does not limit.
In some embodiments, candidate gesture regions whose similarity to the preset gesture template image is greater than a preset value may be selected as target gesture regions of the target feature map, where the preset value may be set according to actual conditions; in other embodiments, the candidate gesture regions may be ranked by similarity, for example from large to small, and the top N ranked candidates taken as the target gesture regions of the target feature map.
The step of performing gesture recognition on the target feature map of the key region expansion image by sliding a preset gesture template image to obtain at least one candidate gesture region may specifically include: sliding the preset gesture template image over the target feature map of the key region expansion image, i.e., traversing the target feature map, to obtain at least one candidate gesture region.
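A sketch of the sliding traversal, using normalized cross-correlation as an assumed similarity measure (the patent leaves the measure open) and a single-channel feature map for simplicity:

```python
import numpy as np

def slide_template(feature_map, template, stride=1):
    """Traverse a 2-D feature map with a template and score each window.

    Returns ((x, y, w, h), similarity) pairs; the best-scoring windows can
    then be kept as candidate gesture regions.
    """
    th, tw = template.shape
    fh, fw = feature_map.shape
    t = (template - template.mean()) / (template.std() + 1e-9)
    candidates = []
    for y in range(0, fh - th + 1, stride):
        for x in range(0, fw - tw + 1, stride):
            win = feature_map[y:y + th, x:x + tw]
            w = (win - win.mean()) / (win.std() + 1e-9)
            candidates.append(((x, y, tw, th), float((w * t).mean())))
    return candidates
```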
Optionally, in this embodiment, the data of the first channel of the feature map may be decoded, and non-maximum suppression applied to all candidate gesture regions that satisfy the condition; the candidate regions remaining after non-maximum suppression are taken as the target gesture regions of the target image, and their corresponding coordinates are output. In addition, the horizontal and vertical coordinates of the target gesture regions can be restored to the size of the corresponding gesture regions in the original target image according to the scale ratio between the original size of the target image and the 256 x 256 network input.
In this embodiment, the neural network model for gesture recognition may be provided to the gesture recognition apparatus after being trained by other devices, or may be trained by the gesture recognition apparatus itself.
If the gesture recognition device performs the training by itself, the gesture recognition method may further include:
acquiring training data, wherein the training data comprises a sample image and label information of the gesture region in the sample image;
extracting feature maps at multiple scales from the sample image through a gesture recognition model;
up-sampling the feature maps at the multiple scales to obtain a target feature map of the sample image;
performing gesture recognition on the target feature map of the sample image according to a preset gesture template image to obtain the actual gesture region of the sample image;
and adjusting parameters of the gesture recognition model according to the actual gesture region of the sample image and the corresponding label information to obtain the trained gesture recognition model.
The sample images may include visible light images and infrared light images containing human gestures in multiple scenes; the label information of the gesture regions in a sample image may specifically refer to the gesture box annotations of all persons in the sample image, such as position annotations.
Specifically, before the step "extracting feature maps at multiple scales from the sample image through a gesture recognition model", the sample image and its corresponding label information may be preprocessed, for example with data augmentation (e.g., Mosaic augmentation), random horizontal flipping, random rotation within a range of ±15 degrees, random scaling and cropping, and random enhancement of the image's color, brightness, saturation, and contrast, which this embodiment does not limit; finally, the preprocessed sample image is scaled to 256 x 256 pixels.
During training, the actual gesture region of the sample image is computed first; the parameters of the gesture recognition model are then adjusted with a back-propagation algorithm and optimized according to the loss value between the actual gesture region and the expected gesture box annotation, until that loss value is smaller than a preset value, yielding the trained gesture recognition model.
Specifically, in some embodiments, the feature map of the first channel may be extracted from the feature maps output by the gesture recognition model; a first loss value between the first-channel feature map and the center map is computed, a second loss value between the actual gesture region and the expected gesture box annotation is obtained, and the parameters of the gesture recognition model are optimized according to the first and second loss values.
The center map is a Gaussian response; if there are multiple persons (i.e., multiple target objects) in the sample image, the center map may be used to indicate the position of the object currently being processed by the gesture recognition model.
The objective function for the loss value corresponding to the bounding-box position (i.e., the second loss value) may use GIoU, as shown in formula (3):

GIoU = IoU - (A_c - U) / A_c        (3)

where IoU denotes the ratio of the intersection to the union of the annotated real box and the predicted box, A_c denotes the minimum closure area of the real box and the predicted box, U denotes the union of the real box and the predicted box, and GIoU denotes the loss term between the real box and the predicted box. The minimum closure area can be understood as the smallest box that contains both the predicted box and the real box. The real box corresponds to the expected gesture box annotation (i.e., the expected gesture region) of the above embodiment, and the predicted box corresponds to the actual gesture region.
The objective function for the loss value may use binary cross entropy (BCE).
Optionally, the application may provide a long-distance gesture recognition system based on visible light images and infrared light images, which may include an image acquisition module, an image preprocessing module, a dim-light judgment module, a head-and-shoulder module, and a gesture recognition module, described specifically as follows:
the image acquisition module is used to obtain the images to be gesture-recognized (including the visible light image and the infrared light image) captured by the camera;
the image preprocessing module is used to perform preprocessing operations such as normalization and scaling on the images captured by the camera;
the dim-light judgment module is used to input the preprocessed visible light image into a dim-light detection algorithm, which outputs the confidence of a dim-light environment; according to this confidence, the module determines whether gesture recognition uses the infrared light image or the visible light image;
the head-and-shoulder module is used to input the selected target image into a pre-trained neural network model and to predict the coordinates of the head-and-shoulder boxes of all persons in the target image through a post-processing algorithm matched to the pre-trained model;
and the gesture recognition module is used to input the head-and-shoulder region of the target image into a pre-trained neural network model and to predict the coordinates and categories of all gesture boxes in the head-and-shoulder region through a post-processing algorithm matched to the pre-trained model.
The application can judge through the dim-light judgment module whether the current scene is a dim-light scene; if not, gesture recognition is performed based on the visible light image, and if so, the system switches to the IR gesture recognition algorithm, i.e., gesture recognition based on the infrared light image. This overcomes the shortcoming that current industry gesture algorithms cannot be used in dim-light environments: by detecting the lighting environment with the dim-light judgment module and automatically switching between the RGB-based and IR-based gesture recognition algorithms, the application can run stably in complex dim-light scenes.
The application can detect the body key region through a body key part detection algorithm and extend its side length by a factor of 1 upward, downward, leftward, and rightward to obtain the key region expansion image; the key region expansion image is preprocessed, and the preprocessing result is used as the input picture of the gesture recognition algorithm. Compared with commonly used gesture recognition algorithms, whose accuracy and detection rate suffer because the hand target occupies too small a proportion of the full image, this processing logic greatly increases the proportion of the hand target in the input image, directly improving accuracy and detection rate; the detection distance can reach 6 meters. Moreover, the cropped key region expansion image is far smaller than the full image, so using it as the input of the gesture recognition algorithm requires less computing power and runs faster.
The application can perform long-distance gesture recognition based on the RGB full image and the IR full image, predict the gesture category of each person in the picture with a deep-learning-based gesture recognition algorithm, and run quickly, accurately, and stably on an embedded terminal or a server, thereby supporting further behavior recognition or human-computer interaction and meeting different requirements in scenarios such as smart home appliances and game interaction.
In a specific scene, as shown in fig. 1d, a visible light image and an infrared light image of the current frame (the image to be gesture-recognized) are collected by a camera; the image to be gesture-recognized is preprocessed, for example by normalization and scaling; the confidence that the current scene belongs to a dim light scene is computed by the dim light detection algorithm, and if the light is good and the scene is a non-dim-light scene, the visible light image is taken as the target image, whereas if the scene is a dim light scene, the infrared light image is taken as the target image. Body key part detection is then performed on the preprocessed target image to obtain body key areas; each body key area (specifically, a head and shoulder box) is extended upward and downward by 1 time of the box height and leftward and rightward by 1 time of the box width, and a sub-image is cropped from the original image (the target image) using the coordinates of the expanded box, yielding a key area expansion image that serves as the input image of the gesture recognition algorithm. The number of body key areas determines the number of times the gesture recognition algorithm is invoked.
For each body key area, a corresponding key area expansion image can be acquired, and gesture recognition is performed on each key area expansion image to obtain the corresponding gesture areas (namely, gesture boxes). All detected gesture boxes are then merged, gesture boxes detected repeatedly due to overlapping key area expansion images are filtered out, and the filtered gesture boxes are taken as the gesture recognition result of the current frame.
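The description above does not fix the rule used to filter repeatedly detected gesture boxes; a common choice is de-duplication by intersection-over-union (IoU), sketched below under that assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def merge_gesture_boxes(boxes, scores, iou_threshold=0.5):
    """Greedy de-duplication: keep the highest-scoring box among
    near-duplicates. The threshold is an illustrative assumption."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```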
As can be seen from the above, the present embodiment may acquire an image to be gesture-recognized for a target object, where the image to be gesture-recognized includes a visible light image and an infrared light image; carry out dark light detection on the visible light image to obtain a dark light detection result; select a target image from the image to be gesture-recognized according to the dim light detection result; perform body key part detection on the target image to obtain a body key area of the target image; and perform gesture recognition on the body key area to obtain a gesture recognition result of the target object. According to the method and the device, the target image for gesture recognition can be selected according to the dim light detection result, and gesture recognition is then carried out on the target image, so that the detection effect in a dim light environment is improved and the applicability is enhanced.
The method described in the foregoing embodiment will be described in further detail below by way of example in which the gesture recognition apparatus is specifically integrated in a server.
The embodiment of the application provides a gesture recognition method, as shown in fig. 2, a specific process of the gesture recognition method may be as follows:
201. A server acquires an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image.
The target object is an object which needs gesture recognition, and the number of the target objects can be one or more; the image to be gesture-recognized may specifically be an image containing gesture information.
In some specific scenes, the gestures of human bodies in a captured video stream need to be recognized; by recognizing the human gestures in each frame of the video stream, human behavior can be judged, which can be widely applied to various intelligent household devices. In addition, human-computer interaction can be carried out through human gesture recognition, and various human-computer interaction application programs can be developed on this basis. However, in the current related art, gesture recognition algorithms have a poor detection effect at long range or in dark scenes, and their high requirements on lighting limit the application of human gesture recognition. Aiming at these problems, the application combines infrared images to provide a real-time remote gesture recognition method that is not limited by lighting conditions and can quickly and accurately recognize the gestures of multiple persons on an embedded terminal or a server, thereby realizing all-weather, all-scene human-computer interaction.
202. The server performs dark light detection on the visible light image to obtain a dark light detection result.
Specifically, the dark light detection may analyze the brightness of the pixel points in the visible light image; for example, the average brightness value of the pixel points in the visible light image may be calculated and used as the dark light detection result.
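As a minimal sketch of this step, the average brightness of a visible light image can be computed as follows; converting a color image to a single brightness channel by averaging over channels is an assumption for illustration.

```python
import numpy as np

def dark_light_detect(visible):
    """Return the average pixel brightness of the visible light image,
    used here as the dark light detection result."""
    # For a color image, average over the channel axis first (assumption).
    gray = visible.mean(axis=2) if visible.ndim == 3 else visible
    return float(gray.mean())
```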
Optionally, in this embodiment, before the step of "performing dark light detection on the visible light image to obtain a dark light detection result", the method may further include:
zooming the image to be gesture-recognized to obtain a zoomed image to be gesture-recognized;
and normalizing the pixel values of the pixel points in the zoomed image to be gesture-recognized to obtain a normalized image to be gesture-recognized.
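A minimal sketch of these two preprocessing operations, assuming an OpenCV-style image array, a hypothetical 320x320 target size, and normalization to the [0, 1] range:

```python
import cv2
import numpy as np

def preprocess_image(image, size=(320, 320)):
    """Zoom the image to a fixed size, then normalize its pixel values."""
    scaled = cv2.resize(image, size)          # zooming step
    return scaled.astype(np.float32) / 255.0  # normalization step
```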
203. When the dim light detection result shows that the confidence of the dim light scene is smaller than a preset value, the server determines the visible light image in the image to be gesture-recognized as the target image; and when the dim light detection result shows that the confidence of the dim light scene is not smaller than the preset value, the server determines the infrared light image in the image to be gesture-recognized as the target image.
The preset value may be set according to actual conditions, which is not limited in this embodiment. Specifically, the dim light detection result may specifically include a brightness average value of a pixel point in the visible light image, where the brightness average value may be used to represent a confidence of a dim light scene, and the higher the brightness average value is, the lower the confidence of the dim light scene is, and conversely, the lower the brightness average value is, the higher the confidence of the dim light scene is.
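Under the inverse relation described here, the brightness average can be mapped to a dark-scene confidence; the linear mapping below is an illustrative assumption, not a formula given in this application.

```python
def dark_scene_confidence(brightness_mean, max_brightness=255.0):
    """Higher average brightness yields lower dark-scene confidence."""
    return 1.0 - brightness_mean / max_brightness

# Example: with a preset value of 0.5, a frame with mean brightness 40
# gives confidence of about 0.84, so the infrared image would be selected.
```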
204. The server performs body key part detection on the target image to obtain a body key area of the target image.
Optionally, in this embodiment, the step of "performing body key part detection on the target image to obtain a body key region of the target image" may include:
extracting feature maps under a plurality of scales from the target image;
performing up-sampling processing on the feature maps under the multiple scales to obtain a target feature map of the target image;
and performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain a body key area of the target image.
The feature maps of the target image at multiple scales may be extracted using the convolutional part of a neural network model. For example, MobileNetV2 includes four residual group structures, from which feature maps of four sizes can be obtained.
In some embodiments, each feature map may be up-sampled by a factor of 2 using a pyramid feature fusion method and then sequentially concatenated with the corresponding lower-level feature map along the channel dimension.
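For illustration, the following PyTorch sketch mimics this scheme with a toy four-stage backbone standing in for MobileNetV2's four residual groups (the layer sizes are assumptions); each map is up-sampled by a factor of 2 and concatenated with the next-lower-level map along the channel dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPyramidFusion(nn.Module):
    """Toy stand-in for the described scheme; not the actual backbone."""
    def __init__(self):
        super().__init__()
        # Four strided stages, each halving the spatial size (assumption).
        self.stages = nn.ModuleList([
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
            for c_in, c_out in [(3, 16), (16, 32), (32, 64), (64, 128)]
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # feature maps at four scales
        fused = feats[-1]
        for lower in reversed(feats[:-1]):
            # Up-sample by 2x, then splice with the lower-level map by channels.
            up = F.interpolate(fused, scale_factor=2, mode="nearest")
            fused = torch.cat([up, lower], dim=1)
        return fused

# Example: a 3x256x256 input yields a fused map at the largest feature scale.
fused = TinyPyramidFusion()(torch.randn(1, 3, 256, 256))
```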
Optionally, in this embodiment, the step of performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain a body key region of the target image may include:
detecting body key parts of a target feature map of the target image according to a preset key part template image to obtain at least one candidate body key area corresponding to the target feature map;
and determining the body key area of the target image from the candidate body key areas according to the similarity between each candidate body key area and the preset key part template image.
For example, the preset key part template image can identify the head and shoulder area of the target object and may be regarded as a standard image containing head and shoulder information, where the standard image is an image corresponding to a specified head and shoulder area. In some embodiments, the preset key part template image may be scaled at different scales to obtain preset key part template images at multiple scales. The scaling scales may be set according to actual conditions, which is not limited in this embodiment.
In some embodiments, candidate body key regions whose similarity to the preset key part template image is greater than a preset value may be selected as the body key regions of the target image, where the preset value may be set according to actual conditions; in other embodiments, the candidate body key regions may also be ranked by similarity, for example from large to small, and the top N candidate body key regions in the ranked list are then taken as the body key regions of the target image.
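Both selection rules described here (similarity threshold and top-N after ranking) can be sketched as follows; the threshold and N values are illustrative assumptions.

```python
import numpy as np

def select_key_regions(candidates, similarities, top_n=3, min_similarity=None):
    """Rank candidate body key regions by similarity to the preset key
    part template image, large to small, optionally drop those at or below
    a preset value, and keep the top N."""
    order = list(np.argsort(similarities)[::-1])  # large to small
    if min_similarity is not None:
        order = [i for i in order if similarities[i] > min_similarity]
    return [candidates[i] for i in order[:top_n]]
```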
205. The server performs gesture recognition on the body key area to obtain a gesture recognition result of the target object.
In this embodiment, the body key area is recognized first, and gesture recognition is then performed on the basis of the body key area, so that the gesture of the target object can be better detected, avoiding the situation in long-distance shooting where the gesture cannot be detected because the gesture area of the target object in the acquired target image is too small.
Optionally, in this embodiment, the step of "performing gesture recognition on the key body region to obtain a gesture recognition result of the target object" may include:
carrying out expansion processing on the key area of the body to obtain a key area expansion image;
performing multi-scale feature extraction on the key area expansion image to obtain feature maps under multiple scales corresponding to the key area expansion image;
performing up-sampling processing on the feature maps under the multiple scales to obtain a target feature map of the key area expansion image;
and performing gesture recognition on the target feature map of the key area expansion image according to a preset gesture template image to obtain a target gesture area and a gesture category corresponding to the target object.
Specifically, the body key area may be expanded by 1 time of its side length in each of the upward, downward, leftward and rightward directions to obtain the key area expansion image.
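A minimal sketch of the expansion step, assuming boxes in (x1, y1, x2, y2) pixel coordinates and clipping to the image boundary (the clipping behavior is an assumption; the description does not address boxes near the image edge):

```python
def expand_key_region(box, image_w, image_h):
    """Extend a body key area by 1 time of the box height upward and
    downward and 1 time of the box width leftward and rightward."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (max(0, x1 - w), max(0, y1 - h),
            min(image_w, x2 + w), min(image_h, y2 + h))

# Example: a 60x80 head-and-shoulder box becomes (up to clipping) a
# 180x240 key area expansion image cropped from the target image.
```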
For example, MobileNetV2 includes four residual group structures, from which feature maps of four sizes can be obtained.
In some embodiments, each feature map may be up-sampled by a factor of 2 using a pyramid feature fusion method and then sequentially concatenated with the corresponding lower-level feature map along the channel dimension, as in the body key part detection step.
Optionally, in this embodiment, the step of performing gesture recognition on the target feature map of the key region expansion image according to a preset gesture template image to obtain a target gesture region and a gesture category corresponding to the target object may include:
performing gesture recognition on a target feature map of the key region expansion image through a sliding preset gesture template image to obtain at least one candidate gesture region corresponding to the target feature map;
determining a target gesture area from each candidate gesture area according to the similarity between the preset gesture template image and each candidate gesture area;
and determining the gesture category recognized from the target gesture area as the gesture category corresponding to the target object.
The preset gesture template image can be used for identifying a gesture area in the target feature map, and can be regarded as a standard image containing gesture information, wherein the standard image is an image corresponding to a specified gesture area. In some embodiments, the preset gesture template image may be scaled in different scales to obtain preset gesture template images in multiple scales. The scaling scale may be set according to actual conditions, which is not limited in this embodiment.
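For illustration, sliding a preset gesture template over a target feature map and scoring each window by similarity might look as follows; cosine similarity, a 2-D feature map, and a single template scale are assumptions, since the description leaves the similarity measure open and allows multiple template scales.

```python
import numpy as np

def slide_gesture_template(feature_map, template, stride=1):
    """Slide the template over a 2-D feature map (a simplification) and
    return the window with the highest cosine similarity as the candidate
    target gesture region, in (x1, y1, x2, y2) form."""
    fh, fw = feature_map.shape
    th, tw = template.shape
    t = template.ravel()
    t = t / (np.linalg.norm(t) + 1e-8)
    best_box, best_score = None, -1.0
    for y in range(0, fh - th + 1, stride):
        for x in range(0, fw - tw + 1, stride):
            win = feature_map[y:y + th, x:x + tw].ravel()
            score = float(win @ t) / (np.linalg.norm(win) + 1e-8)
            if score > best_score:
                best_box, best_score = (x, y, x + tw, y + th), score
    return best_box, best_score
```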
As can be seen from the above, in this embodiment, the server may acquire an image to be gesture-recognized for a target object, where the image to be gesture-recognized includes a visible light image and an infrared light image; carry out dark light detection on the visible light image to obtain a dark light detection result; determine the visible light image in the image to be gesture-recognized as the target image when the dim light detection result shows that the confidence of the dim light scene is smaller than a preset value, and determine the infrared light image in the image to be gesture-recognized as the target image when the confidence is not smaller than the preset value; perform body key part detection on the target image to obtain a body key area of the target image; and perform gesture recognition on the body key area to obtain a gesture recognition result of the target object. According to the embodiment of the application, the target image for gesture recognition can be selected according to the dim light detection result, and gesture recognition is then carried out on the target image, so that the detection effect in a dim light environment is improved and the applicability is enhanced.
In order to better implement the above method, an embodiment of the present application further provides a gesture recognition apparatus, as shown in fig. 3, the gesture recognition apparatus may include an obtaining unit 301, a dim light detecting unit 302, a selecting unit 303, a body key part detecting unit 304, and a gesture recognition unit 305, as follows:
(1) an acquisition unit 301;
the acquisition unit is used for acquiring an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image.
(2) A dark light detection unit 302;
and the dark light detection unit is used for carrying out dark light detection on the visible light image to obtain a dark light detection result.
(3) A selecting unit 303;
and the selecting unit is used for selecting a target image from the image to be gesture-recognized according to the dim light detection result.
Optionally, in some embodiments of the present application, the selecting unit may include a first determining subunit and a second determining subunit, as follows:
the first determining subunit is configured to determine, when the dim light detection result indicates that the confidence of the dim light scene is smaller than a preset value, the visible light image in the image to be gesture-recognized as the target image;
and the second determining subunit is configured to determine, when the dim light detection result indicates that the confidence of the dim light scene is not smaller than the preset value, the infrared light image in the image to be gesture-recognized as the target image.
Optionally, in some embodiments of the present application, the gesture recognition apparatus further includes a preprocessing unit, as follows:
the preprocessing unit is used for zooming the image to be gesture-recognized to obtain a zoomed image to be gesture-recognized, and normalizing the pixel values of the pixel points in the zoomed image to be gesture-recognized to obtain a normalized image to be gesture-recognized.
(4) A body critical part detection unit 304;
and the body key part detection unit is used for detecting the body key parts of the target image to obtain the body key area of the target image.
Optionally, in some embodiments of the present application, the body key part detection unit may include a first extraction subunit, a first upsampling subunit, and a detection subunit, as follows:
the first extraction subunit is configured to extract feature maps under multiple scales from the target image;
the first up-sampling subunit is configured to perform up-sampling processing on the feature maps at the multiple scales to obtain a target feature map of the target image;
and the detection subunit is used for detecting the key body part of the target feature map of the target image according to a preset key part template image to obtain the key body area of the target image.
(5) A gesture recognition unit 305;
and the gesture recognition unit is used for performing gesture recognition on the body key area to obtain a gesture recognition result of the target object.
Optionally, in some embodiments of the present application, the gesture recognition unit may include an extension subunit, a second extraction subunit, a second upsampling subunit, and a recognition subunit, as follows:
the expansion subunit is used for performing expansion processing on the key region of the body to obtain an expansion image of the key region;
the second extraction subunit is used for performing multi-scale feature extraction on the key region expansion image to obtain feature maps under multiple scales corresponding to the key region expansion image;
the second up-sampling subunit is used for performing up-sampling processing on the feature maps under the multiple scales to obtain a target feature map of the key area expansion image;
and the recognition subunit is used for performing gesture recognition on the target feature map of the key area expansion image according to a preset gesture template image to obtain a target gesture area and a gesture category corresponding to the target object.
Optionally, in some embodiments of the application, the recognition subunit may be specifically configured to perform gesture recognition on a target feature map of the key region extension image through a sliding preset gesture template image, so as to obtain at least one candidate gesture region corresponding to the target feature map; determining a target gesture area from each candidate gesture area according to the similarity between the preset gesture template image and each candidate gesture area; and determining the gesture category recognized from the target gesture area as the gesture category corresponding to the target object.
As can be seen from the above, the present embodiment may acquire an image to be gesture-recognized for a target object by the acquisition unit 301, where the image to be gesture-recognized includes a visible light image and an infrared light image; perform dark light detection on the visible light image through the dark light detection unit 302 to obtain a dark light detection result; select a target image from the image to be gesture-recognized according to the dim light detection result through the selecting unit 303; perform body key part detection on the target image by the body key part detection unit 304 to obtain a body key area of the target image; and perform gesture recognition on the body key area through the gesture recognition unit 305 to obtain a gesture recognition result of the target object. According to the embodiment of the application, the target image for gesture recognition can be selected according to the dim light detection result, and gesture recognition is then carried out on the target image, so that the detection effect in a dim light environment is improved and the applicability is enhanced.
An electronic device according to an embodiment of the present application is further provided, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the present application, where the electronic device may be a terminal or a server, and specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image; carrying out dark light detection on the visible light image to obtain a dark light detection result; selecting a target image from the image to be gesture-recognized according to the dim light detection result; performing body key part detection on the target image to obtain a body key area of the target image; and performing gesture recognition on the body key area to obtain a gesture recognition result of the target object.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, the present embodiment may acquire an image to be gesture-recognized for a target object, where the image to be gesture-recognized includes a visible light image and an infrared light image; carry out dark light detection on the visible light image to obtain a dark light detection result; select a target image from the image to be gesture-recognized according to the dim light detection result; perform body key part detection on the target image to obtain a body key area of the target image; and perform gesture recognition on the body key area to obtain a gesture recognition result of the target object. According to the embodiment of the application, the target image for gesture recognition can be selected according to the dim light detection result, and gesture recognition is then carried out on the target image, so that the detection effect in a dim light environment is improved and the applicability is enhanced.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the gesture recognition methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image; carrying out dark light detection on the visible light image to obtain a dark light detection result; selecting a target image from the image to be gesture-recognized according to the dim light detection result; performing body key part detection on the target image to obtain a body key area of the target image; and performing gesture recognition on the body key area to obtain a gesture recognition result of the target object.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium may execute the steps in any gesture recognition method provided in the embodiments of the present application, beneficial effects that can be achieved by any gesture recognition method provided in the embodiments of the present application may be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the gesture recognition aspect described above.
The gesture recognition method and the related device provided by the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A gesture recognition method, comprising:
acquiring an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image;
carrying out dark light detection on the visible light image to obtain a dark light detection result;
selecting a target image from the image to be gesture-recognized according to the dim light detection result;
detecting key body parts of the target image to obtain a key body area of the target image;
and performing gesture recognition on the key area of the body to obtain a gesture recognition result of the target object.
2. The method according to claim 1, wherein the selecting a target image from the image to be gesture-recognized according to the dim light detection result comprises:
when the dim light detection result indicates that the confidence of the dim light scene is smaller than a preset value, determining the visible light image in the image to be gesture-recognized as the target image;
and when the dim light detection result indicates that the confidence of the dim light scene is not smaller than the preset value, determining the infrared light image in the image to be gesture-recognized as the target image.
3. The method of claim 1, wherein before performing the dark light detection on the visible light image to obtain the dark light detection result, the method further comprises:
zooming the image to be gesture-recognized to obtain a zoomed image to be gesture-recognized;
and normalizing the pixel values of the pixel points in the zoomed image to be gesture-recognized to obtain a normalized image to be gesture-recognized.
4. The method according to claim 1, wherein the performing body key part detection on the target image to obtain body key areas of the target image comprises:
extracting feature maps under a plurality of scales from the target image;
performing up-sampling processing on the feature maps under the multiple scales to obtain a target feature map of the target image;
and performing body key part detection on the target feature map of the target image according to a preset key part template image to obtain a body key area of the target image.
5. The method according to claim 1, wherein the gesture recognition of the body key region to obtain the gesture recognition result of the target object comprises:
carrying out expansion processing on the key area of the body to obtain a key area expansion image;
performing multi-scale feature extraction on the key region expansion image to obtain feature maps under multiple scales corresponding to the key region expansion image;
performing up-sampling processing on the feature maps under the multiple scales to obtain a target feature map of the key area expansion image;
and performing gesture recognition on the target feature map of the key area expansion image according to a preset gesture template image to obtain a target gesture area and a gesture category corresponding to the target object.
6. The method according to claim 5, wherein the performing gesture recognition on the target feature map of the key region expansion image according to a preset gesture template image to obtain a target gesture region and a gesture category corresponding to the target object includes:
performing gesture recognition on a target feature map of the key region expansion image through a sliding preset gesture template image to obtain at least one candidate gesture region corresponding to the target feature map;
determining a target gesture area from each candidate gesture area according to the similarity between the preset gesture template image and each candidate gesture area;
and determining the gesture category recognized from the target gesture area as the gesture category corresponding to the target object.
7. A gesture recognition apparatus, comprising:
the acquisition unit is used for acquiring an image to be gesture-recognized for a target object, wherein the image to be gesture-recognized comprises a visible light image and an infrared light image;
the dark light detection unit is used for carrying out dark light detection on the visible light image to obtain a dark light detection result;
the selecting unit is used for selecting a target image from the image to be gesture-recognized according to the dim light detection result;
the body key part detection unit is used for detecting body key parts of the target image to obtain a body key area of the target image;
and the gesture recognition unit is used for performing gesture recognition on the body key area to obtain a gesture recognition result of the target object.
8. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the gesture recognition method according to any one of claims 1 to 6.
9. A computer readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the gesture recognition method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, performs the steps in the gesture recognition method according to any one of claims 1 to 6.
CN202210166611.8A 2022-02-23 2022-02-23 Gesture recognition method and related equipment Pending CN114549809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210166611.8A CN114549809A (en) 2022-02-23 2022-02-23 Gesture recognition method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210166611.8A CN114549809A (en) 2022-02-23 2022-02-23 Gesture recognition method and related equipment

Publications (1)

Publication Number Publication Date
CN114549809A true CN114549809A (en) 2022-05-27

Family

ID=81676929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210166611.8A Pending CN114549809A (en) 2022-02-23 2022-02-23 Gesture recognition method and related equipment

Country Status (1)

Country Link
CN (1) CN114549809A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116766213A (en) * 2023-08-24 2023-09-19 烟台大学 Bionic hand control method, system and equipment based on image processing
CN116766213B (en) * 2023-08-24 2023-11-03 烟台大学 Bionic hand control method, system and equipment based on image processing

Similar Documents

Publication Publication Date Title
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
WO2020182121A1 (en) Expression recognition method and related device
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
CN104463117B (en) A kind of recognition of face sample collection method and system based on video mode
EP4053735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
CN108596102B (en) RGB-D-based indoor scene object segmentation classifier construction method
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN106096542B (en) Image video scene recognition method based on distance prediction information
Gong et al. Advanced image and video processing using MATLAB
CN109033954A (en) A kind of aerial hand-written discrimination system and method based on machine vision
CN112232258B (en) Information processing method, device and computer readable storage medium
CN112507918B (en) Gesture recognition method
CN110796018A (en) Hand motion recognition method based on depth image and color image
CN110807362A (en) Image detection method and device and computer readable storage medium
CN113297956B (en) Gesture recognition method and system based on vision
CN109241810A (en) Construction method and device, the storage medium of virtual role image
CN103105924A (en) Man-machine interaction method and device
CN112101344B (en) Video text tracking method and device
CN114241379A (en) Passenger abnormal behavior identification method, device and equipment and passenger monitoring system
CN115131849A (en) Image generation method and related device
CN114445853A (en) Visual gesture recognition system recognition method
CN113298018A (en) False face video detection method and device based on optical flow field and facial muscle movement
CN114973349A (en) Face image processing method and training method of face image processing model
CN112734747B (en) Target detection method and device, electronic equipment and storage medium
CN114549809A (en) Gesture recognition method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination