CN113705511A - Gesture recognition method and device - Google Patents

Gesture recognition method and device

Info

Publication number
CN113705511A
Authority
CN
China
Prior art keywords
hand
gesture
convolution
feature
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111028923.4A
Other languages
Chinese (zh)
Inventor
关本立
欧俊文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ava Electronic Technology Co Ltd
Original Assignee
Ava Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ava Electronic Technology Co Ltd filed Critical Ava Electronic Technology Co Ltd
Priority to CN202111028923.4A
Publication of CN113705511A
Pending legal-status Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a gesture recognition method and device. After a captured hand image is obtained, it is input into a convolutional feature extraction network, from which a plurality of convolution features are obtained. Each convolution feature is then reduced to a vector of a corresponding dimension, the vectors are concatenated into a feature vector to be compared, and this vector is finally matched against a preset gesture library to obtain the recognition result. The preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to them. By comparing the feature vector to be compared with these multi-dimensional feature vectors, the gesture action shown in the captured hand image is determined, so gestures are recognized accurately. At the same time, because the convolutional feature extraction network can be trained in advance, the detection accuracy of gesture recognition is guaranteed.

Description

Gesture recognition method and device
Technical Field
The invention relates to the technical field of image recognition, and in particular to a gesture recognition method and device.
Background
Gesture recognition is a topic in computer science and language technology that aims to interpret human gestures through mathematical algorithms. To implement gesture recognition, an image of the gesture is acquired, hand detection and gesture segmentation are performed on the image, and static or dynamic gesture recognition is then carried out.
In video teaching scenarios, the gestures of teachers or students also need to be detected and recognized. Two detection approaches are common. The first is single-stage detection and recognition, which takes a static picture of the gesture and locates and classifies the gesture by feature regression. The second is cascaded detection and recognition, which first locates candidate target regions from the image information and then classifies them according to the information in each candidate region. However, single-stage detection and recognition can only detect a closed set of gesture categories and its detection accuracy is low, while cascaded detection and recognition is computationally expensive and therefore slow.
Conventional gesture recognition methods based on static images therefore have clear shortcomings.
Disclosure of Invention
Accordingly, it is necessary to provide a gesture recognition method and device that overcome the defects of conventional gesture recognition methods based on static images.
A gesture recognition method comprises the following steps:
acquiring a captured hand image;
inputting the captured hand image into a convolutional feature extraction network to obtain a plurality of convolution features from the network, wherein each convolution feature corresponds to a different downsampling factor;
reducing each convolution feature to a vector of a corresponding dimension and concatenating the vectors to obtain a feature vector to be compared;
comparing the feature vector to be compared with a preset gesture library to obtain a recognition result, wherein the preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to the gesture actions.
According to this gesture recognition method, the captured hand image is input into the convolutional feature extraction network, from which a plurality of convolution features are obtained. Each convolution feature is reduced to a vector of a corresponding dimension, the vectors are concatenated into a feature vector to be compared, and this vector is matched against the preset gesture library to obtain the recognition result. By comparing the feature vector to be compared with the multi-dimensional feature vectors in the library, the gesture action shown in the captured hand image is determined, so gestures are recognized accurately. At the same time, because the convolutional feature extraction network can be trained in advance, the detection accuracy is guaranteed while the amount of computation is kept low.
In one embodiment, the process of obtaining the captured hand image further comprises the step of:
performing image preprocessing on the captured hand image.
In one embodiment, the process of inputting the captured hand image into the convolutional feature extraction network comprises the steps of:
inputting the captured hand image into a hand detection network to obtain the coordinates of the region box containing the hand and the hand classification confidence;
determining a hand detection region of the captured hand image according to the region box coordinates and the hand classification confidence;
inputting the hand detection region into the convolutional feature extraction network.
In one embodiment, the hand detection network includes a convolutional feature extraction sub-network and a multi-size feature fusion sub-network.
In one embodiment, the downsampling factor is the Nth power of 2, where N is a natural number greater than 1.
In one embodiment, the corresponding dimension is the Mth power of 2, where M is a natural number not less than 6.
In one embodiment, the feature vector to be compared is a 256-dimensional vector.
A gesture recognition apparatus comprises:
a picture acquisition module for acquiring a captured hand image;
a picture transmission module for inputting the captured hand image into a convolutional feature extraction network to obtain a plurality of convolution features, wherein each convolution feature corresponds to a different downsampling factor;
a vector acquisition module for reducing each convolution feature to a vector of a corresponding dimension and concatenating the vectors to obtain a feature vector to be compared;
a result comparison module for comparing the feature vector to be compared with a preset gesture library to obtain a recognition result, wherein the preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to the gesture actions.
The gesture recognition apparatus obtains the recognition result in the same way: the captured hand image is input into the convolutional feature extraction network, the resulting convolution features are reduced and concatenated into a feature vector to be compared, and this vector is matched against the preset gesture library. Because the convolutional feature extraction network can be trained in advance, the detection accuracy of gesture recognition is likewise guaranteed.
A computer storage medium has computer instructions stored thereon; when executed by a processor, the instructions implement the gesture recognition method of any of the above embodiments.
A computer device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the gesture recognition method of any of the above embodiments when executing the program.
Drawings
FIG. 1 is a flow diagram of a gesture recognition method according to an embodiment;
FIG. 2 is a flow chart of a gesture recognition method according to another embodiment;
FIG. 3 is a schematic diagram of a convolutional feature extraction network structure according to an embodiment;
FIG. 4 is a block diagram of a gesture recognition apparatus according to an embodiment;
FIG. 5 is a schematic diagram of an internal structure of a computer according to an embodiment.
Detailed Description
For a better understanding of the objects, technical solutions, and effects of the present invention, the invention is further explained below with reference to the accompanying drawings and embodiments. The examples described below serve only to explain the present invention and are not intended to limit it.
The embodiment of the invention provides a gesture recognition method.
Fig. 1 is a flowchart of a gesture recognition method according to an embodiment. As shown in fig. 1, the gesture recognition method includes steps S100 to S103:
S100, acquiring a captured hand image;
S101, inputting the captured hand image into a convolutional feature extraction network to obtain a plurality of convolution features, wherein each convolution feature corresponds to a different downsampling factor;
S102, reducing each convolution feature to a vector of a corresponding dimension and concatenating the vectors to obtain a feature vector to be compared;
S103, comparing the feature vector to be compared with a preset gesture library to obtain a recognition result, wherein the preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to the gesture actions.
The captured hand image is obtained by photographing the subject's hand. In practice, it may be acquired from a camera, a storage device, or a similar source, and it contains the gesture action to be recognized.
In one embodiment, fig. 2 is a flowchart of a gesture recognition method according to another embodiment. As shown in fig. 2, the process of acquiring the captured hand image in step S100 further includes step S200:
S200, performing image preprocessing on the captured hand image.
Preprocessing the captured hand image facilitates the subsequent image feature extraction. Image preprocessing includes image cropping, size scaling, noise filtering, and the like. In one embodiment, the image preprocessing in step S200 includes resizing the captured hand image to a set size, which is consistent with the picture size used when the convolutional feature extraction network was trained. In a preferred embodiment, the captured hand image is resized to a set size of 128 x 128.
In one embodiment, the image preprocessing further includes data normalization.
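As an illustration, a minimal sketch of this preprocessing step follows, assuming OpenCV is used; the text does not specify the normalization constants, so scaling to [0, 1] is an assumption.

import cv2
import numpy as np

SET_SIZE = 128  # matches the set size used when training the feature extraction network

def preprocess(image: np.ndarray) -> np.ndarray:
    """Resize a captured hand image to the set size and normalize pixel values."""
    resized = cv2.resize(image, (SET_SIZE, SET_SIZE))
    # The text only states that data normalization is applied;
    # scaling to [0, 1] is an assumed, common choice.
    return resized.astype(np.float32) / 255.0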
In step S101, the captured hand image is input into the convolutional feature extraction network for convolution feature extraction. The network is trained in advance on gesture image samples prepared for a variety of gestures. For example, 20,000 gesture image samples are prepared, covering a number of human gestures captured in various environments; the images are then grouped according to the different gesture actions, with 50 to 200 pictures per gesture serving as samples. During training, a fully connected layer is appended after the network's output feature vector, and its output dimension is set to the number of classes in the dataset. The convolutional feature extraction network can thus be regarded as a classification network: the gesture image samples are input into the model, and the weight parameters of each layer are adjusted continuously according to the model's outputs and the sample labels, so that the outputs steadily approach the labels.
After training is finished, the final fully connected layer is removed. Inputting a captured hand image into the convolutional feature extraction network then completes the convolution feature extraction and yields the gesture features, after which the vector acquisition of step S102 is performed.
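A minimal sketch of this train-then-remove-the-head arrangement, assuming a PyTorch-style implementation; the names backbone, feature_dim, and num_classes are illustrative:

import torch.nn as nn

class FeatureExtractorWithHead(nn.Module):
    """Feature network plus a temporary fully connected head used only during training."""
    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone  # outputs the gesture feature vector
        self.head = nn.Linear(feature_dim, num_classes)  # output dimension = number of classes

    def forward(self, x):
        return self.head(self.backbone(x))

# After training, the head is discarded and only the backbone is kept:
# feature_extractor = model.backbone  # now outputs feature vectors rather than class scores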
In one embodiment, fig. 3 is a schematic structural diagram of the convolutional feature extraction network. As shown in fig. 3, the network includes a feature extraction backbone network, a spatial attention residual extraction module, and a spatial channel attention module. The backbone network extracts the convolution features, the spatial attention residual extraction module raises the dimension of the feature extraction, and the spatial channel attention module adjusts the contribution weights of the different channels of the convolution features.
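The text does not give the internals of the spatial channel attention module; a squeeze-and-excitation style channel re-weighting block is one common realization, sketched below under that assumption:

import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style re-weighting of channel contributions."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # one summary value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weight in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # channels that contribute more are weighted more heavily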
In one embodiment, as shown in fig. 2, the process of inputting the captured hand image into the convolutional feature extraction network in step S101 includes steps S201 to S203:
S201, inputting the captured hand image into a hand detection network to obtain the coordinates of the region box containing the hand and the hand classification confidence;
S202, determining a hand detection region of the captured hand image according to the region box coordinates and the hand classification confidence;
S203, inputting the hand detection region into the convolutional feature extraction network.
The captured hand image is input into the hand detection network, which extracts the coordinates of the region box containing the hand and the hand classification confidence, so that a hand detection region can be extracted from the image. In one embodiment, the hand detection network includes a convolutional feature extraction sub-network and a multi-size feature fusion sub-network.
Because the hand detection network fuses convolution features of different sizes, it can locate hands of different sizes at any position in the input image; it detects hand targets of different sizes anywhere in the captured hand image, without restricting the distance between the hand and the camera.
The output hand classification confidence is used to filter out regions that merely resemble hand features, reducing the computation of the subsequent gesture feature extraction. The filtering threshold is set according to the actual use case: to detect as many hand regions in the image as possible, the confidence threshold can be lowered; to reduce the number of reported regions, it can be raised, or only the region with the highest confidence can be output.
Likewise, before the hand detection region is input into the convolutional feature extraction network, it is resized to the set size. Extracting the hand detection region removes irrelevant background information, so the extracted feature vector represents the hand region data more strongly; the uniform size lets the convolutional feature extraction network compute feature vectors of identical dimension for the subsequent gesture category comparison.
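For illustration, the confidence filtering and region extraction might look like the following sketch; the threshold value and the helper name select_hand_regions are assumptions, not part of the original:

import cv2
import numpy as np

def select_hand_regions(image, boxes, scores, conf_threshold=0.5, top1_only=False, set_size=128):
    """Filter detected hand boxes by classification confidence and crop the regions.

    boxes: (N, 4) array of [x1, y1, x2, y2] region box coordinates;
    scores: (N,) hand classification confidences.
    """
    keep = scores >= conf_threshold
    boxes, scores = boxes[keep], scores[keep]
    if top1_only and len(scores) > 0:
        boxes = boxes[[int(scores.argmax())]]  # keep only the most confident region
    crops = []
    for x1, y1, x2, y2 in boxes.astype(int):
        crop = image[y1:y2, x1:x2]  # remove irrelevant background
        crops.append(cv2.resize(crop, (set_size, set_size)))  # unify to the set size
    return crops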
As shown in fig. 3, each layer of the convolutional feature extraction network extracts convolution features at a different downsampling factor. In one embodiment, the downsampling factor is the Nth power of 2, where N is a natural number greater than 1. As shown in fig. 3, N is chosen as 3, 4, and 5, giving 8x, 16x, and 32x convolution features.
After the convolution features for each downsampling factor are determined, each is converted into a vector of the corresponding dimension. In one embodiment, the corresponding dimension is the Mth power of 2, where M is a natural number not less than 6. As shown in fig. 3, M takes the values 6 and 7: the 8x and 16x convolution features are each converted into a 64-dimensional vector, and the 32x convolution features are converted into a 128-dimensional vector.
Once the corresponding dimension vectors are determined, they are concatenated to obtain the feature vector to be compared. As shown in fig. 3, the two 64-dimensional vectors and the 128-dimensional vector are concatenated into a 256-dimensional vector.
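A minimal sketch of this reduce-and-concatenate step follows; the text does not specify how the dimension reduction is performed, so global average pooling followed by a linear projection is an assumption:

import torch
import torch.nn as nn

class MultiScaleEmbedding(nn.Module):
    """Reduce the 8x, 16x and 32x convolution features and concatenate them
    into the 256-dimensional feature vector to be compared."""
    def __init__(self, c8: int, c16: int, c32: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # assumed reduction: global average pooling
        self.fc8 = nn.Linear(c8, 64)     # 8x features  -> 64-dimensional vector
        self.fc16 = nn.Linear(c16, 64)   # 16x features -> 64-dimensional vector
        self.fc32 = nn.Linear(c32, 128)  # 32x features -> 128-dimensional vector

    def forward(self, f8, f16, f32):
        v8 = self.fc8(self.pool(f8).flatten(1))
        v16 = self.fc16(self.pool(f16).flatten(1))
        v32 = self.fc32(self.pool(f32).flatten(1))
        return torch.cat([v8, v16, v32], dim=1)  # 64 + 64 + 128 = 256 dimensions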
After the vector concatenation of step S102 is completed, the feature vector to be compared is matched against the preset gesture library, which comprises gesture actions and the multi-dimensional feature vectors corresponding to them.
The multi-dimensional feature vector corresponding to each gesture action has the same dimension as the feature vector to be compared. The comparison with the preset gesture library is completed by matching the feature vector to be compared against the multi-dimensional feature vectors: the gesture action whose multi-dimensional feature vector has the highest similarity to the feature vector to be compared is taken as the recognition result for the captured hand image.
The gesture actions and multi-dimensional feature vectors of the preset gesture library, and the mapping between them, can be determined by pre-training. In one embodiment, the preset gesture library comprises a gesture library feature matrix composed of the gesture actions and their multi-dimensional feature vectors.
The gesture library feature matrix can be built from a number of predefined gesture actions: several standard action pictures of the set size are prepared for each gesture, passed through the convolutional feature extraction network for convolution feature extraction, and converted into the corresponding multi-dimensional feature vectors, which together form the gesture library feature matrix.
Based on the above, convolution features extracted at different downsampling factors have different receptive fields. Shallow features have small receptive fields and are used to distinguish small differences in local areas of the captured hand image; deep features have large receptive fields and are used to distinguish differences in the overall gesture contour. In this embodiment, the convolutional feature extraction network outputs three layers of convolution features with different receptive field sizes.
Different gesture actions require attention to different regions. Because shallow features have small receptive fields, giving every region the same attention could make the differences between classes too small to separate them. A spatial channel attention module is therefore added when the shallow features are extracted; during training it re-adjusts the output weight of each region, so that regions contributing more to class separation receive larger weights and regions contributing less receive smaller ones.
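The spatial part of this region re-weighting could be realized as below; this CBAM-style block is an assumed illustration, since the text does not give the module's exact structure:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Re-weighting of spatial regions in a feature map."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Summarize each spatial location by its channel-wise mean and maximum.
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        weights = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights  # regions contributing more to class separation get larger weights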
In step S103, comparing the feature vector to be compared with the gesture library feature matrix includes the following steps:
computing the similarity between the feature vector to be compared and each row vector of the gesture library feature matrix; selecting the maximum similarity and the index of the corresponding row; and judging whether the maximum similarity exceeds a set recognition threshold. If the maximum similarity exceeds the recognition threshold, the index of the corresponding row is output and the corresponding gesture action is determined from it; otherwise, a "gesture not recognized" result is output.
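A minimal sketch of this matching step follows; cosine similarity and the threshold value of 0.8 are assumptions, since the text specifies only "similarity" and a set recognition threshold:

import numpy as np

def match_gesture(query, library, labels, threshold=0.8):
    """Match a 256-dimensional feature vector against the gesture library feature matrix.

    library: (K, 256) matrix with one row per registered gesture sample;
    labels: the gesture action corresponding to each row.
    """
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q  # cosine similarity against every row
    best = int(sims.argmax())
    if sims[best] > threshold:
        return labels[best]  # gesture action of the best-matching row
    return None  # gesture not recognized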
In the gesture recognition method of any of the above embodiments, the captured hand image is input into the convolutional feature extraction network, the resulting convolution features are reduced and concatenated into a feature vector to be compared, and this vector is matched against the preset gesture library to obtain the recognition result. Gestures are therefore recognized accurately, and the pre-trained convolutional feature extraction network guarantees the detection accuracy.
The embodiment of the invention also provides a gesture recognition device.
Fig. 4 is a block diagram of a gesture recognition apparatus according to an embodiment. As shown in fig. 4, the apparatus includes modules 100 to 103:
a picture acquisition module 100 for acquiring a captured hand image;
a picture transmission module 101 for inputting the captured hand image into the convolutional feature extraction network to obtain a plurality of convolution features, wherein each convolution feature corresponds to a different downsampling factor;
a vector acquisition module 102 for reducing each convolution feature to a vector of a corresponding dimension and concatenating the vectors to obtain a feature vector to be compared;
a result comparison module 103 for comparing the feature vector to be compared with a preset gesture library to obtain a recognition result, wherein the preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to the gesture actions.
The gesture recognition apparatus works in the same way as the method above: the captured hand image is input into the convolutional feature extraction network, the resulting convolution features are reduced and concatenated into a feature vector to be compared, and this vector is matched against the preset gesture library to determine the gesture action. The pre-trained convolutional feature extraction network likewise guarantees the detection accuracy of gesture recognition.
The embodiment of the invention also provides a computer storage medium on which computer instructions are stored; when executed by a processor, the instructions implement the gesture recognition method of any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructed by a computer program, which can be stored in a non-volatile computer-readable storage medium; when executed, the program can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.
Corresponding to the computer storage medium, in one embodiment, a computer device is further provided, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement any one of the gesture recognition methods in the embodiments.
The computer device may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a gesture recognition method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described; nevertheless, any combination of these technical features should be considered within the scope of this specification as long as it contains no contradiction.
The above examples show only some embodiments of the present invention and are described in specific detail, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art could make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A gesture recognition method, comprising the steps of:
acquiring a captured hand image;
inputting the captured hand image into a convolutional feature extraction network to obtain a plurality of convolution features from the network, wherein each convolution feature corresponds to a different downsampling factor;
reducing each convolution feature to a vector of a corresponding dimension and concatenating the vectors to obtain a feature vector to be compared;
comparing the feature vector to be compared with a preset gesture library to obtain a recognition result, wherein the preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to the gesture actions.
2. The gesture recognition method according to claim 1, wherein the process of obtaining the captured hand image further comprises the step of:
performing image preprocessing on the captured hand image.
3. The gesture recognition method according to claim 1, wherein the process of inputting the captured hand image into the convolutional feature extraction network comprises the steps of:
inputting the captured hand image into a hand detection network to obtain the coordinates of the region box containing the hand and the hand classification confidence;
determining a hand detection region of the captured hand image according to the region box coordinates and the hand classification confidence;
inputting the hand detection region into the convolutional feature extraction network.
4. The gesture recognition method according to claim 3, wherein the hand detection network comprises a convolutional feature extraction sub-network and a multi-size feature fusion sub-network.
5. The gesture recognition method according to claim 1, wherein the downsampling factor is the Nth power of 2, where N is a natural number greater than 1.
6. The gesture recognition method according to claim 1, wherein the corresponding dimension is the Mth power of 2, where M is a natural number not less than 6.
7. The gesture recognition method according to any one of claims 1 to 6, wherein the feature vector to be compared is a 256-dimensional vector.
8. A gesture recognition apparatus, comprising:
a picture acquisition module for acquiring a captured hand image;
a picture transmission module for inputting the captured hand image into a convolutional feature extraction network to obtain a plurality of convolution features, wherein each convolution feature corresponds to a different downsampling factor;
a vector acquisition module for reducing each convolution feature to a vector of a corresponding dimension and concatenating the vectors to obtain a feature vector to be compared;
a result comparison module for comparing the feature vector to be compared with a preset gesture library to obtain a recognition result, wherein the preset gesture library comprises gesture actions and the multi-dimensional feature vectors corresponding to the gesture actions.
9. A computer storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement a gesture recognition method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the gesture recognition method according to any one of claims 1 to 7 when executing the program.
CN202111028923.4A 2021-09-02 2021-09-02 Gesture recognition method and device Pending CN113705511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111028923.4A CN113705511A (en) 2021-09-02 2021-09-02 Gesture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111028923.4A CN113705511A (en) 2021-09-02 2021-09-02 Gesture recognition method and device

Publications (1)

Publication Number Publication Date
CN113705511A 2021-11-26

Family

ID=78657797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111028923.4A Pending CN113705511A (en) 2021-09-02 2021-09-02 Gesture recognition method and device

Country Status (1)

Country Link
CN (1) CN113705511A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2980728A1 (en) * 2014-08-01 2016-02-03 Imersivo, S.L. Procedure for identifying a hand gesture
CN105893959A (en) * 2016-03-30 2016-08-24 北京奇艺世纪科技有限公司 Gesture identifying method and device
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
WO2019201035A1 (en) * 2018-04-16 2019-10-24 腾讯科技(深圳)有限公司 Method and device for identifying object node in image, terminal and computer readable storage medium
CN109359538A (en) * 2018-09-14 2019-02-19 广州杰赛科技股份有限公司 Training method, gesture identification method, device and the equipment of convolutional neural networks
CN109902577A (en) * 2019-01-25 2019-06-18 华中科技大学 A kind of construction method of lightweight gestures detection convolutional neural networks model and application
CN112766028A (en) * 2019-11-06 2021-05-07 深圳云天励飞技术有限公司 Face fuzzy processing method and device, electronic equipment and storage medium
CN111950460A (en) * 2020-08-13 2020-11-17 电子科技大学 Muscle strength self-adaptive stroke patient hand rehabilitation training action recognition method
CN112464860A (en) * 2020-12-10 2021-03-09 深圳市优必选科技股份有限公司 Gesture recognition method and device, computer equipment and storage medium
CN113033398A (en) * 2021-03-25 2021-06-25 深圳市康冠商用科技有限公司 Gesture recognition method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯家文 (FENG Jiawen): "Application of a dual-channel convolutional neural network in static gesture recognition", 《计算机工程与应用》 (Computer Engineering and Applications), no. 2018, 24 August 2017 (2017-08-24), pages 148-152 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117612247A (en) * 2023-11-03 2024-02-27 重庆利龙中宝智能技术有限公司 Dynamic and static gesture recognition method based on knowledge distillation

Similar Documents

Publication Publication Date Title
US10467459B2 (en) Object detection based on joint feature extraction
CN110909651B (en) Method, device and equipment for identifying video main body characters and readable storage medium
CN109165589B (en) Vehicle weight recognition method and device based on deep learning
KR101896357B1 (en) Method, device and program for detecting an object
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111612822B (en) Object tracking method, device, computer equipment and storage medium
CN111832581B (en) Lung feature recognition method and device, computer equipment and storage medium
JP6756406B2 (en) Image processing equipment, image processing method and image processing program
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN112861785B (en) Instance segmentation and image restoration-based pedestrian re-identification method with shielding function
CN112668374A (en) Image processing method and device, re-recognition network training method and electronic equipment
CN111259823A (en) Pornographic image identification method based on convolutional neural network
CN111951283A (en) Medical image identification method and system based on deep learning
CN112541394A (en) Black eye and rhinitis identification method, system and computer medium
CN112668462A (en) Vehicle loss detection model training method, vehicle loss detection device, vehicle loss detection equipment and vehicle loss detection medium
Lahiani et al. Hand pose estimation system based on Viola-Jones algorithm for android devices
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
Andiani et al. Face recognition for work attendance using multitask convolutional neural network (MTCNN) and pre-trained facenet
CN113705511A (en) Gesture recognition method and device
CN113378852A (en) Key point detection method and device, electronic equipment and storage medium
Patil et al. Techniques of deep learning for image recognition
CN117152625A (en) Remote sensing small target identification method, system, equipment and medium based on CoordConv and Yolov5
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN113705489B (en) Remote sensing image fine-granularity airplane identification method based on priori regional knowledge guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination