CN115188080A

CN115188080A - Traffic police gesture recognition method and system based on skeleton recognition and gated recurrent network

Info

Publication number: CN115188080A
Application number: CN202210884263.8A
Authority: CN
Inventors: 铁治欣; 王登文; 陈燕兵; 陶灵兵
Original assignee: Zhejiang Sci Tech University ZSTU
Current assignee: Zhejiang Sci Tech University ZSTU
Priority date: 2022-07-26
Filing date: 2022-07-26
Publication date: 2022-10-14

Abstract

The invention discloses a traffic police gesture recognition method and system based on skeleton recognition and a gated recurrent network, and relates to the technical field of image recognition. The method comprises the following steps: acquiring traffic police command gesture images; extracting joint points from the images with the OpenPose algorithm to obtain joint point data; screening the joint point data for the key points that have a large influence on traffic police gesture recognition, and using the screened key point data as a training set; and inputting the training set into a GRU network to extract the temporal features of the traffic police command gestures. The method is well suited to recognizing traffic police gestures against complex backgrounds; the recognition rate of the system reaches 91.51%, and the method has excellent real-time performance.

Description

Traffic police gesture recognition method and system based on skeleton recognition and gated recurrent network

Technical Field

The invention relates to the technical field of image recognition, and in particular to a traffic police gesture recognition method and system based on skeleton recognition and a gated recurrent network.

Background

With the gradual development of autonomous driving technology, problems in its sub-fields are attracting increasing attention from researchers, yet automatic recognition of traffic police command gestures for autonomous driving has been studied relatively little. In daily traffic, on congested road sections or when traffic lights malfunction, traffic police direct traffic with command gestures. When the recipient of a gesture is an autonomous vehicle, the vehicle must recognize the gesture with reasonable accuracy within a limited time. In the prior art there are two main approaches to traffic police gesture recognition: body-sensor-based approaches and vision-based approaches. Body-sensor-based approaches improve recognition precision and accuracy to some extent, but their operating conditions are impractical to deploy. Vision-based approaches avoid the complex equipment that body sensors require, but their accuracy and real-time performance do not meet the requirements of real scenes. Therefore, recognizing traffic command gestures efficiently and accurately in support of autonomous driving is one of the key problems for those skilled in the art.

Disclosure of Invention

In view of this, the present invention provides a traffic police gesture recognition method and system based on skeleton recognition and a gated recurrent network, so as to solve the problems described in the background.

To achieve this purpose, the invention adopts the following technical solution: a traffic police gesture recognition method based on skeleton recognition and a gated recurrent network, comprising the following specific steps:

acquiring a traffic police command gesture image;

performing joint point extraction on the traffic police command gesture image by using an OpenPose algorithm to obtain joint point data;

screening key point data which has large influence on the gesture recognition of the traffic police in the joint point data, and taking the key point data as a training set;

and inputting the training set into a GRU network to extract the time sequence characteristics of the traffic police command gestures.

Optionally, the step of obtaining the traffic police command gesture image is as follows: collecting video clips of traffic police command gestures under different visual angles, different scenes and different speeds; and dividing the video segments into continuous image sequences to further obtain the traffic police command gesture images.

Optionally, the step of performing joint point extraction on the traffic police command gesture image by using the OpenPose algorithm includes: extracting image features from the traffic police command gesture image through the backbone network VGG19 to obtain first features; inputting the first features into the stage modules, where the stages are connected in series and every stage module has the same structure and function; each stage module is divided into two branches, the first branch generating part affinity fields (PAF) and computing a loss, and the second branch generating part confidence maps (PCM) and computing a loss; and adding the losses of the first branch and the second branch to obtain the final loss.

Optionally, the method further includes delaying the time of the effective gesture marking in the process of data marking of the traffic police command gesture image.

In another aspect, a traffic police gesture recognition system based on skeleton recognition and a gated recurrent network comprises a data acquisition module, a feature extraction module, a screening module and a gesture recognition module; wherein,

the data acquisition module is used for acquiring a traffic police command gesture image;

the feature extraction module is used for extracting joint points of the traffic police command gesture image by utilizing an OpenPose algorithm to obtain joint point data;

the screening module is used for screening key point data which has a large influence on the recognition of the traffic police gesture from the joint point data and taking the key point data as a training set;

and the gesture recognition module is used for inputting the training set into the GRU network to extract the time sequence characteristics of the traffic police command gesture.

Optionally, the system further comprises a data preprocessing module, configured to perform time delay processing during the process of performing data marking on the traffic police command gesture image.

According to the above technical solution, compared with the prior art, the invention discloses a traffic police gesture recognition method and system based on skeleton recognition and a gated recurrent network: OpenPose is used to extract the joint points of the traffic police command gesture, the extracted joint points are appropriately screened, and the position information of the screened joint points is then passed into the gated recurrent unit (GRU) network to recognize the traffic police gesture.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a diagram of an OpenPose network architecture of the present invention;

FIG. 3 is an internal block diagram of a GRU of the present invention;

FIG. 4 is a schematic view of a photographing method and a view angle according to the present invention;

FIG. 5 is a system configuration diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a traffic police gesture recognition method based on skeleton recognition and a gated recurrent network, which, as shown in FIG. 1, comprises the following specific steps:

S1, acquiring a traffic police command gesture image;

S2, performing joint point extraction on the traffic police command gesture image by using the OpenPose algorithm to obtain joint point data;

S3, screening the joint point data for the key point data that has a large influence on traffic police gesture recognition, and taking the key point data as a training set;

S4, inputting the training set into the GRU network to extract the temporal features of the traffic police command gestures.

Further, the traffic police command gesture images are obtained as follows: video clips of traffic police command gestures are collected at different viewing angles, in different scenes and at different speeds, and the video clips are split into continuous image sequences to obtain the traffic police command gesture images. In this embodiment, 20 videos containing 8 gestures are recorded according to the Chinese traffic command gesture specification and split into continuous image sequences. OpenPose is used to extract the skeleton information in the images, including the position information of 25 human body key points; the system screens out from the 25 joint points the coordinates of the 14 joint points that have a large influence on the traffic command gesture and uses them as input for training and testing the GRU model.

Further, the key point data that has a large influence on traffic police gesture recognition is screened from the joint point data as follows. The human body key points extracted with OpenPose in the invention comprise 18 joint points, and the information for each is divided into key point position information and key point confidence information. In a traffic police command scene, the influence of facial information on the gesture judgment is almost negligible, so the four facial key points that have little influence on the command gesture are removed from the original 18 key points, and only the remaining joint point information is input into the network for training. This reduces the burden of data extraction and improves the efficiency of system training.

Further, OpenPose is a convolutional neural network built on the Caffe framework. It is mainly used to estimate human body pose, facial expression and hand gesture, offers good recognition accuracy and speed, and maintains a good recognition rate in some complex scenes. The structure of the OpenPose network is shown in FIG. 2.

As can be seen from FIG. 2, image features are first extracted from the traffic police command gesture image through the backbone network VGG19 to obtain first features, and the first features are input into the stages. The stages are modules connected in series, and every stage module has the same structure and function. Each stage module is divided into two branches: the first branch generates part affinity fields (PAF) and computes a loss, and the second branch generates part confidence maps (PCM) and computes a loss. The losses of the first branch and the second branch are added to obtain the final loss.
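The way the two branch losses combine can be illustrated with a small sketch. It is not the OpenPose implementation: the squared-error loss, the flat-list representation of the maps, and the summation over all stages are assumptions made for illustration, and the names `l2_loss` and `total_loss` are hypothetical.

```python
def l2_loss(pred, target):
    """Sum of squared differences between two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def total_loss(stage_outputs, paf_target, pcm_target):
    """Add the PAF-branch and PCM-branch losses of every stage.

    stage_outputs: list of (paf_pred, pcm_pred) pairs, one pair per stage.
    Summing over stages is an assumption of this sketch.
    """
    total = 0.0
    for paf_pred, pcm_pred in stage_outputs:
        total += l2_loss(paf_pred, paf_target)  # branch 1: part affinity fields
        total += l2_loss(pcm_pred, pcm_target)  # branch 2: part confidence maps
    return total


# Two toy stages; the second stage matches the targets exactly.
paf_gt, pcm_gt = [1.0, 0.0], [0.0, 1.0]
stages = [([0.5, 0.0], [0.0, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
print(total_loss(stages, paf_gt, pcm_gt))  # 0.5
```

In the toy data only the first stage contributes, 0.25 from each branch, which is why the total is 0.5.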

In this embodiment, the four joint points that have little influence on the traffic police command gesture, namely joint points 8-11, are removed, and only the 14 key points that have a large influence on traffic police gesture recognition are used for training and testing the model, which can improve the accuracy and robustness of the model.

In the process of labeling the traffic police command gesture images, the time at which the valid gesture label begins is delayed.
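The screening step can be sketched as a simple index filter. This is a minimal illustration, assuming each frame is a list of 18 `(x, y, confidence)` triples and that only the coordinates are kept as network input; the function name `screen_keypoints` and the toy frame are hypothetical, and the dropped indices are the ones quoted in the text, which depend on the joint layout actually in use.

```python
def screen_keypoints(frame, drop):
    """Keep (x, y) of every joint whose index is not in `drop`.

    frame: list of (x, y, confidence) triples, one per joint.
    Returns a flat list [x0, y0, x1, y1, ...] for the kept joints.
    """
    kept = []
    for idx, (x, y, _conf) in enumerate(frame):
        if idx not in drop:
            kept.extend([x, y])
    return kept


# Hypothetical 18-joint frame: joint i sits at (i, 10 * i) with confidence 0.9.
frame = [(float(i), 10.0 * i, 0.9) for i in range(18)]
DROP = {8, 9, 10, 11}  # indices quoted in the text; adjust to the joint layout in use
features = screen_keypoints(frame, DROP)
print(len(features))  # 28 values = 14 joints x 2 coordinates
```

Passing the dropped indices as a parameter keeps the sketch independent of any particular key point numbering.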

Furthermore, because the gated recurrent unit (GRU) network converges quickly and recognizes accurately, the invention selects the GRU as the main classification network; its internal structure is shown in FIG. 3. Training uses the existing GRU as the backbone, and the invention adds its own design on top of it. For example, to increase the complexity of the incoming data, two random tensors of the same size as the key point information are added; in addition, dropout is applied to the fully connected feature information, which reduces the risk of overfitting. The GRU is similar to the LSTM but has one gate fewer and correspondingly fewer parameters. The system is not a directly trained end-to-end network. The training process comprises: selecting and preprocessing the data set; extracting human body key point information with OpenPose and keeping only the 14 joint points that have a large influence on the traffic police command gesture; and finally passing the information of these 14 joint points into the GRU for training. The training process is shown in Algorithm 1.
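Separately from Algorithm 1, the recurrence that a GRU applies at each frame can be sketched in plain Python. This is a textbook GRU step, not the network of the invention: the input size of 28 (14 joints with two coordinates each), the hidden size of 4 and the all-zero weights are assumptions chosen so that the result can be checked by hand, and all function names are hypothetical. The final blending rule is one common convention; some libraries swap the roles of z and 1 - z.

```python
import math


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, U, b, x, h):
    """W x + U h + b, element by element."""
    return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h, p):
    """One GRU time step with update gate z, reset gate r and a candidate state."""
    z = [sigmoid(v) for v in affine(p["Wz"], p["Uz"], p["bz"], x, h)]
    r = [sigmoid(v) for v in affine(p["Wr"], p["Ur"], p["br"], x, h)]
    rh = [ri * hi for ri, hi in zip(r, h)]  # reset gate scales the old state
    cand = [math.tanh(v) for v in affine(p["Wh"], p["Uh"], p["bh"], x, rh)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def zero_params(n_in, n_hid):
    """All-zero parameters, used only to make the demo easy to check by hand."""
    p = {}
    for g in "zrh":
        p["W" + g] = [[0.0] * n_in for _ in range(n_hid)]
        p["U" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b" + g] = [0.0] * n_hid
    return p


params = zero_params(28, 4)
h = gru_step([0.0] * 28, [1.0] * 4, params)
print(h)  # [0.5, 0.5, 0.5, 0.5]
```

With zero weights both gates evaluate to 0.5 and the candidate state to 0, so the previous state is simply halved. A trained network would instead carry learned weights and run this step once per frame of the key point sequence.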

Furthermore, to verify the effect of the invention on traffic police gesture recognition, two kinds of experiments are set up: an offline edit distance experiment, which tests the accuracy of traffic police gesture recognition, and a model training convergence speed comparison experiment, which tests how fast each network converges during training. Together they give a comprehensive picture of the performance of each model and its suitability for recognizing traffic police command gestures.

Data set acquisition: in a conventional environment, 21 videos of different subjects are shot from a frontal viewing angle with a Microsoft Kinect camera at 15 frames per second. Each video contains the 8 common Chinese traffic command gestures. The shooting setup is shown in FIG. 4; the device is placed at approximately the waist height of the subject. The images are resized to 512 × 512, and 132831 frames are collected in total.

The raw data set contains gesture images and standby images for each video and, as shown in Table 1, consists of roughly 3350 active gestures and 1690 standby gestures. To make the experiment more general, a partition different from that of the original data set is used: the data is divided according to shooting scene and application scene into a training set, a verification set and a test set, as shown in Table 2.

TABLE 1

	Active gestures	Standby gestures	Number of videos
Training set	1565	787	10
Test set	1789	900	11
Sum	3354	1687	21

TABLE 2

	Training set	Verification set	Test set	Sum
Number of videos	11	3	7	21
Number of gestures	1757	479	1118	3354
Number of frames	69579	18975	44277	132831

Data time delay processing: during data labeling, the onset of each valid gesture label in the original traffic police gesture data set is delayed by 750 ms relative to the original annotation. With this delay, 10 frames of annotated labels are cut off, so the range of invalid frames is extended, that is, the gray area is lengthened and the valid gesture label appears later. Without the delay, a network trained on the original data is too sensitive to every movement of the traffic police and may output a result before the valid gesture has been held in place, so the system often makes wrong predictions. The experimental results show that delaying the valid labels noticeably improves the accuracy and stability of the network and makes the system more robust to movements related to the 8 traffic police gestures.
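One way to picture the delay processing is as a relabeling of per-frame annotations. The sketch below is an interpretation, not the labeling tool used in the invention: it assumes one integer label per frame with 0 meaning standby, and it assumes that only the onset of each gesture is pushed back while the end stays where it was. The function name `delay_labels` is hypothetical; the default of 10 frames is the figure quoted in the text.

```python
def delay_labels(labels, delay_frames=10, standby=0):
    """Relabel the first `delay_frames` frames of each active run as standby."""
    out = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == standby:
            i += 1
            continue
        # Find the end of the run of identical active labels starting at i.
        j = i
        while j < n and labels[j] == labels[i]:
            j += 1
        # Push the onset back; a run shorter than the delay disappears entirely.
        for k in range(i, min(i + delay_frames, j)):
            out[k] = standby
        i = j
    return out


labels = [0, 0, 3, 3, 3, 3, 3, 0, 5, 5]
print(delay_labels(labels, delay_frames=2))  # [0, 0, 0, 0, 3, 3, 3, 0, 0, 0]
```

In the example the five-frame gesture 3 keeps its last three frames, while the two-frame gesture 5 is shorter than the delay and is relabeled as standby throughout.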

Edit distance experiment: to compare the difference between two sequences, this embodiment introduces the edit distance as an evaluation criterion. The edit distance is commonly used to measure the similarity of two character strings, and in the NLP field it serves as a semantic similarity measure. Here it is used to evaluate how many changes (deletions, insertions and replacements) are needed to make the results for all test frames of a video identical to the ground truth. Let T_i be the total number of frames in video i, D_i the total number of deleted gestures in video i, I_i the total number of inserted gestures, and R_i the total number of replaced gestures; the recognition accuracy for a video is then expressed as follows:

In the experiment, all tests were performed in the same environment to ensure the objectivity and authenticity of the comparison data. The trained GRU model is used to predict the test set data, and the edit distance analysis is then performed; the results are shown in Table 3.

TABLE 3

The results show that, in the same environment and on the same data set, the GRU classification network is on the whole more accurate than the LSTM classification network on the same video. Across data sets from different scenes, the OpenPose + GRU approach still recognizes well in extreme environments such as weak illumination and complex backgrounds, but the recognition performance degrades when pedestrians interfere.
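The edit distance used in the experiment above can be computed with standard dynamic programming. The sketch below is a generic Levenshtein routine over gesture label sequences and makes no claim about the exact accuracy formula of the invention; the function name `edit_distance` and the example sequences are hypothetical. Its return value corresponds to the combined count of deletions, insertions and replacements.

```python
def edit_distance(pred, truth):
    """Minimum number of deletions, insertions and replacements turning pred into truth."""
    prev = list(range(len(truth) + 1))  # distances from an empty prediction
    for i, p in enumerate(pred, start=1):
        curr = [i]  # turning i predicted items into nothing takes i deletions
        for j, t in enumerate(truth, start=1):
            curr.append(min(
                prev[j] + 1,             # delete p from the prediction
                curr[j - 1] + 1,         # insert t into the prediction
                prev[j - 1] + (p != t),  # replace p with t, or keep a match
            ))
        prev = curr
    return prev[-1]


predicted = [1, 2, 2, 4, 5]  # hypothetical recognized gesture sequence
actual = [1, 2, 3, 4]        # hypothetical ground truth sequence
print(edit_distance(predicted, actual))  # 2
```

Here one replacement (the second 2 becomes 3) and one deletion (the trailing 5) are needed, giving a distance of 2.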

Comparative analysis with other algorithms: to compare the training efficiency of each classification network after OpenPose pose extraction, a convergence speed experiment was carried out for each network. To show the convergence speed and stability of the LSTM and the GRU more intuitively, the two classification networks were tested at different iteration counts. The experiments show that, because the GRU has one gate fewer than the LSTM and therefore fewer parameters, the GRU trains faster overall than the LSTM network. When the number of iterations approaches 10000, both networks become essentially stable; the oscillation interval of the GRU is [0.85, 1.1] and that of the LSTM is [0.8, 1.2], so the GRU classification network converges faster than the LSTM. Table 4 shows that extracting the joint point features of the traffic police gesture with OpenPose and extracting the temporal information of the joint points with the GRU network achieves a good recognition result.

TABLE 4

Method	Proposer	Accuracy
(Ours) OpenPose + GRU		91.51%
PKEN gesture extraction + LSTM	Zhang et al. [7]	90.18%
3D convolutional neural network	Ji et al. [26]	81.31%
Convolutional LSTM	Xing et al. [27]	80.04%
ResNet pose extraction + LSTM	He et al. [28]	87.51%
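The parameter argument made in the comparative analysis above can be checked with simple arithmetic. The sketch assumes a single recurrent layer with one bias vector per weight set, an input of 28 values (14 joints with two coordinates each) and a made-up hidden size of 64; none of these sizes come from the source, and the function name `rnn_params` is hypothetical.

```python
def rnn_params(n_in, n_hid, n_sets):
    """Weights and biases of one recurrent layer with `n_sets` weight sets."""
    return n_sets * (n_hid * n_in + n_hid * n_hid + n_hid)


n_in, n_hid = 28, 64  # 14 joints x 2 coordinates; hidden size is an assumed value
gru = rnn_params(n_in, n_hid, 3)   # update gate, reset gate, candidate state
lstm = rnn_params(n_in, n_hid, 4)  # input, forget and output gates, cell candidate
print(gru, lstm)  # 17856 23808
```

Under these assumptions the GRU layer carries three quarters of the parameters of an LSTM layer of the same width, which is consistent with the faster training reported above.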

Embodiment 2 of the present invention provides a traffic police gesture recognition system based on skeleton recognition and a gated recurrent network, as shown in FIG. 5, including a data acquisition module, a feature extraction module, a screening module and a gesture recognition module; wherein,

the data acquisition module is used for acquiring a traffic police command gesture image;

the feature extraction module is used for extracting joint points of the traffic police command gesture image by utilizing the OpenPose algorithm to obtain joint point data;

the screening module is used for screening key point data which has large influence on the gesture recognition of the traffic police in the joint point data and taking the key point data as a training set;

and the gesture recognition module is used for inputting the training set into the GRU network to extract the time sequence characteristics of the traffic police command gestures.

The system also comprises a data preprocessing module which is used for carrying out time delay processing in the process of carrying out data marking on the traffic police command gesture image.

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A traffic police gesture recognition method based on skeleton recognition and a gated recurrent network, characterized by comprising the following specific steps:

acquiring a traffic police command gesture image;

2. The traffic police gesture recognition method based on skeleton recognition and a gated recurrent network according to claim 1, wherein the step of acquiring the traffic police command gesture image is as follows: collecting video clips of traffic police command gestures at different viewing angles, in different scenes and at different speeds; and dividing the video clips into continuous image sequences to obtain the traffic police command gesture image.

3. The traffic police gesture recognition method based on skeleton recognition and a gated recurrent network according to claim 1, wherein the step of performing joint point extraction on the traffic police command gesture image by using the OpenPose algorithm comprises: extracting image features from the traffic police command gesture image through the backbone network VGG19 to obtain first features; inputting the first features into the stage modules, where the stages are connected in series and every stage module has the same structure and function; each stage module is divided into two branches, the first branch generating part affinity fields (PAF) and computing a loss, and the second branch generating part confidence maps (PCM) and computing a loss; and adding the losses of the first branch and the second branch to obtain the final loss.

4. The traffic police gesture recognition method based on skeleton recognition and a gated recurrent network according to claim 1, further comprising delaying the time of the valid gesture label in the process of labeling the traffic police command gesture image.

5. A traffic police gesture recognition system based on skeleton recognition and a gated recurrent network, characterized by comprising a data acquisition module, a feature extraction module, a screening module and a gesture recognition module; wherein,

6. The system of claim 5, further comprising a data preprocessing module configured to perform a time delay process during the data labeling process on the traffic police command gesture image.