CN117274960A - Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver - Google Patents

Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Info

Publication number
CN117274960A
CN117274960A (application CN202310953085.4A)
Authority
CN
China
Prior art keywords
gesture
driving
formula
driver
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310953085.4A
Other languages
Chinese (zh)
Inventor
马艳丽
徐小鹏
郭蓥蓥
张议文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310953085.4A
Publication of CN117274960A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 - Drive control systems specially adapted for autonomous road vehicles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 - Input parameters relating to occupants
    • B60W2540/223 - Posture, e.g. hand, foot, or seat position, turned or inclined

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a non-driving gesture recognition method for the driver of an L3-level automatic driving vehicle, which comprises the following steps: step one, monitoring and collecting non-driving gesture video data of the driver in real time; step two, extracting local feature data of the non-driving gesture; step three, classifying the non-driving gesture global features; and step four, identifying the non-driving gesture global features. Under L3-level automatic driving conditions, the invention recognizes the non-driving gesture of the driver: a head gesture recognition algorithm, a gaze direction estimation algorithm, the EAR algorithm and the OpenPose algorithm are used to recognize the head, eye and hand gestures, the local gesture features are quantized, and recognition and classification of the non-driving gesture are realized, which improves the safety and reliability of the automatic driving vehicle. The OpenPose algorithm offers good real-time performance and high precision, and the recognition accuracy of the model constructed by the invention can reach 91.5%.

Description

Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver
Technical Field
The invention belongs to the field of traffic safety, and particularly relates to a non-driving gesture recognition method and system for a driver of an L3-level automatic driving vehicle.
Background
The rapid development of automatic driving technology effectively relieves the driver's workload. In L3-level automatic driving, the driver may perform non-driving related tasks while the vehicle is running and is therefore in various non-driving gestures. However, this also affects the driver's ability to take over the automatic driving vehicle, so it is necessary to recognize the driver's non-driving gesture to improve the safety and reliability of the automatic driving vehicle.
At present, research on driver behavior recognition mainly focuses on recognizing the driving state of drivers of non-automatic vehicles, yet the gesture of a driver during automatic driving differs greatly from that of a driver of a non-automatic vehicle. Existing driver gesture recognition mostly adopts methods such as reinforcement learning, deep learning and graph convolutional neural networks, which achieve high precision but have weak real-time performance.
Disclosure of Invention
The invention aims to solve the problems and further provides a non-driving gesture recognition method and system for a driver of an L3-level automatic driving vehicle.
The technical scheme adopted by the invention is as follows: a non-driving gesture recognition method of an L3-level automatic driving vehicle driver comprises the following steps:
step one, monitoring and collecting non-driving gesture video data of a driver in real time;
step two, extracting local characteristic data of the non-driving gesture;
step three, classifying the non-driving gesture global features;
and step four, identifying the global features of the non-driving gestures.
Further, in step one, the non-driving gesture of the driver is recorded in real time: the upper-body gesture of the driver is recorded, and the distance from the driver's feet to the pedals while the vehicle is running is recorded.
In the second step, non-driving gesture local feature extraction is performed according to the acquired data, and the head gesture extraction method is as follows:
firstly, detecting face key points, and selecting the face key points as a research object;
the relationship between the image coordinate system and the world coordinate system is shown in the formula (1):
where R is the rotation matrix, T is the translation matrix, (X, Y, Z) is a point in the world coordinate system, (U, V, W) is a point in the image coordinate system, and s is the depth, i.e., the value of the target point in the Z direction of the camera coordinate system;
the conversion of the camera coordinate system to the image center coordinate system is shown in formula (2):
wherein (X, Y, Z) is a point in the camera coordinate system and (u, v) is a point in the image coordinate system;
the conversion of the image center coordinate system to the image coordinate system is shown in formula (3):
wherein, (x, y) is a point in the image center coordinate system and (u, v) is a point in the image coordinate system;
a 3D face model is fitted using a 3D Morphable Model, the rotation matrix is obtained through OpenCV, and the corresponding rotation angles are solved with the Rodrigues rotation formula; the rotation angle about the Y axis is α, the rotation angle about the Z axis is β, and the rotation angle about the X axis is γ;
head gestures in the range (α: -3 to 0, β: -3 to 1, γ: -1 to 1) are classified as facing straight ahead;
head gestures in the range (α: -10 to -5, β: -15 to -5, γ: 8 to 15) are classified as facing the lower right;
head gestures in the range (α: 5 to 15, β: 10 to 25, γ: 2 to 11) are classified as facing the left.
In the second step, non-driving gesture local feature extraction is performed according to the acquired data, and the eye feature extraction method is as follows:
First, the camera origin is connected with the pupil center to obtain the intersection point of this line with the eyeball sphere; the equation of the line is shown in formula (4):
where the camera origin is O_c, the pupil center is T, the eyeball center is E, the eyeball radius is R, and the fovea is point P; the eye feature points can be obtained from a 3D model;
the constraint equation of the fovea P is shown in formula (5):
(X - X_E)^2 + (Y - Y_E)^2 + (Z - Z_E)^2 = R^2    (5)
the ray emitted from the fovea through the pupil center is the estimated direction of the visual axis;
detecting the opening and closing degree of eyes, and calculating the height-width ratio of the detected eye key nodes;
the calculation formula of the eye height-width ratio is shown in formula (6):
an eye whose EAR corresponds to an opening degree of less than 80% is considered closed; the blink frequency is computed as shown in formula (7):
where f represents the blink frequency, F_close represents the number of closed-eye (blink) frames per unit time, and F is the total number of frames in that unit time;
the eye aspect ratio threshold for determining eye closure is calculated as shown in formula (8):
EAR_close = (EAR_max - EAR_min) × (1 - x) + EAR_min    (8)
where EAR_close is the eye aspect ratio threshold for determining eye closure, EAR_max is the maximum opening degree, EAR_min is the minimum opening degree, and x is the eye opening degree;
a blink is detected from the EAR values when the EAR stays below 0.18 for 3 consecutive frames.
In the second step, non-driving gesture local feature extraction is performed according to the acquired data, and the hand gesture extraction method is as follows:
In OpenPose, the original image is used as the feature map F input to the first stage of the two-branch network; the first branch outputs the key point confidence maps S^1 = ρ^1(F), and the second branch outputs the joint vector field set (part affinity fields) L^1 = φ^1(F); the outputs are shown in formulas (9) and (10):
where ρ^1 and φ^1 denote the CNN of the first stage, t is the stage number, the output key point confidence map S contains J confidence maps representing J key points, and the joint vector field set L contains C vector fields representing C limbs;
the two loss functions at the t-th stage are shown in the formula (11) and the formula (12):
where S*_j and L*_c denote the ground-truth key point confidence maps and the ground-truth key point connection vector fields of the two branches, respectively, and W(p) is a Boolean value that is 0 when position p is not annotated in the image and 1 otherwise; the loss function f of the whole model is shown in formula (13):
for each pixel p in the image, its ground-truth confidence S*_{j,k}(p) is given by formula (14):
where x_{j,k} is the true position of the j-th key point of the k-th person, and σ is a model parameter that sets the peak range of the confidence; when confidence peaks overlap or intersect at pixel p, the maximum key point confidence value is taken, as shown in formula (15):
where S*_j(p) represents the ground-truth confidence of the image, with dimensions W × H × J, where W and H are the width and height of the input image;
the correlation between key points d_{j2} and d_{j1} is shown in formula (16) and formula (17):
p(u) = (1 - u)·d_{j1} + u·d_{j2}    (17)
where p(u) represents a point sampled between key points d_{j1} and d_{j2}, L_c(p(u)) represents the PAF value of limb c at p(u), and u represents the interpolation factor; the smaller the angle between L_c(p(u)) and the unit limb vector, the greater the correlation;
finally, a correlation set containing the key point confidence maps, the affinity fields and the key points is obtained; the final joint connection problem is treated as a bipartite graph matching problem, the Hungarian algorithm is used to complete the matching, and a pose estimation graph is finally formed;
a projected Euclidean distance is constructed and calculated from the human body key point coordinates identified by OpenPose, as shown in formula (18):
where L_i (i = 1, 2) is the projected Euclidean distance between key points, and (x_j, y_j) are the coordinates of the key points;
if the right-hand coordinates are not detected, the right hand is considered to be at the lower right;
if the calculated wrist-to-neck distance is within the range (500, 700), the hand is considered to be in front of the body;
if the calculated wrist-to-neck distance is within the range (200, 300), the hand is considered to be at the right or left side of the head.
Further, in step three, global feature classification is performed on the non-driving gesture, and the non-driving gesture type is finally determined. The non-driving gestures are classified into classes a to j according to the combination of the value ranges of the head gesture (α, β, γ), the eye gaze direction (straight ahead, left, lower right), the value ranges of the left- and right-hand gestures (wrist-to-neck distance), and the right-foot gesture (on the pedal, in front of the pedal).
Further, in step four, the skeleton structure diagram of the driver is extracted with OpenPose, the images are processed frame by frame with a graph convolutional neural network to extract the spatial features of the non-driving gesture, the spatial features are input into the LSTM network at each time step, and finally an attention mechanism is added to perform weighted fusion of the LSTM outputs automatically, yielding the final non-driving gesture features; a fully connected layer with a softmax function is used as the classifier to give the corresponding non-driving gesture class.
Further, in step four, the skeleton structure diagram obtained by the OpenPose algorithm is fed into the graph convolutional neural network using parallel operation logic to obtain the spatial structure feature output v_t of the human skeleton, and the spatial feature sequence V = (v_1, v_2, ..., v_t) is obtained in turn, where t is the sequence length;
the output V of the graph convolutional neural network is used as the input of the LSTM, and the non-driving gesture features H = (h_1, h_2, h_3, ..., h_t) at each time step are calculated and output through a single-layer LSTM network;
(1) The forget gate determines how much of the memory cell C_{t-1} at the previous time step can be retained in the current memory cell C_t; the calculation formula is shown in formula (19):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (19)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes concatenating the two vectors end to end into one vector, b_f is the bias term of the forget gate, and σ denotes the sigmoid function;
(2) The input gate determines how much of the network input x_t at the current time step can be stored in the current memory cell C_t; the calculation formulas are shown in formula (20) and formula (21):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (20)
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)    (21)
where i_t ∈ (0, 1) is the output of the input gate, indicating how much information should be written to the memory cell at the current time step; W_i is the weight matrix of the input gate, σ is the sigmoid activation function, W_c is the weight matrix of the other part of the input gate, b_i is the bias term of the input gate, b_c is the bias term of the other part of the input gate, and C_t' is the candidate memory cell state at time t;
the memory cell C_t at the current time step is calculated as shown in formula (22):
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C_t'    (22)
where f_t is the output of the forget gate and ⊙ denotes element-wise multiplication;
(3) The output gate determines how much of the current memory cell C_t can be output to the current hidden state h_t; the calculation formulas are shown in formula (23) and formula (24):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (23)
h_t = o_t ⊙ tanh(C_t)    (24)
where o_t is the output of the output gate, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the hidden state at time t;
an attention mechanism is added at the end of the LSTM, and its calculation formula is shown in formula (25):
where H is the output sequence of the LSTM structure and r is a learnable weight matrix;
the softmax function converts the LSTM outputs into a weight for each time step, the weights are multiplied with the LSTM outputs to obtain the non-driving gesture spatial features output by the network, and the recognition class is obtained through the final fully connected layer and the softmax function.
The invention also relates to a system of the non-driving gesture recognition method of the driver of the L3-level automatic driving vehicle, which comprises an information acquisition device, a non-driving gesture local feature extraction device and a driver non-driving gesture global feature classification and recognition device.
Further, the information acquisition device comprises a wireless transmitter and two cameras;
the non-driving gesture local feature extraction device comprises a wireless receiver, a skeleton recognition module, a head gesture feature extraction module, an eye gesture feature extraction module, a hand gesture feature extraction module, a foot gesture feature extraction module and a wireless transmitter; the wireless receiver is used for receiving the non-driving gesture sent by the information acquisition module, and the wireless transmitter is used for sending the extracted local features of the non-driving gesture to the non-driving gesture global feature classification and recognition device;
the non-driving gesture global feature classification and identification device comprises a wireless receiver, a non-driving gesture spatial feature extraction module and a non-driving gesture classification module. The wireless receiver is used for receiving the non-driving gesture local features sent by the non-driving gesture local feature extraction device, and the non-driving gesture spatial feature extraction module is used for identifying the types of the non-driving gestures by utilizing the LSTM network and the attention mechanism.
Drawings
FIG. 1 is a schematic diagram of a system architecture of the present invention;
FIG. 2 is a schematic diagram of a non-driving gesture recognition method according to the present invention;
fig. 3 is a schematic diagram of a gesture recognition network structure according to the present invention.
Advantageous effects
Under L3-level automatic driving conditions, the invention recognizes the non-driving gesture of the driver: a head gesture recognition algorithm, a gaze direction estimation algorithm, the EAR algorithm and the OpenPose algorithm are used to recognize the head, eye and hand gestures, the local gesture features are quantized, and the non-driving gesture is recognized and classified, which improves the safety and reliability of the automatic driving vehicle. The OpenPose algorithm offers good real-time performance and high precision, and the recognition accuracy achieved with the method can reach 91.5%.
Detailed Description
The present embodiment will be described below with reference to fig. 1 to 3.
The invention relates to a non-driving gesture recognition method for the driver of an L3-level automatic driving vehicle, which comprises the following steps:
step one, monitoring and collecting non-driving gesture video data of a driver in real time;
the non-driving gesture of the driver is recorded in real time through the two cameras. A camera is arranged at a main driving position light shielding plate of the automatic driving vehicle and is used for recording the upper body posture of a driver; the other camera is arranged on the right side of the pedal plate and used for recording the distance between the feet of the driver and the pedal plate during the running process of the vehicle.
Step two, extracting local characteristic data of the non-driving gesture;
1. according to the acquired data, non-driving gesture local feature extraction is carried out, and the head gesture extraction method comprises the following steps:
firstly, detecting face key points, and selecting the face key points as a research object;
the relationship between the image coordinate system and the world coordinate system is shown in the formula (1):
where R is the rotation matrix, T is the translation matrix, (X, Y, Z) is a point in the world coordinate system, (U, V, W) is a point in the image coordinate system, and s is the depth, i.e., the value of the target point in the Z direction of the camera coordinate system;
the conversion of the camera coordinate system to the image center coordinate system is shown in formula (2):
wherein (X, Y, Z) is a point in the camera coordinate system and (u, v) is a point in the image coordinate system;
the conversion of the image center coordinate system to the image coordinate system is shown in formula (3):
wherein, (x, y) is a point in the image center coordinate system and (u, v) is a point in the image coordinate system;
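For reference, a minimal Python sketch of the coordinate transformations referenced by formulas (1) to (3) is given below; since the formula images are not reproduced in this text, a standard pinhole camera model is assumed, and the intrinsic parameters and the example point are illustrative only.

# Sketch of the world -> camera -> image transforms (cf. formulas (1)-(3)).
# A standard pinhole model is assumed; all numeric values are illustrative.
import numpy as np

def world_to_pixel(X_w, R, T, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (u, v)."""
    X_c = R @ X_w + T              # world -> camera coordinates (cf. formula (1))
    s = X_c[2]                     # depth: Z value in the camera frame
    x, y = X_c[0] / s, X_c[1] / s  # camera -> normalized image-center coords (cf. formula (2))
    u = fx * x + cx                # image-center -> pixel coordinates (cf. formula (3))
    v = fy * y + cy
    return u, v

R = np.eye(3)                      # assumed head-on pose for illustration
T = np.array([0.0, 0.0, 1.2])      # camera 1.2 m in front of the face (assumed)
print(world_to_pixel(np.array([0.05, 0.02, 0.0]), R, T, 800, 800, 320, 240))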
a 3D face model is fitted using a 3D Morphable Model, the rotation matrix is obtained through OpenCV, and the corresponding rotation angles are solved with the Rodrigues rotation formula; the rotation angle about the Y axis is α, the rotation angle about the Z axis is β, and the rotation angle about the X axis is γ;
head gestures in the range (α: -3 to 0, β: -3 to 1, γ: -1 to 1) are classified as facing straight ahead;
head gestures in the range (α: -10 to -5, β: -15 to -5, γ: 8 to 15) are classified as facing the lower right;
head gestures in the range (α: 5 to 15, β: 10 to 25, γ: 2 to 11) are classified as facing the left.
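For reference, a minimal Python sketch of the head-gesture step is given below; it uses OpenCV's solvePnP and Rodrigues functions as described above, while the 3D model points, camera matrix, Euler-angle convention and the classification thresholds are treated as illustrative assumptions.

# Sketch of head-pose estimation: PnP -> Rodrigues -> Euler angles -> class.
import cv2
import numpy as np

def head_pose_angles(model_points_3d, image_points_2d, camera_matrix):
    dist_coeffs = np.zeros((4, 1))                       # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(model_points_3d, image_points_2d,
                                  camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)                           # rotation vector -> matrix
    # Recover angles about Y (alpha), Z (beta) and X (gamma) from a ZYX decomposition.
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    gamma = np.degrees(np.arctan2(R[2, 1], R[2, 2]))     # about X
    alpha = np.degrees(np.arctan2(-R[2, 0], sy))         # about Y
    beta = np.degrees(np.arctan2(R[1, 0], R[0, 0]))      # about Z
    return alpha, beta, gamma

def classify_head_pose(alpha, beta, gamma):
    # Illustrative thresholds following the ranges listed above.
    if -10 <= alpha <= -5 and -15 <= beta <= -5 and 8 <= gamma <= 15:
        return "lower right"
    if 5 <= alpha <= 15 and 10 <= beta <= 25 and 2 <= gamma <= 11:
        return "left"
    return "straight ahead"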
2. The method for extracting the eye features comprises the following steps:
First, the camera origin is connected with the pupil center to obtain the intersection point of this line with the eyeball sphere; the equation of the line is shown in formula (4):
where the camera origin is O_c, the pupil center is T, the eyeball center is E, the eyeball radius is R, and the fovea is point P; the eye feature points can be obtained from a 3D model;
the constraint equation of the fovea P is shown in formula (5):
(X - X_E)^2 + (Y - Y_E)^2 + (Z - Z_E)^2 = R^2    (5)
the ray emitted from the fovea through the pupil center is the estimated direction of the visual axis;
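For reference, a minimal Python sketch of the visual-axis estimation of formulas (4) and (5) is given below; it intersects the ray from the camera origin through the pupil center with the eyeball sphere. The coordinate values, the eyeball radius and the way the fovea is approximated are illustrative assumptions.

# Sketch: ray-sphere intersection and visual-axis direction (cf. formulas (4)-(5)).
import numpy as np

def ray_sphere_intersection(origin, direction, center, radius):
    d = direction / np.linalg.norm(direction)
    oc = origin - center
    b = 2.0 * np.dot(d, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0:
        return None                      # the ray misses the eyeball sphere
    t = (-b - np.sqrt(disc)) / 2.0       # nearer intersection point
    return origin + t * d

O_c = np.zeros(3)                        # camera origin
T = np.array([0.02, 0.01, 0.55])         # pupil center from the 3D model (assumed)
E = np.array([0.02, 0.01, 0.567])        # eyeball center (assumed)
R = 0.012                                # eyeball radius in meters (assumed)
hit = ray_sphere_intersection(O_c, T - O_c, E, R)
P = E + (E - hit) if hit is not None else None       # fovea approximated opposite the hit point
if P is not None:
    visual_axis = (T - P) / np.linalg.norm(T - P)     # ray from fovea through pupil center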
detecting the opening and closing degree of eyes, and calculating the height-width ratio of the detected eye key nodes;
the calculation formula of the eye height-width ratio is shown in formula (6):
an eye whose EAR corresponds to an opening degree of less than 80% is considered closed; the blink frequency is computed as shown in formula (7):
where f represents the blink frequency, F_close represents the number of closed-eye (blink) frames per unit time, and F is the total number of frames in that unit time;
the eye aspect ratio threshold for determining eye closure is calculated as shown in formula (8):
EAR_close = (EAR_max - EAR_min) × (1 - x) + EAR_min    (8)
where EAR_close is the eye aspect ratio threshold for determining eye closure, EAR_max is the maximum opening degree, EAR_min is the minimum opening degree, and x is the eye opening degree;
a blink is detected from the EAR values when the EAR stays below 0.18 for 3 consecutive frames.
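For reference, a minimal Python sketch of the EAR computation and blink rule is given below. The six-landmark EAR layout is an assumption (the image of formula (6) is not reproduced in this text), formula (7) is assumed to take the form f = F_close / F from the definitions above, and the 0.18 threshold with the 3-consecutive-frame rule follows the description.

# Sketch of EAR, blink counting and blink frequency.
import numpy as np

def eye_aspect_ratio(eye):                     # eye: 6 (x, y) landmarks p1..p6 (assumed layout)
    p1, p2, p3, p4, p5, p6 = [np.asarray(p, float) for p in eye]
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = 2.0 * np.linalg.norm(p1 - p4)
    return vertical / horizontal

def count_blinks(ear_sequence, threshold=0.18, min_frames=3):
    blinks, run = 0, 0
    for ear in ear_sequence:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:              # 3 consecutive frames below 0.18 = one blink
                blinks += 1
            run = 0
    if run >= min_frames:
        blinks += 1
    return blinks

def blink_frequency(closed_frames, total_frames):
    return closed_frames / total_frames        # assumed form of formula (7): f = F_close / F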
3. The hand gesture extraction method comprises the following steps:
In OpenPose, the original image is used as the feature map F input to the first stage of the two-branch network; the first branch outputs the key point confidence maps S^1 = ρ^1(F), and the second branch outputs the joint vector field set (part affinity fields) L^1 = φ^1(F); the outputs are shown in formulas (9) and (10):
where ρ^1 and φ^1 denote the CNN of the first stage, t is the stage number, the output key point confidence map S contains J confidence maps representing J key points, and the joint vector field set L contains C vector fields representing C limbs;
the two loss functions at the t-th stage are shown in the formula (11) and the formula (12):
where S*_j and L*_c denote the ground-truth key point confidence maps and the ground-truth key point connection vector fields of the two branches, respectively, and W(p) is a Boolean value that is 0 when position p is not annotated in the image and 1 otherwise; the loss function f of the whole model is shown in formula (13):
for each pixel p in the image, its ground-truth confidence S*_{j,k}(p) is given by formula (14):
where x_{j,k} is the true position of the j-th key point of the k-th person, and σ is a model parameter that sets the peak range of the confidence; when confidence peaks overlap or intersect at pixel p, the maximum key point confidence value is taken, as shown in formula (15):
where S*_j(p) represents the ground-truth confidence of the image, with dimensions W × H × J, where W and H are the width and height of the input image;
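For reference, a minimal Python sketch of the ground-truth confidence maps of formulas (14) and (15) is given below: a Gaussian peak is placed at each annotated key point and the maps are merged with a pixel-wise maximum. The array layout and the σ value are illustrative assumptions.

# Sketch of ground-truth key point confidence maps (cf. formulas (14)-(15)).
import numpy as np

def confidence_maps(keypoints, width, height, sigma=7.0):
    """keypoints: list over people; each entry is a length-J list of (x, y) or None."""
    J = len(keypoints[0])
    ys, xs = np.mgrid[0:height, 0:width]
    S = np.zeros((J, height, width), dtype=np.float32)
    for person in keypoints:
        for j, xy in enumerate(person):
            if xy is None:                      # unlabeled point: contributes nothing
                continue
            x, y = xy
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (sigma ** 2))  # Gaussian peak
            S[j] = np.maximum(S[j], g)          # pixel-wise maximum across people
    return S                                    # J maps of size H x W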
the correlation between key points d_{j2} and d_{j1} is shown in formula (16) and formula (17):
p(u) = (1 - u)·d_{j1} + u·d_{j2}    (17)
where p(u) represents a point sampled between key points d_{j1} and d_{j2}, L_c(p(u)) represents the PAF value of limb c at p(u), and u represents the interpolation factor; the smaller the angle between L_c(p(u)) and the unit limb vector, the greater the correlation;
finally, a correlation set containing the key point confidence maps, the affinity fields and the key points is obtained; the final joint connection problem is treated as a bipartite graph matching problem, the Hungarian algorithm is used to complete the matching, and a pose estimation graph is finally formed;
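For reference, a minimal Python sketch of the limb-association score of formulas (16) and (17) is given below; it samples points p(u) between two candidate key points and accumulates the dot product between the PAF value at each sample and the unit limb vector. The PAF array layout (2, H, W) and the number of samples are illustrative assumptions.

# Sketch of the PAF-based association score between two candidate key points.
import numpy as np

def association_score(paf_c, d_j1, d_j2, num_samples=10):
    d_j1, d_j2 = np.asarray(d_j1, float), np.asarray(d_j2, float)
    limb = d_j2 - d_j1
    norm = np.linalg.norm(limb)
    if norm < 1e-6:
        return 0.0
    unit = limb / norm                          # unit limb vector
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        p = (1.0 - u) * d_j1 + u * d_j2         # formula (17): p(u)
        x = int(np.clip(round(p[0]), 0, paf_c.shape[2] - 1))
        y = int(np.clip(round(p[1]), 0, paf_c.shape[1] - 1))
        L_c = paf_c[:, y, x]                    # PAF value of limb c at p(u)
        score += float(np.dot(L_c, unit))       # small angle -> large contribution
    return score / num_samples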
a projected Euclidean distance is constructed and calculated from the human body key point coordinates identified by OpenPose, as shown in formula (18):
where L_i (i = 1, 2) is the projected Euclidean distance between key points, and (x_j, y_j) are the coordinates of the key points;
if the right-hand coordinates are not detected, the right hand is considered to be at the lower right;
if the calculated wrist-to-neck distance is within the range (500, 700), the hand is considered to be in front of the body;
if the calculated wrist-to-neck distance is within the range (200, 300), the hand is considered to be at the right or left side of the head.
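For reference, a minimal Python sketch of the hand-gesture rule above is given below; the pixel thresholds follow the text, while the OpenPose key point index numbers are illustrative assumptions.

# Sketch of the wrist-to-neck distance rule for hand-gesture classification.
import numpy as np

NECK, R_WRIST, L_WRIST = 1, 4, 7                 # assumed OpenPose keypoint indices

def projected_distance(a, b):                    # 2D Euclidean distance (cf. formula (18))
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def classify_hand(keypoints):
    neck = keypoints.get(NECK)
    wrist = keypoints.get(R_WRIST)
    if wrist is None:                            # right hand not detected
        return "lower right"
    d = projected_distance(wrist, neck)
    if 500 <= d <= 700:
        return "in front of the body"
    if 200 <= d <= 300:
        return "beside the head"
    return "other"

print(classify_hand({NECK: (320, 240), R_WRIST: (320, 840)}))  # -> "in front of the body"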
4. The foot gesture extraction method is as follows: the right foot gesture feature is divided into on-pedal and in front of pedal.
Step three, classifying the non-driving gesture global features and finally determining the non-driving gesture type, as shown in Table 1.
TABLE 1 non-driving gesture feature classification
And step four, identifying the global features of the non-driving gestures.
A skeleton structure diagram of the driver is extracted with OpenPose, the images are processed frame by frame with a graph convolutional neural network to extract the spatial features of the non-driving gesture, the spatial features are input into the LSTM network at each time step, and finally an attention mechanism is added to perform weighted fusion of the LSTM outputs automatically, yielding the final non-driving gesture features; a fully connected layer with a softmax function is used as the classifier to give the corresponding non-driving gesture class.
The skeleton structure diagram obtained by the OpenPose algorithm is fed into the graph convolutional neural network using parallel operation logic to obtain the spatial structure feature output v_t of the human skeleton, and the spatial feature sequence V = (v_1, v_2, ..., v_t) is obtained in turn, where t is the sequence length;
the output V of the graph convolutional neural network is used as the input of the LSTM, and the non-driving gesture features H = (h_1, h_2, h_3, ..., h_t) at each time step are calculated and output through a single-layer LSTM network;
(1) The forget gate determines how much of the memory cell C_{t-1} at the previous time step can be retained in the current memory cell C_t; the calculation formula is shown in formula (19):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (19)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes concatenating the two vectors end to end into one vector, b_f is the bias term of the forget gate, and σ denotes the sigmoid function;
(2) The input gate determines how much of the network input x_t at the current time step can be stored in the current memory cell C_t; the calculation formulas are shown in formula (20) and formula (21):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (20)
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)    (21)
where i_t ∈ (0, 1) is the output of the input gate, indicating how much information should be written to the memory cell at the current time step; W_i is the weight matrix of the input gate, σ is the sigmoid activation function, W_c is the weight matrix of the other part of the input gate, b_i is the bias term of the input gate, b_c is the bias term of the other part of the input gate, and C_t' is the candidate memory cell state at time t;
the memory cell C_t at the current time step is calculated as shown in formula (22):
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C_t'    (22)
where f_t is the output of the forget gate and ⊙ denotes element-wise multiplication;
(3) The output gate determines how much of the current memory cell C_t can be output to the current hidden state h_t; the calculation formulas are shown in formula (23) and formula (24):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (23)
h_t = o_t ⊙ tanh(C_t)    (24)
where o_t is the output of the output gate, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the hidden state at time t;
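For reference, a minimal Python (NumPy) sketch of one LSTM step implementing formulas (19) to (24) is given below; the layer sizes and random weights are illustrative only.

# Sketch of a single LSTM cell step (formulas (19)-(24)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # (19) forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # (20) input gate
    C_tilde = np.tanh(W_c @ z + b_c)             # (21) candidate memory C_t'
    C_t = f_t * C_prev + i_t * C_tilde           # (22) memory cell update
    o_t = sigmoid(W_o @ z + b_o)                 # (23) output gate
    h_t = o_t * np.tanh(C_t)                     # (24) hidden state
    return h_t, C_t

hidden, inp = 64, 32                             # illustrative sizes
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.1, size=(hidden, hidden + inp))
b = lambda: np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden),
                 W(), b(), W(), b(), W(), b(), W(), b())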
An attention mechanism is added at the end of the LSTM, and its calculation formula is shown in formula (25):
where H is the output sequence of the LSTM structure and r is a learnable weight matrix;
the softmax function converts the LSTM outputs into a weight for each time step, the weights are multiplied with the LSTM outputs to obtain the non-driving gesture spatial features output by the network, and the recognition class is obtained through the final fully connected layer and the softmax function.
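For reference, a condensed PyTorch sketch of the recognition head described above is given below: per-frame graph-convolution features are fed to a single-layer LSTM, a softmax attention over time performs the weighted fusion, and a fully connected layer outputs the class scores. The layer sizes and the assumption of 10 output classes (classes a to j) are illustrative; the exact form of formula (25) is not reproduced in this text.

# Sketch of the LSTM + attention + fully connected recognition head.
import torch
import torch.nn as nn

class NonDrivingGestureHead(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # single-layer LSTM
        self.attn = nn.Linear(hidden, 1, bias=False)              # learnable weight r (assumed form)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, v_seq):                    # v_seq: (batch, T, feat_dim) GCN features
        H, _ = self.lstm(v_seq)                  # (batch, T, hidden)
        a = torch.softmax(self.attn(H), dim=1)   # per-frame attention weights
        context = (a * H).sum(dim=1)             # weighted fusion of LSTM outputs
        return self.fc(context)                  # class scores for gestures a-j

logits = NonDrivingGestureHead()(torch.randn(2, 30, 128))
print(logits.shape)                              # torch.Size([2, 10])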
Examples
In this embodiment, the driving simulator and the SCANeR studio software are used to simulate the scene and the automatic driving environment, and the camera is used to record the driver's posture.
Experimental condition settings:
the experimental scene is designed as a straight line section of the expressway with two-way eight lanes, the length of the expressway section is set to 12km, the speed limit is 120km/h, the width of each lane is 3.75m, the central separation belt is set to 1m, the width of the left side road edge belt is set to 0.75m, and the width of the right side hard road shoulder is set to 1.5m.
The weather is sunny and the traffic flow is steady. The automatic driving vehicle is set to travel in the second lane from the right at a speed of 110 km/h; when an emergency takeover scenario is encountered, the vehicle issues the voice prompt "automatic driving failure, please take over". Four secondary-task conditions are set: no secondary task, operating the central control screen (operating a tablet mounted to the right of the steering wheel), making a phone call, and drinking water.
Experimental operation:
Thirty test subjects were recruited, and 120 experiments were performed on the driving simulator both with the method of the invention and with baseline methods (support vector machine (SVM) and random forest algorithms).
Conclusion:
by comparing the SVM algorithm with the random forest algorithm, the method can improve the algorithm identification accuracy, and the accuracy is 1.2% higher than that of the SVM algorithm and 2.4% higher than that of the random forest algorithm. The invention realizes the identification and classification of the non-driving gesture, and is beneficial to improving the safety and reliability of the automatic driving vehicle.
The above description is not intended to limit the invention to the disclosed embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art; it is not possible to enumerate all embodiments here, and all such obvious variations remain within the scope of the invention.

Claims (10)

1. The non-driving gesture recognition method for the driver of the L3-level automatic driving vehicle is characterized by comprising the following steps of:
step one, monitoring and collecting non-driving gesture video data of a driver in real time;
step two, extracting local characteristic data of the non-driving gesture;
step three, classifying the non-driving gesture global features;
and step four, identifying the global features of the non-driving gestures.
2. The method for recognizing the non-driving posture of the driver of the L3-stage automatic driving vehicle according to claim 1, wherein in the first step, the non-driving posture of the driver is recorded in real time, the upper body posture of the driver is recorded, and the distance from the foot of the driver to the pedal during the running of the vehicle is recorded.
3. The method for recognizing non-driving gestures of a driver of an L3-level autonomous vehicle according to claim 1, wherein in the second step, non-driving gesture local feature extraction is performed according to the collected data, and the method for extracting the head gesture is as follows:
firstly, detecting face key points, and selecting the face key points as a research object;
the relationship between the image coordinate system and the world coordinate system is shown in the formula (1):
where R is the rotation matrix, T is the translation matrix, (X, Y, Z) is a point in the world coordinate system, (U, V, W) is a point in the image coordinate system, and s is the depth, i.e., the value of the target point in the Z direction of the camera coordinate system;
the conversion of the camera coordinate system to the image center coordinate system is shown in formula (2):
wherein (X, Y, Z) is a point in the camera coordinate system and (u, v) is a point in the image coordinate system;
the conversion of the image center coordinate system to the image coordinate system is shown in formula (3):
wherein, (x, y) is a point in the image center coordinate system and (u, v) is a point in the image coordinate system;
a 3D face model is fitted using a 3D Morphable Model, the rotation matrix is obtained through OpenCV, and the corresponding rotation angles are solved with the Rodrigues rotation formula; the rotation angle about the Y axis is α, the rotation angle about the Z axis is β, and the rotation angle about the X axis is γ;
head gestures in the range (α: -3 to 0, β: -3 to 1, γ: -1 to 1) are classified as facing straight ahead;
head gestures in the range (α: -10 to -5, β: -15 to -5, γ: 8 to 15) are classified as facing the lower right;
head gestures in the range (α: 5 to 15, β: 10 to 25, γ: 2 to 11) are classified as facing the left.
4. The method for recognizing non-driving gestures of a driver of an L3-level automatic driving vehicle according to claim 1, wherein in the second step, non-driving gesture local feature extraction is performed according to the collected data, and the method for extracting eye features is as follows:
Firstly, the camera origin is connected with the pupil center to obtain the intersection point of this line with the eyeball sphere; the equation of the line is shown in formula (4):
where the camera origin is O_c, the pupil center is T, the eyeball center is E, the eyeball radius is R, and the fovea is point P; the eye feature points can be obtained from a 3D model;
the constraint equation of the fovea P is shown in formula (5):
(X - X_E)^2 + (Y - Y_E)^2 + (Z - Z_E)^2 = R^2    (5)
the ray emitted from the fovea through the pupil center is the estimated direction of the visual axis;
detecting the opening and closing degree of eyes, and calculating the height-width ratio of the detected eye key nodes;
the calculation formula of the eye height-width ratio is shown in formula (6):
an eye whose EAR corresponds to an opening degree of less than 80% is considered closed; the blink frequency is computed as shown in formula (7):
where f represents the blink frequency, F_close represents the number of closed-eye (blink) frames per unit time, and F is the total number of frames in that unit time;
the eye aspect ratio threshold for determining eye closure is calculated as shown in formula (8):
EAR_close = (EAR_max - EAR_min) × (1 - x) + EAR_min    (8)
where EAR_close is the eye aspect ratio threshold for determining eye closure, EAR_max is the maximum opening degree, EAR_min is the minimum opening degree, and x is the eye opening degree;
a blink is detected from the EAR values when the EAR stays below 0.18 for 3 consecutive frames.
5. The method for recognizing non-driving gestures of a driver of an L3-level automatic driving vehicle according to claim 1, wherein in the second step, non-driving gesture local feature extraction is performed according to the collected data, and the hand gesture extraction method is as follows:
In OpenPose, the original image is used as the feature map F input to the first stage of the two-branch network; the first branch outputs the key point confidence maps S^1 = ρ^1(F), and the second branch outputs the joint vector field set (part affinity fields) L^1 = φ^1(F); the outputs are shown in formulas (9) and (10):
where ρ^1 and φ^1 denote the CNN of the first stage, t is the stage number, the output key point confidence map S contains J confidence maps representing J key points, and the joint vector field set L contains C vector fields representing C limbs;
the two loss functions at the t-th stage are shown in the formula (11) and the formula (12):
where S*_j and L*_c denote the ground-truth key point confidence maps and the ground-truth key point connection vector fields of the two branches, respectively, and W(p) is a Boolean value that is 0 when position p is not annotated in the image and 1 otherwise; the loss function f of the whole model is shown in formula (13):
for each pixel p in the image, its ground-truth confidence S*_{j,k}(p) is given by formula (14):
where x_{j,k} is the true position of the j-th key point of the k-th person, and σ is a model parameter that sets the peak range of the confidence; when confidence peaks overlap or intersect at pixel p, the maximum key point confidence value is taken, as shown in formula (15):
where S*_j(p) represents the ground-truth confidence of the image, with dimensions W × H × J, where W and H are the width and height of the input image;
the correlation between key points d_{j2} and d_{j1} is shown in formula (16) and formula (17):
p(u) = (1 - u)·d_{j1} + u·d_{j2}    (17)
where p(u) represents a point sampled between key points d_{j1} and d_{j2}, L_c(p(u)) represents the PAF value of limb c at p(u), and u represents the interpolation factor; the smaller the angle between L_c(p(u)) and the unit limb vector, the greater the correlation;
finally, a correlation set containing the key point confidence maps, the affinity fields and the key points is obtained; the final joint connection problem is treated as a bipartite graph matching problem, the Hungarian algorithm is used to complete the matching, and a pose estimation graph is finally formed;
a projected Euclidean distance is constructed and calculated from the human body key point coordinates identified by OpenPose, as shown in formula (18):
where L_i (i = 1, 2) is the projected Euclidean distance between key points, and (x_j, y_j) are the coordinates of the key points;
if the right-hand coordinates are not detected, the right hand is considered to be at the lower right;
if the calculated wrist-to-neck distance is within the range (500, 700), the hand is considered to be in front of the body;
if the calculated wrist-to-neck distance is within the range (200, 300), the hand is considered to be at the right or left side of the head.
6. The method for recognizing the non-driving gesture of the driver of the L3-level automatic driving vehicle according to claim 1, wherein in the third step, the non-driving gesture is classified into a-j types according to the combination of the range of the head gesture, the eye gaze direction, the range of the left and right hand gestures, and the right foot gesture.
7. The method for recognizing the non-driving gesture of the driver of the L3-level automatic driving vehicle according to claim 1, wherein in step four, the skeleton structure diagram of the driver is extracted with OpenPose, the images are processed frame by frame with a graph convolutional neural network to extract the spatial features of the non-driving gesture, the spatial features are input into the LSTM network at each time step, and finally an attention mechanism is added to perform weighted fusion of the LSTM outputs automatically, yielding the final non-driving gesture features; a fully connected layer with a softmax function is used as the classifier to give the corresponding non-driving gesture class.
8. The method for recognizing the non-driving gesture of the driver of the L3-level automatic driving vehicle according to claim 7, wherein in step four, the skeleton structure diagram obtained by the OpenPose algorithm is fed into the graph convolutional neural network using parallel operation logic to obtain the spatial structure feature output v_t of the human skeleton, and the spatial feature sequence V = (v_1, v_2, ..., v_t) is obtained in turn, where t is the sequence length;
the output V of the graph convolutional neural network is used as the input of the LSTM, and the non-driving gesture features H = (h_1, h_2, h_3, ..., h_t) at each time step are calculated and output through a single-layer LSTM network;
(1) The forget gate determines how much of the memory cell C_{t-1} at the previous time step can be retained in the current memory cell C_t; the calculation formula is shown in formula (19):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (19)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes concatenating the two vectors end to end into one vector, b_f is the bias term of the forget gate, and σ denotes the sigmoid function;
(2) The input gate determines how much of the network input x_t at the current time step can be stored in the current memory cell C_t; the calculation formulas are shown in formula (20) and formula (21):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (20)
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)    (21)
where i_t ∈ (0, 1) is the output of the input gate, indicating how much information should be written to the memory cell at the current time step; W_i is the weight matrix of the input gate, σ is the sigmoid activation function, W_c is the weight matrix of the other part of the input gate, b_i is the bias term of the input gate, b_c is the bias term of the other part of the input gate, and C_t' is the candidate memory cell state at time t;
the memory cell C_t at the current time step is calculated as shown in formula (22):
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C_t'    (22)
where f_t is the output of the forget gate and ⊙ denotes element-wise multiplication;
(3) The output gate determines how much of the current memory cell C_t can be output to the current hidden state h_t; the calculation formulas are shown in formula (23) and formula (24):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (23)
h_t = o_t ⊙ tanh(C_t)    (24)
where o_t is the output of the output gate, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the hidden state at time t;
an attention mechanism is added at the end of the LSTM, and its calculation formula is shown in formula (25):
where H is the output sequence of the LSTM structure and r is a learnable weight matrix;
the softmax function converts the LSTM outputs into a weight for each time step, the weights are multiplied with the LSTM outputs to obtain the non-driving gesture spatial features output by the network, and the recognition class is obtained through the final fully connected layer and the softmax function.
9. A system of the L3-level automatic driving vehicle driver non-driving posture recognition method according to any one of claims 1 to 8, characterized in that the system includes an information acquisition device, a non-driving posture local feature extraction device, and a driver non-driving posture global feature classification and recognition device.
10. The L3 level autonomous vehicle driver non-driving gesture recognition system of claim 9, wherein the information acquisition device comprises a wireless transmitter and two cameras;
the non-driving gesture local feature extraction device comprises a wireless receiver, a skeleton recognition module, a head gesture feature extraction module, an eye gesture feature extraction module, a hand gesture feature extraction module, a foot gesture feature extraction module and a wireless transmitter; the wireless receiver is used for receiving the non-driving gesture sent by the information acquisition module, and the wireless transmitter is used for sending the extracted local features of the non-driving gesture to the non-driving gesture global feature classification and recognition device;
the non-driving gesture global feature classification and identification device comprises a wireless receiver, a non-driving gesture spatial feature extraction module and a non-driving gesture classification module; the wireless receiver is used for receiving the non-driving gesture local features sent by the non-driving gesture local feature extraction device, and the non-driving gesture spatial feature extraction module is used for identifying the types of the non-driving gestures by utilizing the LSTM network and the attention mechanism.
CN202310953085.4A 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver Pending CN117274960A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310953085.4A CN117274960A (en) 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310953085.4A CN117274960A (en) 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Publications (1)

Publication Number Publication Date
CN117274960A true CN117274960A (en) 2023-12-22

Family

ID=89214833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310953085.4A Pending CN117274960A (en) 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Country Status (1)

Country Link
CN (1) CN117274960A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649454A (en) * 2024-01-29 2024-03-05 北京友友天宇***技术有限公司 Binocular camera external parameter automatic correction method and device, electronic equipment and storage medium
CN117649454B (en) * 2024-01-29 2024-05-31 北京友友天宇***技术有限公司 Binocular camera external parameter automatic correction method and device, electronic equipment and storage medium

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination