CN117274960A - Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver - Google Patents

Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Info

Publication number
CN117274960A
CN117274960A (application CN202310953085.4A)
Authority
CN
China
Prior art keywords
gesture
driving
formula
driver
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310953085.4A
Other languages
Chinese (zh)
Inventor
马艳丽
徐小鹏
郭蓥蓥
张议文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310953085.4A
Publication of CN117274960A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 - Drive control systems specially adapted for autonomous road vehicles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 - Input parameters relating to occupants
    • B60W2540/223 - Posture, e.g. hand, foot, or seat position, turned or inclined

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a non-driving gesture recognition method for the driver of an L3-level automatic driving vehicle, which comprises the following steps: step one, monitoring and collecting non-driving gesture video data of the driver in real time; step two, extracting local feature data of the non-driving gesture; step three, classifying the non-driving gesture global features; and step four, identifying the non-driving gesture global features. Under L3-level automatic driving conditions, the invention recognizes the non-driving gesture of the driver: a head gesture recognition algorithm, a gaze direction estimation algorithm, the EAR algorithm and the OpenPose algorithm are used to recognize the head, eye and hand gestures, the local gesture features are quantized, and recognition and classification of the non-driving gesture are realized, which improves the safety and reliability of the automatic driving vehicle. The OpenPose algorithm offers good real-time performance and high precision, and the recognition accuracy of the model constructed by the invention can reach 91.5%.

Description

Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver
Technical Field
The invention belongs to the field of traffic safety, and particularly relates to a non-driving gesture recognition method and system for a driver of an L3-level automatic driving vehicle.
Background
The rapid development of automatic driving technology effectively relieves the driver's workload. In L3-level automatic driving, the driver may perform non-driving related tasks while the vehicle is running and is therefore in various non-driving gestures. However, this also affects the driver's ability to take over the automatic driving vehicle, so it is necessary to recognize the driver's non-driving gesture to improve the safety and reliability of the automatic driving vehicle.
At present, research on driver behavior recognition mainly focuses on recognizing the driving state of drivers of non-automatic vehicles, yet the gesture of a driver during automatic driving differs greatly from that of a driver of a non-automatic vehicle. Existing driver gesture recognition mostly adopts methods such as reinforcement learning, deep learning and graph convolutional neural networks, which achieve high precision but have weak real-time performance.
Disclosure of Invention
The invention aims to solve the problems and further provides a non-driving gesture recognition method and system for a driver of an L3-level automatic driving vehicle.
The technical scheme adopted by the invention is as follows: a non-driving gesture recognition method of an L3-level automatic driving vehicle driver comprises the following steps:
step one, monitoring and collecting non-driving gesture video data of a driver in real time;
step two, extracting local characteristic data of the non-driving gesture;
step three, classifying the non-driving gesture global features;
and step four, identifying the global features of the non-driving gestures.
Further, in step one, the non-driving gesture of the driver is recorded in real time: the upper-body gesture of the driver is recorded, and the distance from the driver's feet to the pedals while the vehicle is running is recorded.
In the second step, non-driving gesture local feature extraction is performed according to the acquired data, and the head gesture extraction method is as follows:
firstly, detecting face key points, and selecting the face key points as a research object;
the relationship between the image coordinate system and the world coordinate system is shown in the formula (1):
where R is the rotation matrix, T is the translation matrix, (X, Y, Z) is a point in the world coordinate system, (U, V, W) is a point in the image coordinate system, and s is the depth, i.e., the value of the target point in the Z direction of the camera coordinate system;
the conversion of the camera coordinate system to the image center coordinate system is shown in formula (2):
wherein (X, Y, Z) is a point in the camera coordinate system and (u, v) is a point in the image coordinate system;
the conversion of the image center coordinate system to the image coordinate system is shown in formula (3):
wherein, (x, y) is a point in the image center coordinate system and (u, v) is a point in the image coordinate system;
a 3D face model is fitted using a 3D Morphable Model, the rotation matrix is obtained through OpenCV, and the corresponding rotation angles are solved with the Rodrigues rotation formula; the rotation angle about the Y axis is α, the rotation angle about the Z axis is β, and the rotation angle about the X axis is γ;
head gestures in the range (α: -3 to 0, β: -3 to 1, γ: -1 to 1) are classified as facing straight ahead;
head gestures in the range (α: -10 to -5, β: -15 to -5, γ: 8 to 15) are classified as facing the lower right;
head gestures in the range (α: 5 to 15, β: 10 to 25, γ: 2 to 11) are classified as facing the left.
In the second step, non-driving gesture local feature extraction is performed according to the acquired data, and the eye feature extraction method is as follows:
First, the camera origin is connected with the pupil center to obtain the intersection point of this line with the eyeball sphere; the equation of the line is shown in formula (4):
where the camera origin is O_c, the pupil center is T, the eyeball center is E, the eyeball radius is R, and the fovea is point P; the eye feature points can be obtained from a 3D model;
the constraint equation of the fovea P is shown in formula (5):
(X - X_E)^2 + (Y - Y_E)^2 + (Z - Z_E)^2 = R^2    (5)
the ray emitted from the fovea through the pupil center is the estimated direction of the visual axis;
detecting the opening and closing degree of eyes, and calculating the height-width ratio of the detected eye key nodes;
the calculation formula of the eye height-width ratio is shown in formula (6):
an eye whose EAR corresponds to an opening degree of less than 80% is considered closed; the blink frequency is computed as shown in formula (7):
where f represents the blink frequency, F_close represents the number of closed-eye (blink) frames per unit time, and F is the total number of frames in that unit time;
the eye aspect ratio threshold for determining eye closure is calculated as shown in formula (8):
EAR_close = (EAR_max - EAR_min) × (1 - x) + EAR_min    (8)
where EAR_close is the eye aspect ratio threshold for determining eye closure, EAR_max is the maximum opening degree, EAR_min is the minimum opening degree, and x is the eye opening degree;
a blink is detected from the EAR values when the EAR stays below 0.18 for 3 consecutive frames.
In the second step, non-driving gesture local feature extraction is performed according to the acquired data, and the hand gesture extraction method is as follows:
In OpenPose, the original image is used as the feature map F input to the first stage of the two-branch network; the first branch outputs the key point confidence maps S^1 = ρ^1(F), and the second branch outputs the joint vector field set (part affinity fields) L^1 = φ^1(F); the outputs are shown in formulas (9) and (10):
where ρ^1 and φ^1 denote the CNN of the first stage, t is the stage number, the output key point confidence map S contains J confidence maps representing J key points, and the joint vector field set L contains C vector fields representing C limbs;
the two loss functions at the t-th stage are shown in the formula (11) and the formula (12):
where S*_j and L*_c denote the ground-truth key point confidence maps and the ground-truth key point connection vector fields of the two branches, respectively, and W(p) is a Boolean value that is 0 when position p is not annotated in the image and 1 otherwise; the loss function f of the whole model is shown in formula (13):
for each pixel p in the image, its ground-truth confidence S*_{j,k}(p) is given by formula (14):
where x_{j,k} is the true position of the j-th key point of the k-th person, and σ is a model parameter that sets the peak range of the confidence; when confidence peaks overlap or intersect at pixel p, the maximum key point confidence value is taken, as shown in formula (15):
where S*_j(p) represents the ground-truth confidence of the image, with dimensions W × H × J, where W and H are the width and height of the input image;
the correlation between key points d_{j2} and d_{j1} is shown in formula (16) and formula (17):
p(u) = (1 - u)·d_{j1} + u·d_{j2}    (17)
where p(u) represents a point sampled between key points d_{j1} and d_{j2}, L_c(p(u)) represents the PAF value of limb c at p(u), and u represents the interpolation factor; the smaller the angle between L_c(p(u)) and the unit limb vector, the greater the correlation;
finally, a correlation set containing the key point confidence maps, the affinity fields and the key points is obtained; the final joint connection problem is treated as a bipartite graph matching problem, the Hungarian algorithm is used to complete the matching, and a pose estimation graph is finally formed;
a projected Euclidean distance is constructed and calculated from the human body key point coordinates identified by OpenPose, as shown in formula (18):
where L_i (i = 1, 2) is the projected Euclidean distance between key points, and (x_j, y_j) are the coordinates of the key points;
if the right-hand coordinates are not detected, the right hand is considered to be at the lower right;
if the calculated wrist-to-neck distance is within the range (500, 700), the hand is considered to be in front of the body;
if the calculated wrist-to-neck distance is within the range (200, 300), the hand is considered to be at the right or left side of the head.
Further, in step three, global feature classification is performed on the non-driving gesture, and the non-driving gesture type is finally determined. The non-driving gestures are classified into classes a to j according to the combination of the value ranges of the head gesture (α, β, γ), the eye gaze direction (straight ahead, left, lower right), the value ranges of the left- and right-hand gestures (wrist-to-neck distance), and the right-foot gesture (on the pedal, in front of the pedal).
Further, in step four, the skeleton structure diagram of the driver is extracted with OpenPose, the images are processed frame by frame with a graph convolutional neural network to extract the spatial features of the non-driving gesture, the spatial features are input into the LSTM network at each time step, and finally an attention mechanism is added to perform weighted fusion of the LSTM outputs automatically, yielding the final non-driving gesture features; a fully connected layer with a softmax function is used as the classifier to give the corresponding non-driving gesture class.
Further, in step four, the skeleton structure diagram obtained by the OpenPose algorithm is fed into the graph convolutional neural network using parallel operation logic to obtain the spatial structure feature output v_t of the human skeleton, and the spatial feature sequence V = (v_1, v_2, ..., v_t) is obtained in turn, where t is the sequence length;
the output V of the graph convolutional neural network is used as the input of the LSTM, and the non-driving gesture features H = (h_1, h_2, h_3, ..., h_t) at each time step are calculated and output through a single-layer LSTM network;
(1) The forget gate determines how much of the memory cell C_{t-1} at the previous time step can be retained in the current memory cell C_t; the calculation formula is shown in formula (19):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (19)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes concatenating the two vectors end to end into one vector, b_f is the bias term of the forget gate, and σ denotes the sigmoid function;
(2) The input gate determines how much of the network input x_t at the current time step can be stored in the current memory cell C_t; the calculation formulas are shown in formula (20) and formula (21):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (20)
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)    (21)
where i_t ∈ (0, 1) is the output of the input gate, indicating how much information should be written to the memory cell at the current time step; W_i is the weight matrix of the input gate, σ is the sigmoid activation function, W_c is the weight matrix of the other part of the input gate, b_i is the bias term of the input gate, b_c is the bias term of the other part of the input gate, and C_t' is the candidate memory cell state at time t;
the memory cell C_t at the current time step is calculated as shown in formula (22):
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C_t'    (22)
where f_t is the output of the forget gate and ⊙ denotes element-wise multiplication;
(3) The output gate determines how much of the current memory cell C_t can be output to the current hidden state h_t; the calculation formulas are shown in formula (23) and formula (24):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (23)
h_t = o_t ⊙ tanh(C_t)    (24)
where o_t is the output of the output gate, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the hidden state at time t;
an attention mechanism is added at the end of the LSTM, and its calculation formula is shown in formula (25):
where H is the output sequence of the LSTM structure and r is a learnable weight matrix;
the softmax function converts the LSTM outputs into a weight for each time step, the weights are multiplied with the LSTM outputs to obtain the non-driving gesture spatial features output by the network, and the recognition class is obtained through the final fully connected layer and the softmax function.
The invention also relates to a system of the non-driving gesture recognition method of the driver of the L3-level automatic driving vehicle, which comprises an information acquisition device, a non-driving gesture local feature extraction device and a driver non-driving gesture global feature classification and recognition device.
Further, the information acquisition device comprises a wireless transmitter and two cameras;
the non-driving gesture local feature extraction device comprises a wireless receiver, a skeleton recognition module, a head gesture feature extraction module, an eye gesture feature extraction module, a hand gesture feature extraction module, a foot gesture feature extraction module and a wireless transmitter; the wireless receiver is used for receiving the non-driving gesture sent by the information acquisition module, and the wireless transmitter is used for sending the extracted local features of the non-driving gesture to the non-driving gesture global feature classification and recognition device;
the non-driving gesture global feature classification and identification device comprises a wireless receiver, a non-driving gesture spatial feature extraction module and a non-driving gesture classification module. The wireless receiver is used for receiving the non-driving gesture local features sent by the non-driving gesture local feature extraction device, and the non-driving gesture spatial feature extraction module is used for identifying the types of the non-driving gestures by utilizing the LSTM network and the attention mechanism.
Drawings
FIG. 1 is a schematic diagram of a system architecture of the present invention;
FIG. 2 is a schematic diagram of a non-driving gesture recognition method according to the present invention;
fig. 3 is a schematic diagram of a gesture recognition network structure according to the present invention.
Advantageous effects
Under L3-level automatic driving conditions, the invention recognizes the non-driving gesture of the driver: a head gesture recognition algorithm, a gaze direction estimation algorithm, the EAR algorithm and the OpenPose algorithm are used to recognize the head, eye and hand gestures, the local gesture features are quantized, and the non-driving gesture is recognized and classified, which improves the safety and reliability of the automatic driving vehicle. The OpenPose algorithm offers good real-time performance and high precision, and the recognition accuracy achieved with the method can reach 91.5%.
Detailed Description
The present embodiment will be described below with reference to fig. 1 to 3.
The invention relates to a non-driving gesture recognition method for the driver of an L3-level automatic driving vehicle, which comprises the following steps:
step one, monitoring and collecting non-driving gesture video data of a driver in real time;
the non-driving gesture of the driver is recorded in real time through the two cameras. A camera is arranged at a main driving position light shielding plate of the automatic driving vehicle and is used for recording the upper body posture of a driver; the other camera is arranged on the right side of the pedal plate and used for recording the distance between the feet of the driver and the pedal plate during the running process of the vehicle.
Step two, extracting local characteristic data of the non-driving gesture;
1. according to the acquired data, non-driving gesture local feature extraction is carried out, and the head gesture extraction method comprises the following steps:
firstly, detecting face key points, and selecting the face key points as a research object;
the relationship between the image coordinate system and the world coordinate system is shown in the formula (1):
where R is the rotation matrix, T is the translation matrix, (X, Y, Z) is a point in the world coordinate system, (U, V, W) is a point in the image coordinate system, and s is the depth, i.e., the value of the target point in the Z direction of the camera coordinate system;
the conversion of the camera coordinate system to the image center coordinate system is shown in formula (2):
wherein (X, Y, Z) is a point in the camera coordinate system and (u, v) is a point in the image coordinate system;
the conversion of the image center coordinate system to the image coordinate system is shown in formula (3):
wherein, (x, y) is a point in the image center coordinate system and (u, v) is a point in the image coordinate system;
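For reference, a minimal Python sketch of the coordinate transformations referenced by formulas (1) to (3) is given below; since the formula images are not reproduced in this text, a standard pinhole camera model is assumed, and the intrinsic parameters and the example point are illustrative only.

# Sketch of the world -> camera -> image transforms (cf. formulas (1)-(3)).
# A standard pinhole model is assumed; all numeric values are illustrative.
import numpy as np

def world_to_pixel(X_w, R, T, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (u, v)."""
    X_c = R @ X_w + T              # world -> camera coordinates (cf. formula (1))
    s = X_c[2]                     # depth: Z value in the camera frame
    x, y = X_c[0] / s, X_c[1] / s  # camera -> normalized image-center coords (cf. formula (2))
    u = fx * x + cx                # image-center -> pixel coordinates (cf. formula (3))
    v = fy * y + cy
    return u, v

R = np.eye(3)                      # assumed head-on pose for illustration
T = np.array([0.0, 0.0, 1.2])      # camera 1.2 m in front of the face (assumed)
print(world_to_pixel(np.array([0.05, 0.02, 0.0]), R, T, 800, 800, 320, 240))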
a 3D face model is fitted using a 3D Morphable Model, the rotation matrix is obtained through OpenCV, and the corresponding rotation angles are solved with the Rodrigues rotation formula; the rotation angle about the Y axis is α, the rotation angle about the Z axis is β, and the rotation angle about the X axis is γ;
head gestures in the range (α: -3 to 0, β: -3 to 1, γ: -1 to 1) are classified as facing straight ahead;
head gestures in the range (α: -10 to -5, β: -15 to -5, γ: 8 to 15) are classified as facing the lower right;
head gestures in the range (α: 5 to 15, β: 10 to 25, γ: 2 to 11) are classified as facing the left.
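For reference, a minimal Python sketch of the head-gesture step is given below; it uses OpenCV's solvePnP and Rodrigues functions as described above, while the 3D model points, camera matrix, Euler-angle convention and the classification thresholds are treated as illustrative assumptions.

# Sketch of head-pose estimation: PnP -> Rodrigues -> Euler angles -> class.
import cv2
import numpy as np

def head_pose_angles(model_points_3d, image_points_2d, camera_matrix):
    dist_coeffs = np.zeros((4, 1))                       # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(model_points_3d, image_points_2d,
                                  camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)                           # rotation vector -> matrix
    # Recover angles about Y (alpha), Z (beta) and X (gamma) from a ZYX decomposition.
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    gamma = np.degrees(np.arctan2(R[2, 1], R[2, 2]))     # about X
    alpha = np.degrees(np.arctan2(-R[2, 0], sy))         # about Y
    beta = np.degrees(np.arctan2(R[1, 0], R[0, 0]))      # about Z
    return alpha, beta, gamma

def classify_head_pose(alpha, beta, gamma):
    # Illustrative thresholds following the ranges listed above.
    if -10 <= alpha <= -5 and -15 <= beta <= -5 and 8 <= gamma <= 15:
        return "lower right"
    if 5 <= alpha <= 15 and 10 <= beta <= 25 and 2 <= gamma <= 11:
        return "left"
    return "straight ahead"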
2. The method for extracting the eye features comprises the following steps:
First, the camera origin is connected with the pupil center to obtain the intersection point of this line with the eyeball sphere; the equation of the line is shown in formula (4):
where the camera origin is O_c, the pupil center is T, the eyeball center is E, the eyeball radius is R, and the fovea is point P; the eye feature points can be obtained from a 3D model;
the constraint equation of the fovea P is shown in formula (5):
(X - X_E)^2 + (Y - Y_E)^2 + (Z - Z_E)^2 = R^2    (5)
the ray emitted from the fovea through the pupil center is the estimated direction of the visual axis;
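For reference, a minimal Python sketch of the visual-axis estimation of formulas (4) and (5) is given below; it intersects the ray from the camera origin through the pupil center with the eyeball sphere. The coordinate values, the eyeball radius and the way the fovea is approximated are illustrative assumptions.

# Sketch: ray-sphere intersection and visual-axis direction (cf. formulas (4)-(5)).
import numpy as np

def ray_sphere_intersection(origin, direction, center, radius):
    d = direction / np.linalg.norm(direction)
    oc = origin - center
    b = 2.0 * np.dot(d, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0:
        return None                      # the ray misses the eyeball sphere
    t = (-b - np.sqrt(disc)) / 2.0       # nearer intersection point
    return origin + t * d

O_c = np.zeros(3)                        # camera origin
T = np.array([0.02, 0.01, 0.55])         # pupil center from the 3D model (assumed)
E = np.array([0.02, 0.01, 0.567])        # eyeball center (assumed)
R = 0.012                                # eyeball radius in meters (assumed)
hit = ray_sphere_intersection(O_c, T - O_c, E, R)
P = E + (E - hit) if hit is not None else None       # fovea approximated opposite the hit point
if P is not None:
    visual_axis = (T - P) / np.linalg.norm(T - P)     # ray from fovea through pupil center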
detecting the opening and closing degree of eyes, and calculating the height-width ratio of the detected eye key nodes;
the calculation formula of the eye height-width ratio is shown in formula (6):
an eye whose EAR corresponds to an opening degree of less than 80% is considered closed; the blink frequency is computed as shown in formula (7):
where f represents the blink frequency, F_close represents the number of closed-eye (blink) frames per unit time, and F is the total number of frames in that unit time;
the eye aspect ratio threshold for determining eye closure is calculated as shown in formula (8):
EAR_close = (EAR_max - EAR_min) × (1 - x) + EAR_min    (8)
where EAR_close is the eye aspect ratio threshold for determining eye closure, EAR_max is the maximum opening degree, EAR_min is the minimum opening degree, and x is the eye opening degree;
a blink is detected from the EAR values when the EAR stays below 0.18 for 3 consecutive frames.
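For reference, a minimal Python sketch of the EAR computation and blink rule is given below. The six-landmark EAR layout is an assumption (the image of formula (6) is not reproduced in this text), formula (7) is assumed to take the form f = F_close / F from the definitions above, and the 0.18 threshold with the 3-consecutive-frame rule follows the description.

# Sketch of EAR, blink counting and blink frequency.
import numpy as np

def eye_aspect_ratio(eye):                     # eye: 6 (x, y) landmarks p1..p6 (assumed layout)
    p1, p2, p3, p4, p5, p6 = [np.asarray(p, float) for p in eye]
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = 2.0 * np.linalg.norm(p1 - p4)
    return vertical / horizontal

def count_blinks(ear_sequence, threshold=0.18, min_frames=3):
    blinks, run = 0, 0
    for ear in ear_sequence:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:              # 3 consecutive frames below 0.18 = one blink
                blinks += 1
            run = 0
    if run >= min_frames:
        blinks += 1
    return blinks

def blink_frequency(closed_frames, total_frames):
    return closed_frames / total_frames        # assumed form of formula (7): f = F_close / F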
3. The hand gesture extraction method comprises the following steps:
In OpenPose, the original image is used as the feature map F input to the first stage of the two-branch network; the first branch outputs the key point confidence maps S^1 = ρ^1(F), and the second branch outputs the joint vector field set (part affinity fields) L^1 = φ^1(F); the outputs are shown in formulas (9) and (10):
where ρ^1 and φ^1 denote the CNN of the first stage, t is the stage number, the output key point confidence map S contains J confidence maps representing J key points, and the joint vector field set L contains C vector fields representing C limbs;
the two loss functions at the t-th stage are shown in the formula (11) and the formula (12):
where S*_j and L*_c denote the ground-truth key point confidence maps and the ground-truth key point connection vector fields of the two branches, respectively, and W(p) is a Boolean value that is 0 when position p is not annotated in the image and 1 otherwise; the loss function f of the whole model is shown in formula (13):
for each pixel p in the image, its ground-truth confidence S*_{j,k}(p) is given by formula (14):
where x_{j,k} is the true position of the j-th key point of the k-th person, and σ is a model parameter that sets the peak range of the confidence; when confidence peaks overlap or intersect at pixel p, the maximum key point confidence value is taken, as shown in formula (15):
where S*_j(p) represents the ground-truth confidence of the image, with dimensions W × H × J, where W and H are the width and height of the input image;
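For reference, a minimal Python sketch of the ground-truth confidence maps of formulas (14) and (15) is given below: a Gaussian peak is placed at each annotated key point and the maps are merged with a pixel-wise maximum. The array layout and the σ value are illustrative assumptions.

# Sketch of ground-truth key point confidence maps (cf. formulas (14)-(15)).
import numpy as np

def confidence_maps(keypoints, width, height, sigma=7.0):
    """keypoints: list over people; each entry is a length-J list of (x, y) or None."""
    J = len(keypoints[0])
    ys, xs = np.mgrid[0:height, 0:width]
    S = np.zeros((J, height, width), dtype=np.float32)
    for person in keypoints:
        for j, xy in enumerate(person):
            if xy is None:                      # unlabeled point: contributes nothing
                continue
            x, y = xy
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (sigma ** 2))  # Gaussian peak
            S[j] = np.maximum(S[j], g)          # pixel-wise maximum across people
    return S                                    # J maps of size H x W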
the correlation between key points d_{j2} and d_{j1} is shown in formula (16) and formula (17):
p(u) = (1 - u)·d_{j1} + u·d_{j2}    (17)
where p(u) represents a point sampled between key points d_{j1} and d_{j2}, L_c(p(u)) represents the PAF value of limb c at p(u), and u represents the interpolation factor; the smaller the angle between L_c(p(u)) and the unit limb vector, the greater the correlation;
finally, a correlation set containing the key point confidence maps, the affinity fields and the key points is obtained; the final joint connection problem is treated as a bipartite graph matching problem, the Hungarian algorithm is used to complete the matching, and a pose estimation graph is finally formed;
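For reference, a minimal Python sketch of the limb-association score of formulas (16) and (17) is given below; it samples points p(u) between two candidate key points and accumulates the dot product between the PAF value at each sample and the unit limb vector. The PAF array layout (2, H, W) and the number of samples are illustrative assumptions.

# Sketch of the PAF-based association score between two candidate key points.
import numpy as np

def association_score(paf_c, d_j1, d_j2, num_samples=10):
    d_j1, d_j2 = np.asarray(d_j1, float), np.asarray(d_j2, float)
    limb = d_j2 - d_j1
    norm = np.linalg.norm(limb)
    if norm < 1e-6:
        return 0.0
    unit = limb / norm                          # unit limb vector
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        p = (1.0 - u) * d_j1 + u * d_j2         # formula (17): p(u)
        x = int(np.clip(round(p[0]), 0, paf_c.shape[2] - 1))
        y = int(np.clip(round(p[1]), 0, paf_c.shape[1] - 1))
        L_c = paf_c[:, y, x]                    # PAF value of limb c at p(u)
        score += float(np.dot(L_c, unit))       # small angle -> large contribution
    return score / num_samples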
a projected Euclidean distance is constructed and calculated from the human body key point coordinates identified by OpenPose, as shown in formula (18):
where L_i (i = 1, 2) is the projected Euclidean distance between key points, and (x_j, y_j) are the coordinates of the key points;
if the right-hand coordinates are not detected, the right hand is considered to be at the lower right;
if the calculated wrist-to-neck distance is within the range (500, 700), the hand is considered to be in front of the body;
if the calculated wrist-to-neck distance is within the range (200, 300), the hand is considered to be at the right or left side of the head.
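For reference, a minimal Python sketch of the hand-gesture rule above is given below; the pixel thresholds follow the text, while the OpenPose key point index numbers are illustrative assumptions.

# Sketch of the wrist-to-neck distance rule for hand-gesture classification.
import numpy as np

NECK, R_WRIST, L_WRIST = 1, 4, 7                 # assumed OpenPose keypoint indices

def projected_distance(a, b):                    # 2D Euclidean distance (cf. formula (18))
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def classify_hand(keypoints):
    neck = keypoints.get(NECK)
    wrist = keypoints.get(R_WRIST)
    if wrist is None:                            # right hand not detected
        return "lower right"
    d = projected_distance(wrist, neck)
    if 500 <= d <= 700:
        return "in front of the body"
    if 200 <= d <= 300:
        return "beside the head"
    return "other"

print(classify_hand({NECK: (320, 240), R_WRIST: (320, 840)}))  # -> "in front of the body"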
4. The foot gesture extraction method is as follows: the right foot gesture feature is divided into on-pedal and in front of pedal.
Step three, classifying the non-driving gesture global features and finally determining the non-driving gesture type, as shown in Table 1.
TABLE 1 non-driving gesture feature classification
And step four, identifying the global features of the non-driving gestures.
A skeleton structure diagram of the driver is extracted with OpenPose, the images are processed frame by frame with a graph convolutional neural network to extract the spatial features of the non-driving gesture, the spatial features are input into the LSTM network at each time step, and finally an attention mechanism is added to perform weighted fusion of the LSTM outputs automatically, yielding the final non-driving gesture features; a fully connected layer with a softmax function is used as the classifier to give the corresponding non-driving gesture class.
The skeleton structure diagram obtained by the OpenPose algorithm is fed into the graph convolutional neural network using parallel operation logic to obtain the spatial structure feature output v_t of the human skeleton, and the spatial feature sequence V = (v_1, v_2, ..., v_t) is obtained in turn, where t is the sequence length;
the output V of the graph convolutional neural network is used as the input of the LSTM, and the non-driving gesture features H = (h_1, h_2, h_3, ..., h_t) at each time step are calculated and output through a single-layer LSTM network;
(1) The forget gate determines how much of the memory cell C_{t-1} at the previous time step can be retained in the current memory cell C_t; the calculation formula is shown in formula (19):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (19)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes concatenating the two vectors end to end into one vector, b_f is the bias term of the forget gate, and σ denotes the sigmoid function;
(2) The input gate determines how much of the network input x_t at the current time step can be stored in the current memory cell C_t; the calculation formulas are shown in formula (20) and formula (21):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (20)
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)    (21)
where i_t ∈ (0, 1) is the output of the input gate, indicating how much information should be written to the memory cell at the current time step; W_i is the weight matrix of the input gate, σ is the sigmoid activation function, W_c is the weight matrix of the other part of the input gate, b_i is the bias term of the input gate, b_c is the bias term of the other part of the input gate, and C_t' is the candidate memory cell state at time t;
the memory cell C_t at the current time step is calculated as shown in formula (22):
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C_t'    (22)
where f_t is the output of the forget gate and ⊙ denotes element-wise multiplication;
(3) The output gate determines how much of the current memory cell C_t can be output to the current hidden state h_t; the calculation formulas are shown in formula (23) and formula (24):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (23)
h_t = o_t ⊙ tanh(C_t)    (24)
where o_t is the output of the output gate, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the hidden state at time t;
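For reference, a minimal Python (NumPy) sketch of one LSTM step implementing formulas (19) to (24) is given below; the layer sizes and random weights are illustrative only.

# Sketch of a single LSTM cell step (formulas (19)-(24)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # (19) forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # (20) input gate
    C_tilde = np.tanh(W_c @ z + b_c)             # (21) candidate memory C_t'
    C_t = f_t * C_prev + i_t * C_tilde           # (22) memory cell update
    o_t = sigmoid(W_o @ z + b_o)                 # (23) output gate
    h_t = o_t * np.tanh(C_t)                     # (24) hidden state
    return h_t, C_t

hidden, inp = 64, 32                             # illustrative sizes
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.1, size=(hidden, hidden + inp))
b = lambda: np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden),
                 W(), b(), W(), b(), W(), b(), W(), b())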
An attention mechanism is added at the end of the LSTM, and its calculation formula is shown in formula (25):
where H is the output sequence of the LSTM structure and r is a learnable weight matrix;
the softmax function converts the LSTM outputs into a weight for each time step, the weights are multiplied with the LSTM outputs to obtain the non-driving gesture spatial features output by the network, and the recognition class is obtained through the final fully connected layer and the softmax function.
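For reference, a condensed PyTorch sketch of the recognition head described above is given below: per-frame graph-convolution features are fed to a single-layer LSTM, a softmax attention over time performs the weighted fusion, and a fully connected layer outputs the class scores. The layer sizes and the assumption of 10 output classes (classes a to j) are illustrative; the exact form of formula (25) is not reproduced in this text.

# Sketch of the LSTM + attention + fully connected recognition head.
import torch
import torch.nn as nn

class NonDrivingGestureHead(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # single-layer LSTM
        self.attn = nn.Linear(hidden, 1, bias=False)              # learnable weight r (assumed form)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, v_seq):                    # v_seq: (batch, T, feat_dim) GCN features
        H, _ = self.lstm(v_seq)                  # (batch, T, hidden)
        a = torch.softmax(self.attn(H), dim=1)   # per-frame attention weights
        context = (a * H).sum(dim=1)             # weighted fusion of LSTM outputs
        return self.fc(context)                  # class scores for gestures a-j

logits = NonDrivingGestureHead()(torch.randn(2, 30, 128))
print(logits.shape)                              # torch.Size([2, 10])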
Examples
In this embodiment, the driving simulator and the SCANeR studio software are used to simulate the scene and the automatic driving environment, and the camera is used to record the driver's posture.
Experimental condition settings:
the experimental scene is designed as a straight line section of the expressway with two-way eight lanes, the length of the expressway section is set to 12km, the speed limit is 120km/h, the width of each lane is 3.75m, the central separation belt is set to 1m, the width of the left side road edge belt is set to 0.75m, and the width of the right side hard road shoulder is set to 1.5m.
The weather is sunny and the traffic flow is steady. The automatic driving vehicle is set to travel in the second lane from the right at a speed of 110 km/h; when an emergency takeover scenario is encountered, the vehicle issues the voice prompt "automatic driving failure, please take over". Four secondary-task conditions are set: no secondary task, operating the central control screen (operating a tablet mounted to the right of the steering wheel), making a phone call, and drinking water.
Experimental operation:
Thirty test subjects were recruited, and 120 experiments were performed on the driving simulator both with the method of the invention and with baseline methods (support vector machine (SVM) and random forest algorithms).
Conclusion:
by comparing the SVM algorithm with the random forest algorithm, the method can improve the algorithm identification accuracy, and the accuracy is 1.2% higher than that of the SVM algorithm and 2.4% higher than that of the random forest algorithm. The invention realizes the identification and classification of the non-driving gesture, and is beneficial to improving the safety and reliability of the automatic driving vehicle.
The above description is not intended to limit the invention to the disclosed embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art; it is not possible to enumerate all embodiments here, and all such obvious variations remain within the scope of the invention.

Claims (10)

1. The non-driving gesture recognition method for the driver of the L3-level automatic driving vehicle is characterized by comprising the following steps of:
step one, monitoring and collecting non-driving gesture video data of a driver in real time;
step two, extracting local characteristic data of the non-driving gesture;
step three, classifying the non-driving gesture global features;
and step four, identifying the global features of the non-driving gestures.
2. The method for recognizing the non-driving posture of the driver of the L3-stage automatic driving vehicle according to claim 1, wherein in the first step, the non-driving posture of the driver is recorded in real time, the upper body posture of the driver is recorded, and the distance from the foot of the driver to the pedal during the running of the vehicle is recorded.
3. The method for recognizing non-driving gestures of a driver of an L3-level autonomous vehicle according to claim 1, wherein in the second step, non-driving gesture local feature extraction is performed according to the collected data, and the method for extracting the head gesture is as follows:
firstly, detecting face key points, and selecting the face key points as a research object;
the relationship between the image coordinate system and the world coordinate system is shown in the formula (1):
where R is the rotation matrix, T is the translation matrix, (X, Y, Z) is a point in the world coordinate system, (U, V, W) is a point in the image coordinate system, and s is the depth, i.e., the value of the target point in the Z direction of the camera coordinate system;
the conversion of the camera coordinate system to the image center coordinate system is shown in formula (2):
wherein (X, Y, Z) is a point in the camera coordinate system and (u, v) is a point in the image coordinate system;
the conversion of the image center coordinate system to the image coordinate system is shown in formula (3):
wherein, (x, y) is a point in the image center coordinate system and (u, v) is a point in the image coordinate system;
a 3D face model is fitted using a 3D Morphable Model, the rotation matrix is obtained through OpenCV, and the corresponding rotation angles are solved with the Rodrigues rotation formula; the rotation angle about the Y axis is α, the rotation angle about the Z axis is β, and the rotation angle about the X axis is γ;
head gestures in the range (α: -3 to 0, β: -3 to 1, γ: -1 to 1) are classified as facing straight ahead;
head gestures in the range (α: -10 to -5, β: -15 to -5, γ: 8 to 15) are classified as facing the lower right;
head gestures in the range (α: 5 to 15, β: 10 to 25, γ: 2 to 11) are classified as facing the left.
4. The method for recognizing non-driving gestures of a driver of an L3-level automatic driving vehicle according to claim 1, wherein in the second step, non-driving gesture local feature extraction is performed according to the collected data, and the method for extracting eye features is as follows:
Firstly, the camera origin is connected with the pupil center to obtain the intersection point of this line with the eyeball sphere; the equation of the line is shown in formula (4):
where the camera origin is O_c, the pupil center is T, the eyeball center is E, the eyeball radius is R, and the fovea is point P; the eye feature points can be obtained from a 3D model;
the constraint equation of the fovea P is shown in formula (5):
(X - X_E)^2 + (Y - Y_E)^2 + (Z - Z_E)^2 = R^2    (5)
the ray emitted from the fovea through the pupil center is the estimated direction of the visual axis;
detecting the opening and closing degree of eyes, and calculating the height-width ratio of the detected eye key nodes;
the calculation formula of the eye height-width ratio is shown in formula (6):
an eye whose EAR corresponds to an opening degree of less than 80% is considered closed; the blink frequency is computed as shown in formula (7):
where f represents the blink frequency, F_close represents the number of closed-eye (blink) frames per unit time, and F is the total number of frames in that unit time;
the eye aspect ratio threshold for determining eye closure is calculated as shown in formula (8):
EAR_close = (EAR_max - EAR_min) × (1 - x) + EAR_min    (8)
where EAR_close is the eye aspect ratio threshold for determining eye closure, EAR_max is the maximum opening degree, EAR_min is the minimum opening degree, and x is the eye opening degree;
a blink is detected from the EAR values when the EAR stays below 0.18 for 3 consecutive frames.
5. The method for recognizing non-driving gestures of a driver of an L3-level automatic driving vehicle according to claim 1, wherein in the second step, non-driving gesture local feature extraction is performed according to the collected data, and the hand gesture extraction method is as follows:
In OpenPose, the original image is used as the feature map F input to the first stage of the two-branch network; the first branch outputs the key point confidence maps S^1 = ρ^1(F), and the second branch outputs the joint vector field set (part affinity fields) L^1 = φ^1(F); the outputs are shown in formulas (9) and (10):
where ρ^1 and φ^1 denote the CNN of the first stage, t is the stage number, the output key point confidence map S contains J confidence maps representing J key points, and the joint vector field set L contains C vector fields representing C limbs;
the two loss functions at the t-th stage are shown in the formula (11) and the formula (12):
where S*_j and L*_c denote the ground-truth key point confidence maps and the ground-truth key point connection vector fields of the two branches, respectively, and W(p) is a Boolean value that is 0 when position p is not annotated in the image and 1 otherwise; the loss function f of the whole model is shown in formula (13):
for each pixel p in the image, its ground-truth confidence S*_{j,k}(p) is given by formula (14):
where x_{j,k} is the true position of the j-th key point of the k-th person, and σ is a model parameter that sets the peak range of the confidence; when confidence peaks overlap or intersect at pixel p, the maximum key point confidence value is taken, as shown in formula (15):
where S*_j(p) represents the ground-truth confidence of the image, with dimensions W × H × J, where W and H are the width and height of the input image;
the correlation between key points d_{j2} and d_{j1} is shown in formula (16) and formula (17):
p(u) = (1 - u)·d_{j1} + u·d_{j2}    (17)
where p(u) represents a point sampled between key points d_{j1} and d_{j2}, L_c(p(u)) represents the PAF value of limb c at p(u), and u represents the interpolation factor; the smaller the angle between L_c(p(u)) and the unit limb vector, the greater the correlation;
finally, a correlation set containing the key point confidence maps, the affinity fields and the key points is obtained; the final joint connection problem is treated as a bipartite graph matching problem, the Hungarian algorithm is used to complete the matching, and a pose estimation graph is finally formed;
a projected Euclidean distance is constructed and calculated from the human body key point coordinates identified by OpenPose, as shown in formula (18):
where L_i (i = 1, 2) is the projected Euclidean distance between key points, and (x_j, y_j) are the coordinates of the key points;
if the right-hand coordinates are not detected, the right hand is considered to be at the lower right;
if the calculated wrist-to-neck distance is within the range (500, 700), the hand is considered to be in front of the body;
if the calculated wrist-to-neck distance is within the range (200, 300), the hand is considered to be at the right or left side of the head.
6. The method for recognizing the non-driving gesture of the driver of the L3-level automatic driving vehicle according to claim 1, wherein in the third step, the non-driving gesture is classified into a-j types according to the combination of the range of the head gesture, the eye gaze direction, the range of the left and right hand gestures, and the right foot gesture.
7. The method for recognizing the non-driving gesture of the driver of the L3-level automatic driving vehicle according to claim 1, wherein in step four, the skeleton structure diagram of the driver is extracted with OpenPose, the images are processed frame by frame with a graph convolutional neural network to extract the spatial features of the non-driving gesture, the spatial features are input into the LSTM network at each time step, and finally an attention mechanism is added to perform weighted fusion of the LSTM outputs automatically, yielding the final non-driving gesture features; a fully connected layer with a softmax function is used as the classifier to give the corresponding non-driving gesture class.
8. The method for recognizing the non-driving gesture of the driver of the L3-level automatic driving vehicle according to claim 7, wherein in step four, the skeleton structure diagram obtained by the OpenPose algorithm is fed into the graph convolutional neural network using parallel operation logic to obtain the spatial structure feature output v_t of the human skeleton, and the spatial feature sequence V = (v_1, v_2, ..., v_t) is obtained in turn, where t is the sequence length;
the output V of the graph convolutional neural network is used as the input of the LSTM, and the non-driving gesture features H = (h_1, h_2, h_3, ..., h_t) at each time step are calculated and output through a single-layer LSTM network;
(1) The forget gate determines how much of the memory cell C_{t-1} at the previous time step can be retained in the current memory cell C_t; the calculation formula is shown in formula (19):
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (19)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes concatenating the two vectors end to end into one vector, b_f is the bias term of the forget gate, and σ denotes the sigmoid function;
(2) The input gate determines how much of the network input x_t at the current time step can be stored in the current memory cell C_t; the calculation formulas are shown in formula (20) and formula (21):
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (20)
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)    (21)
where i_t ∈ (0, 1) is the output of the input gate, indicating how much information should be written to the memory cell at the current time step; W_i is the weight matrix of the input gate, σ is the sigmoid activation function, W_c is the weight matrix of the other part of the input gate, b_i is the bias term of the input gate, b_c is the bias term of the other part of the input gate, and C_t' is the candidate memory cell state at time t;
the memory cell C_t at the current time step is calculated as shown in formula (22):
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C_t'    (22)
where f_t is the output of the forget gate and ⊙ denotes element-wise multiplication;
(3) The output gate determines how much of the current memory cell C_t can be output to the current hidden state h_t; the calculation formulas are shown in formula (23) and formula (24):
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (23)
h_t = o_t ⊙ tanh(C_t)    (24)
where o_t is the output of the output gate, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the hidden state at time t;
an attention mechanism is added at the end of the LSTM, and its calculation formula is shown in formula (25):
where H is the output sequence of the LSTM structure and r is a learnable weight matrix;
the softmax function converts the LSTM outputs into a weight for each time step, the weights are multiplied with the LSTM outputs to obtain the non-driving gesture spatial features output by the network, and the recognition class is obtained through the final fully connected layer and the softmax function.
9. A system of the L3-level automatic driving vehicle driver non-driving posture recognition method according to any one of claims 1 to 8, characterized in that the system includes an information acquisition device, a non-driving posture local feature extraction device, and a driver non-driving posture global feature classification and recognition device.
10. The L3 level autonomous vehicle driver non-driving gesture recognition system of claim 9, wherein the information acquisition device comprises a wireless transmitter and two cameras;
the non-driving gesture local feature extraction device comprises a wireless receiver, a skeleton recognition module, a head gesture feature extraction module, an eye gesture feature extraction module, a hand gesture feature extraction module, a foot gesture feature extraction module and a wireless transmitter; the wireless receiver is used for receiving the non-driving gesture sent by the information acquisition module, and the wireless transmitter is used for sending the extracted local features of the non-driving gesture to the non-driving gesture global feature classification and recognition device;
the non-driving gesture global feature classification and identification device comprises a wireless receiver, a non-driving gesture spatial feature extraction module and a non-driving gesture classification module; the wireless receiver is used for receiving the non-driving gesture local features sent by the non-driving gesture local feature extraction device, and the non-driving gesture spatial feature extraction module is used for identifying the types of the non-driving gestures by utilizing the LSTM network and the attention mechanism.
CN202310953085.4A 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver Pending CN117274960A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310953085.4A CN117274960A (en) 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310953085.4A CN117274960A (en) 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Publications (1)

Publication Number Publication Date
CN117274960A true CN117274960A (en) 2023-12-22

Family

ID=89214833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310953085.4A Pending CN117274960A (en) 2023-08-01 2023-08-01 Non-driving gesture recognition method and system for L3-level automatic driving vehicle driver

Country Status (1)

Country Link
CN (1) CN117274960A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649454A (en) * 2024-01-29 2024-03-05 北京友友天宇***技术有限公司 Binocular camera external parameter automatic correction method and device, electronic equipment and storage medium
CN117649454B (en) * 2024-01-29 2024-05-31 北京友友天宇***技术有限公司 Binocular camera external parameter automatic correction method and device, electronic equipment and storage medium

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination