CN113335291B - Man-machine driving-sharing control right decision method based on man-vehicle risk state - Google Patents

Man-machine driving-sharing control right decision method based on man-vehicle risk state

Info

Publication number
CN113335291B
CN113335291B (application CN202110848303.9A)
Authority
CN
China
Prior art keywords
vehicle
risk
human
driving
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110848303.9A
Other languages
Chinese (zh)
Other versions
CN113335291A (en)
Inventor
郭柏苍
金立生
谢宪毅
贺阳
韩广德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202110848303.9A priority Critical patent/CN113335291B/en
Publication of CN113335291A publication Critical patent/CN113335291A/en
Application granted granted Critical
Publication of CN113335291B publication Critical patent/CN113335291B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B60W40/09 Driving style or behaviour
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system
    • B60W50/082 Selecting or switching between different modes of propelling
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/80 Technologies aiming to reduce greenhouse gasses emissions common to all road transportation technologies
    • Y02T10/84 Data processing systems or methods, management, administration

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The invention relates to a human-machine co-driving control right decision method based on the human-vehicle risk state, and belongs to the technical fields of automobile driving assistance and automatic driving. The method comprises extracting the agent's environment features from human and vehicle risk monitoring information, a reinforcement learning risk decision framework based on complete-information static game theory, and a control right decision method that calibrates different switching occasions. The method provides theoretical support for switching the control right to the automatic driving system in time when the intelligent vehicle is in a high-risk state, so that the automatic driving system can take over the vehicle and reduce the driving risk under such special conditions.

Description

Man-machine driving sharing control right decision method based on man-vehicle risk state
Technical Field
The invention belongs to the technical fields of automobile driving assistance and automatic driving, and particularly relates to a human-machine co-driving control right decision method based on the human-vehicle risk state.
Background
Human-machine co-driving means that the driver and the intelligent system are in the loop at the same time, share control of the vehicle, and cooperatively complete the driving task. Compared with common ADAS functions, a co-driving intelligent automobile has a dual-loop parallel control structure because both the human and the machine are control subjects; the objects controlled by the two sides are cross-linked and coupled, and their state transitions constrain each other. Through the hybrid enhancement of human and machine intelligence, human-machine co-driving can combine the respective advantages of the two to form bidirectional information exchange and control, yielding a human-machine cooperative hybrid intelligent system whose whole is greater than the sum of its parts (1+1>2). This can promote the development of automobile intelligence and strongly support the development of the automobile and artificial intelligence industries in China.
Since intelligent cars from level 0 to level 4 automation (levels L1 to L4 in SAE J3016) all require human-machine cooperative driving, and the main cause of a vehicle falling into a dangerous state is driver misoperation or dangerous driving behavior, this situation will persist for a long time before the era of fully automated driving. The research difficulty lies mainly in the case where the driver holds the control right, exhibits risky driving behavior, and ignores the system's early warning: a reasonable and accurate human-machine co-driving control right switching strategy must then be formulated so that the system takes over control of the vehicle in the risk scenario and improves vehicle safety.
Disclosure of Invention
The invention aims to provide a theoretical basis and technical support for the problem of human-machine control right allocation when an intelligent vehicle is in a risky driving state. Specifically, in the manual driving mode of the intelligent vehicle, when the driver performs risky or malicious driving behavior and the vehicle is in a continuous high-risk state, the automatic driving system needs to make a control right switching decision promptly and effectively, take control of the vehicle at a reasonable time, and avoid the potential risk caused by the driver's bad driving behavior.
In order to achieve the purpose, the specific technical scheme of the man-machine driving sharing control right decision method based on the man-vehicle risk state is as follows:
the invention has the precondition that (1) the intelligent automobile is provided with a driver behavior monitoring module and a vehicle running risk state monitoring module, so that the risk levels of the driver behavior and the vehicle running state can be monitored and quantified in real time; (2) the intelligent automobile has high automatic driving capability and higher scope of action, and has the capability of coping with the risk driving scene.
The invention considers the influence of factors such as the driver's emotion, psychological state, driving style, or driving experience on driving safety. When the driver intentionally or unintentionally keeps executing a risky driving task so that the vehicle risk keeps increasing, and the early warning is ignored, the intelligent vehicle forcibly switches the control right to the automatic driving system and uses its automatic driving mode to reduce the vehicle's risk level in time.
The invention provides a human-machine co-driving control right decision method. The method fuses the driver risk behavior monitoring information and the vehicle driving safety information of the intelligent vehicle and feeds them into a decision model; after a comprehensive risk assessment, the decision model outputs a decision result to the driver and the control system, and the overall risk level of the vehicle is reduced in time through an appropriate early warning or control right switch.
The invention constructs a reinforcement learning model based on a comprehensive safety analysis of driver behavior and the vehicle driving state, and explores an intelligent automobile control right allocation decision method that takes the human-vehicle risk state as the evaluation basis and aims to reduce the overall risk in risky driving scenarios. The technical scheme adopted by the invention is as follows:
The method comprises the following steps, carried out in sequence:
Step S1, establishing a reinforcement learning reward and punishment mechanism based on the human-vehicle risk state game relationship;
Step S1-1, on the basis that the intelligent vehicle (the agent) already has the capability of predicting the driver's risky driving behavior state and the vehicle driving risk state, processing the human-vehicle risk monitoring results with a Markov decision process so that they conform to the operation rules of the reinforcement learning algorithm;
Step S1-2, for the reward function setting problem in the reinforcement learning algorithm framework, taking the expected maximum utility theorem as the criterion and the maximization of utility (the overall safety of the vehicle) as the goal, proposing a human-vehicle risk state game method based on the complete-information static game;
Step S1-3, using the relative distance to the ideal point calculated by the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) as the quantification of strategy payoff: based on the index weights calculated by the entropy weight method, taking the driving behavior characterization indexes and time margins of the risk-free driving state as the negative ideal point, extracting the driving behavior characterization indexes and time margin data of the other risk levels, and calculating the relative distances referenced to the negative ideal point, to obtain a utility matrix that accounts for the human-vehicle risk game relationship;
Step S2, proposing a human-machine co-driving control right decision method based on the reinforcement learning algorithm framework;
Step S2-1, describing the agent's environment interaction mode by means of the time-series characteristics of the Markov decision process, and embedding the human-vehicle risk monitoring results into the reinforcement learning algorithm framework;
Step S2-2, proposing a human-machine co-driving control right decision method based on the reinforcement learning algorithm framework, and traversing the model's coefficients (the human and vehicle risk weights and the reward function coefficient) and switching times in a global optimization manner so that the model obtains a relatively optimal decision output and the optimal automatic-system switching time is taken into account;
thereby completing the construction of the whole decision model.
The specific construction steps are as follows:
Firstly, a reinforcement learning reward and punishment mechanism based on the human-vehicle risk state game relationship is established. Specifically, the human and vehicle risk states are taken as the players of the game. In theory, during driving, the human and vehicle risk state prediction models advance with time and compute their risk monitoring results at the same moments, so the information of the two players is mutually known, i.e., complete information; in addition, the historical sequence of prediction results no longer changes, which makes it a static game. In summary, the human-vehicle risk state game relationship of the present invention is a complete-information static game.
It comprises the following steps:
a) The reward function of reinforcement learning is the key to guiding the agent to accomplish the expected goal; in classical reinforcement learning tasks, reward and punishment values are generally formulated according to how well the goal is achieved. For the reward function setting problem in the A2C algorithm framework, the invention proposes a human-vehicle risk state game method based on the complete-information static game, which considers the interaction between the human and vehicle states on the basis that the agent can predict the driver's behavior state and the safety of the vehicle driving state, takes the expected maximum utility theorem as the criterion, and takes the maximization of utility (the overall safety of the vehicle) as the goal.
b) A strategy utility function calculation method based on the entropy weight-TOPSIS method is proposed. To avoid scale confusion caused by non-uniform index dimensions and to keep the indexes centered on the reference, an intermediate-type index processing method is adopted to make the indexes positively oriented. The relative distance to the ideal point calculated by the TOPSIS method is used as the quantification of strategy payoff: based on the index weights calculated by the entropy weight method, the driving behavior characterization indexes and time margin of the risk-free driving state are taken as the negative ideal point, the driving behavior characterization indexes and time margins of the other risk levels are extracted, and the relative distances referenced to the negative ideal point are calculated respectively. The larger the relative distance between a driving state of a given risk level and the negative ideal point, the better its payoff, and vice versa. The specific calculation method is as follows:
the first step, a standardized assessment matrix is constructed, X is a raw data matrix, m is the dimension of the index, n is the number of the index, X'ijIs a standardized data.
Figure BDA0003181517470000041
Figure BDA0003181517470000042
In the second step, the characteristic proportion of each index is calculated (formula published as an image in the original).
In the third step, the information entropy of each index is calculated (formulas published as images in the original), where p_ij is the characteristic proportion of the index and e_i is the information entropy of the index.
In the fourth step, the weights are calculated from the information entropy redundancy (equation (5), published as an image in the original), where w_j is the weight of the index.
In the fifth step, the intermediate-type (moderate) transformation is applied to the indexes (formulas published as images in the original), where x_ij is the original datum and the transformed value is the datum after the intermediate regularization. The characterization index of the risk-free driving grade is taken as the control variable, and x_L denotes the characterization index of the other risk grades.
In the sixth step, the initial matrix is normalized (formulas published as images in the original), where z_ij is the normalized value of the positively-oriented index, i.e., each element divided by the norm of its column vector.
In the seventh step, the relative distance between each risk level and the negative ideal point (i.e., the utility value) is calculated by equation (11) (published, together with its companion formula, as images in the original). Here w_j is the entropy weight calculated by equation (5); the relative distances of the driving states of each risk level to the negative ideal point are used to construct the utility matrix of the human-vehicle risk game relationship.
When constructing the utility matrix, ρ and σ are set as the utility values of the human risk state and the vehicle risk state respectively, u_human(σ_t) and u_vehicle(σ_t) are the expected utility functions of the driving behavior risk state and the vehicle driving risk state respectively, and q is the strategy probability. The utility matrix corresponding to the human-vehicle risk state grades is constructed as shown in Table 1.
Table 1 (published as an image in the original document).
When the driving behavior risk state is RP_i^human, the expected payoff of the vehicle risk state is u_vehicle(σ_i) = q·σ_i + (1-q)·σ_(i+1); when the driving behavior risk state is RP_j^human, the expected payoff of the vehicle risk state is u_vehicle(σ_j) = q·σ_j + (1-q)·σ_(j+1).
In order for the vehicle risk state to have a stable utility under any driving behavior risk state (i.e., a game equilibrium), the utility function U(σ) is calculated according to equation (12) (published as an image in the original), where the utility values σ_i, σ_(i+1), σ_j and σ_(j+1) are calculated by equation (11).
The meaning of the optimal utility obtained from the complete-information static equilibrium game is as follows: the human-vehicle risk states at the current moment and the next moment are taken as the players of the game, considering that the vehicle risk studied by the present invention is dominated by driver behavior. Therefore, whether the driving behavior risk level at the next moment develops in a relatively higher or lower direction, the next step should promote a change of the vehicle risk state in a relatively safer direction. The A2C reward function constructed from the utility function U(σ), using the equilibrium probability of the vehicle risk as the reference point, rewards or penalizes the Actor's strategy and thereby drives the next state toward a safer (i.e., higher-utility) direction.
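As an illustration of the complete-information static game just described, the sketch below computes the expected vehicle payoff u_vehicle = q·σ_a + (1-q)·σ_b given in the text and solves the indifference condition under which the vehicle payoff is the same for both driver risk states. Since equation (12) for U(σ) is available only as an image, the indifference-based solution for q is an assumption about how the equilibrium is obtained, not the patent's exact formula.

```python
def expected_vehicle_payoff(q, sigma_a, sigma_b):
    """u_vehicle = q * sigma_a + (1 - q) * sigma_b, as in the description."""
    return q * sigma_a + (1.0 - q) * sigma_b

def indifference_probability(sig_i, sig_i1, sig_j, sig_j1):
    """Strategy probability q at which the vehicle payoff is the same under
    both driver risk states (the stable / equilibrium condition assumed here):
        q*sig_i + (1-q)*sig_i1 = q*sig_j + (1-q)*sig_j1
    """
    denom = (sig_i - sig_i1) - (sig_j - sig_j1)
    if abs(denom) < 1e-12:
        return None  # parallel payoffs: no interior equilibrium
    q = (sig_j1 - sig_i1) / denom
    return min(max(q, 0.0), 1.0)
```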
Secondly, a human-machine co-driving control right decision method based on the reinforcement learning algorithm framework is proposed. Specifically, the goal of reinforcement learning is to obtain reward signals through interaction with the environment, maximize the expectation of cumulative future reward, and finally learn a good policy: when an action contributes positively to maximizing the reward, that action is reinforced and the agent will choose it again when it encounters the same state; conversely, when an action brings a negative benefit, it is weakened. The specific steps are as follows:
a) The information output by the driver behavior risk prediction method and the vehicle risk prediction method proposed by the invention constitutes an MDP problem. In each time step t, the agent receives a vector s_t representing the environmental situation, formulates an action strategy according to s_t, and generates an action A_t under the guidance of that strategy; based on A_t, the reward function generates a reward r_(t+1), which is given to the agent at the next moment, and the agent iteratively updates to s_(t+1). This process is repeated to form the agent's interaction trajectory τ = [s_0, A_0, R_1, s_1, A_1, R_2, …, s_n, A_n, R_(n+1)]; the complete MDP procedure is shown in Fig. 2. At any time t, the goal of the agent is to maximize the expectation E(G_t) of the total reward G_t (equation (13), published as an image in the original), which in this study means maximizing vehicle safety.
The discount coefficient γ (0 ≤ γ ≤ 1) distinguishes the importance of immediate and future rewards: when γ is close to 0, the agent pays more attention to the current immediate reward; when γ is close to 1, the agent focuses more on future rewards when making decisions. R_k is the reward at time k.
The agent and the environment are the two essential parts of a reinforcement learning model. The agent extracts information from the environment, outputs an action strategy, executes action A_t, and then receives the reward signal R_t; after receiving the action A_(t+1) sent by the agent, the environment outputs the reward signal R_(t+1) together with the next observed state. The process is shown in Fig. 3.
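The interaction loop and the discounted return G_t can be sketched as follows, assuming a gym-style environment interface (the example in this document lists gym as a dependency); the helper name and the horizon parameter are illustrative.

```python
def rollout(env, policy, gamma=0.99, horizon=200):
    """Collect one interaction trajectory tau = [s0, A0, R1, s1, A1, R2, ...]
    and the discounted return G_0 = sum_k gamma^k * R_{k+1}."""
    s = env.reset()
    states, actions, rewards = [s], [], []
    for _ in range(horizon):
        a = policy(s)                  # Actor chooses A_t from s_t
        s, r, done, _ = env.step(a)    # environment returns R_{t+1}, s_{t+1}
        actions.append(a)
        rewards.append(r)
        states.append(s)
        if done:
            break
    # gamma close to 1 -> future rewards weigh more in the return
    G0 = sum((gamma ** k) * r_k for k, r_k in enumerate(rewards))
    return states, actions, rewards, G0
```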
b) The vehicle's overall safety information pool (composed of human and vehicle state parameters) serves as the reinforcement learning environment, and the agent extracts data with time-series characteristics from the environment for iterative computation. Taking the driver risk evaluation level as an example, the agent needs to acquire the following features: (1) the regularized risky driving behavior characterization indexes [p_1, p_2, p_3, …, p_n]; (2) the risk assessment grade sequence [r_1, r_2, …, r_n], collected by the risky driving behavior prediction model. The output of the risky driving behavior prediction model at time t is denoted RP_t^human, the vehicle risk grade is denoted RP_t^vehicle, and the relationship between these and the overall risk RP is as follows:
RP = α·RP^human + β·RP^vehicle    (14)
where α and β are the decision weights corresponding to the human and vehicle states, α ∈ [0,1] and β ∈ [0,1]. Because the decision model pays more attention to the vehicle risk during driving than to the driver behavior risk, α is set smaller than β. The values of α and β determine the final decision effect, and their optimal values are found iteratively by observing the decision effect.
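A minimal sketch of the combined risk of equation (14); the example weights 0.2/0.8 are taken from the parameter study discussed later and are not mandated by the formula itself.

```python
def combined_risk(rp_human, rp_vehicle, alpha=0.2, beta=0.8):
    """RP = alpha * RP_human + beta * RP_vehicle (equation (14)).

    alpha < beta because the decision model pays more attention to the
    vehicle risk than to the driver-behaviour risk.
    """
    assert 0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0 and alpha < beta
    return alpha * rp_human + beta * rp_vehicle
```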
c) Taking the advantage actor-critic (A2C) algorithm of reinforcement learning as an example, a decision model is established; the internal architecture of the model is shown in Fig. 4.
The model contains two modules: (1) the output of the Actor network is the human-machine co-driving control right decision result, i.e., the action vector of the action space, which contains decisions on the driver behavior and the vehicle risk state respectively; (2) the Critic network is used to judge the effect of the decision result in the given environment. Both network modules use an LSTM neural network to process the sequence features extracted by the agent, and the Leaky Rectified Linear Unit (Leaky ReLU) is used as the activation function; the activation functions of the Actor and Critic networks are expressed as equations (15) and (16) respectively, and the principle of the algorithm and the feature extraction method are shown in Figs. 5 and 6 respectively.
[Equations (15) and (16) are published as images in the original document.]
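A minimal sketch of the two network modules, assuming TensorFlow's Keras API (the example in this document reports a tensorflow 1.15.4 environment); the hidden sizes and the tanh squashing of the Actor output to [-1, 1] are assumptions, not values given in the patent.

```python
import tensorflow as tf

def build_actor_critic(seq_len, n_features, hidden=64):
    """Actor / Critic pair with LSTM feature extraction and Leaky ReLU."""
    # Actor: outputs the control-right decision value in [-1, 1]
    actor = tf.keras.Sequential([
        tf.keras.layers.LSTM(hidden, input_shape=(seq_len, n_features)),
        tf.keras.layers.Dense(hidden),
        tf.keras.layers.LeakyReLU(alpha=0.01),
        tf.keras.layers.Dense(1, activation="tanh"),
    ])
    # Critic: outputs the state value V(s | w)
    critic = tf.keras.Sequential([
        tf.keras.layers.LSTM(hidden, input_shape=(seq_len, n_features)),
        tf.keras.layers.Dense(hidden),
        tf.keras.layers.LeakyReLU(alpha=0.01),
        tf.keras.layers.Dense(1),
    ])
    return actor, critic
```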
d) The monitoring results of the risky driving behavior prediction model and the vehicle risk level prediction model (i.e., the driver behavior risk level and the vehicle driving risk level) are fed into the Critic network to train the A2C model; the goal of the Actor network is to maximize the objective function.
J(θ) = E[log π(A|s, θ)·A_adv(s, A)]    (17)
A_adv(s_t, A_t) = R_t + γ·V(s_(t+1)|w) - V(s_t|w)    (18)
where A_adv(s, A) is the advantage function. To calculate the advantage function, V(s|w) is computed by the Critic network, and the Critic network is optimized by minimizing the temporal-difference error (TD error). A positive TD error indicates that the action output by the Actor's strategy was good, and a negative one indicates it was bad; the Actor adjusts its strategy for the next round according to this information.
TD_error = r + γ·V(s_(t+1)) - V(s_t)    (19)
J(w) = [R_t + γ·V(s_(t+1)|w) - V(s_t|w)]²    (20)
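The per-step quantities of equations (17)-(20) can be written compactly as follows; this is a generic one-step A2C formulation, with the actor loss expressed as the negative of the objective J(θ) so that both terms are minimized.

```python
def a2c_losses(r_t, v_t, v_tp1, log_pi_a, gamma=0.99):
    """One-step A2C quantities from equations (17)-(20).

    TD error / advantage: A_adv = r_t + gamma * V(s_{t+1}) - V(s_t)
    Actor objective:      J(theta) = log pi(A|s) * A_adv   (maximized)
    Critic objective:     J(w) = (r_t + gamma * V(s_{t+1}) - V(s_t))^2  (minimized)
    """
    advantage = r_t + gamma * v_tp1 - v_t
    actor_loss = -log_pi_a * advantage      # minimize the negative objective
    critic_loss = advantage ** 2
    return advantage, actor_loss, critic_loss
```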
e) The agent extracts the feature variables of the environment space into the action space and, according to the human-vehicle risk state, produces a risk-reducing decision result. The decision result is a continuous value between -1 and 1: the closer to 1, the more the agent supports maintaining the current driving state; otherwise, measures such as early warning or control right switching should be taken.
f) The expected utility function is used to measure the effect of comprehensive driving safety, as in equation (21), where U(σ_t) is the utility function supported by the agent's reward function and is calculated by equation (12). This is converted into the reinforcement learning reward function R_reward, as in equation (22) (published as an image in the original).
E[U(σ_t)] = E[U(σ_0 + Σ δ·σ_t)]    (21)
where μ ∈ [0,1] is the coefficient of the reward function and one of the key parameters for adjusting the decision effect; the value of μ determines the magnitude of the reward R_reward. ΔRP_t^human and ΔRP_t^vehicle are, respectively, the differences between the current human and vehicle risk levels and those of the previous state.
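Since equation (22) is published only as an image, the following is a hypothetical reading of the reward R_reward consistent with the surrounding description: it scales the game utility of the new state by μ and penalizes increases of the human and vehicle risk levels. The exact functional form in the patent may differ.

```python
def reward(mu, u_sigma_t, d_rp_human, d_rp_vehicle):
    """Hypothetical form of R_reward (equation (22) is only an image in the
    original): reward the game utility U(sigma_t) scaled by mu, and penalize
    any rise of the human / vehicle risk levels relative to the last state.
    """
    assert 0.0 <= mu <= 1.0
    return mu * u_sigma_t - (1.0 - mu) * (d_rp_human + d_rp_vehicle)
```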
The human-machine co-driving control right decision method based on the human-vehicle risk state has the following advantage: for situations in which the driver performs risky or malicious driving behavior, the intelligent vehicle can, on the basis of monitoring the risk states of the driver and the vehicle, let the automatic driving system take over the vehicle control right in time, thereby avoiding further losses caused by human factors.
Drawings
Fig. 1 shows the general technical scheme of the invention.
Fig. 2 is a diagram of a markov chain for reinforcement learning according to the present invention.
FIG. 3 is an internal schematic diagram of reinforcement learning according to the present invention.
Fig. 4 is a human-machine co-driving control right decision model architecture based on reinforcement learning (actor critic algorithm) of the invention.
Fig. 5 is the A2C algorithm architecture.
Fig. 6 shows a feature extraction method of the algorithm.
FIG. 7 is a schematic diagram of the partitioning of a data set.
Fig. 8 is a diagram illustrating the variation of the accumulated rewards in Example 1.
Fig. 9 is a graph showing the change in the loss rate in Example 1.
Fig. 10 shows the control right decision results corresponding to different risk status levels of people and vehicles.
Fig. 11 is a schematic diagram of risk trend before and after control right switching.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the human-machine co-driving control right decision method based on the human-vehicle risk state is further described in detail with reference to the drawings.
The invention provides a control right decision method based on the human-vehicle risk state, aimed at the problems of control right decision and switching-time selection in the human-machine co-driving mode of intelligent automobiles. The method can output a reasonable decision result and request the automatic driving system to take over the control right in time, so that the potential safety hazards caused by the driver's dangerous driving behavior are fundamentally controlled. First, with the goal of maximizing the safety of the intelligent automobile, a human-vehicle risk game model is established using the TOPSIS method and complete-information static game theory, a strategy function that maximizes relative utility is proposed and embedded into the reinforcement learning reward function, and a reinforcement learning reward and punishment mechanism oriented toward maximizing the expected safety of the vehicle is derived. Second, exploiting the strength of reinforcement learning algorithms at sequential decision problems, a human-machine co-driving control right decision method based on the A2C algorithm is proposed; the output of the decision model is optimized by adjusting the human-vehicle risk decision weights and the reward function, the effectiveness of the training process and results is verified with model performance evaluation indexes, the influence of the switching time on vehicle safety is analyzed through simulation tests, and a control right decision method is provided that can limit the driver's risky behavior in a timely and effective manner and improve vehicle safety.
Example 1: model performance verification
The algorithm test is implemented on a high-performance computer configured as follows: i7-7700K CPU, NVIDIA GTX 1080Ti GPU, 32 GB memory. The compilation environment is tensorflow-1.15.4, and the maximum reward is set to 500. The Actor and Critic networks of the model use LSTM, and the activation function is Leaky ReLU. The training parameters are: 1000 epochs in total, batch_size 64, Actor learning rate 0.0005, Critic learning rate 0.001, and discount factor γ 0.99. The model's dependency libraries (packages) include numpy 1.18.1, pandas 1.0.1, scikit-learn 0.22.2.post1, gym 0.15.3, joblib 0.15.1, and matplotlib 3.2.2.
Table 2 (published as an image in the original document).
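The training configuration reported above can be collected as follows; the dictionary layout is merely an illustrative way of grouping the stated values.

```python
# Training configuration reported in Example 1 (values per the text).
TRAIN_CONFIG = {
    "epochs": 1000,
    "batch_size": 64,
    "actor_lr": 0.0005,
    "critic_lr": 0.001,
    "gamma": 0.99,           # discount factor for future rewards
    "max_reward": 500,
    "activation": "LeakyReLU",
    "recurrent_cell": "LSTM",
}
```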
The effectiveness of the model is evaluated with four indexes: (1) return, the geometric mean of the per-cycle gains; (2) cumulative return, the sum of returns over the training process; (3) standard deviation of periodic fluctuation, the fluctuation of the total gain per iteration cycle; (4) maximum loss rate, the maximum loss rate during the test. The results are shown in Table 3: the A2C algorithm outperforms the PPO algorithm on the return index, the volatility index, and the maximum loss rate, which demonstrates the superiority of the model.
Table 3. Algorithm index comparison (published as an image in the original document).
After modeling is completed, the back-test curves of the model are examined. Figs. 8 and 9 show the cumulative return and the loss rate of the model. The curves show that the cumulative return increases gradually and becomes more stable, the overall loss rate decreases gradually, and the return grows steadily, which is consistent with the overall goal of maximizing vehicle safety. In general, the proposed A2C-based algorithm model performs well and can follow a decision process whose goal is to maximize vehicle safety.
The agent's decision model extracts the feature variables of the environment space into the action space and, according to the human-vehicle risk state, produces an intervention decision result for reducing the total risk RP at time t. A value closer to 1 indicates that the agent supports maintaining the current driving state; otherwise it indicates that the current control state should not be maintained (i.e., a reminder or intervention measure, such as switching the control authority, should be taken). The decision effect under different model parameters is then studied and tested in selected scenarios, and the relatively optimal model parameters are finally obtained to complete the model.
Specifically, the risk levels of the human and vehicle states are extracted by the driving behavior risk monitoring module and the vehicle driving risk monitoring module and input into the human-machine co-driving control right decision model proposed by the invention; the decision model takes the human-vehicle risk state discrimination results as the reference standard of the game and reward-punishment mechanisms and calculates the overall decision value.
To facilitate the statistical description of the decision values and their distribution, the decision value is divided into four intervals: [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1], which serve as the different decision levels.
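A small helper, as an illustration, mapping the continuous decision value to the four intervals just defined; the interval-to-action labels follow the reading made explicit later in the text, namely that values closer to -1 call for stronger intervention.

```python
def decision_level(a_dec):
    """Map the decision value in [-1, 1] to the four decision intervals."""
    assert -1.0 <= a_dec <= 1.0
    if a_dec < -0.5:
        return "control-right switching (high-risk driving state)"
    elif a_dec < 0.0:
        return "medium-risk warning"
    elif a_dec < 0.5:
        return "low-risk warning"
    else:
        return "no warning"
```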
The adjustable model parameters satisfy α ∈ [0,1], β ∈ [0,1] and μ ∈ [0,1]. Initial values are set, the decision value is calculated following a bisection idea, and α, β and μ are iteratively optimized according to the decision effect. The decision weights α and β are set with α smaller than β, so that the decision model outputs decision results that are mainly based on the vehicle risk state while also taking the driver behavior into account. The coefficient μ of the reward function determines the magnitude of the reward function output, which can be tuned by adjusting this parameter. Initial α, β and μ are set, the parameters are traversed and the results are output repeatedly, a relatively ideal decision model is finally obtained, and the decision effect is then demonstrated and analyzed in three driving scenarios.
Several typical decision results from the tests in the following scenarios are selected as cases for discussion. Fig. 10 shows the decision result distribution when α = 0.2, β = 0.8 and μ = 0.5. Four kinds of icons represent the intervals of the decision value, and the size of an icon represents the magnitude of the decision value at that position, so that the distribution of decision results corresponding to the human and vehicle risk level states can be seen intuitively in the figure.
Specifically, the horizontal axis of Fig. 10 is the vehicle risk level and the vertical axis is the driver behavior risk level, each divided into levels 1-6, where level 1 is the highest risk and level 6 is the lowest.
Finally, the decision results corresponding to each parameter combination are obtained by traversal, and the decision result with α = 0.5, β = 0.8 and μ = 0.8 is selected as the final model parameters; the decision effect is shown in Fig. 10. These α and β values achieve a decision effect that is dominated by the vehicle risk while also considering the driving behavior risk state, so the decision value distribution is ideal and the sensitivity is relatively good.
Further, the feasibility of the established human-machine co-driving control right decision model is tested: the monitoring results of the driving behavior risk monitoring module and the vehicle driving risk monitoring module (i.e., the human and vehicle risk levels) are input into the decision model, and the decision model outputs the control right decision results for the different human and vehicle risk levels at every moment, as shown in Fig. 11. Figs. 11(a) and 11(b) show the visualized risk level trends of the human and the vehicle. When the human and vehicle risks are at a high level, Fig. 11(c) shows that the control right switching decision lets the automatic driving system take over the vehicle control right; after switching, the vehicle risk level is suppressed in time and safety is significantly improved (Fig. 11(b)).
In the human-machine co-driving control right switching method provided by the invention, under the special condition that the vehicle is at high risk, the automatic driving system forcibly takes over the driving right; after the safe state of the vehicle is restored by the automatic driving function, the driver can still switch the control right back according to his or her will.
The four intervals of the decision value are used as different early-warning levels. By summarizing the decision results and the human-vehicle risk distribution characteristics, the interval with the lowest decision values is defined as the control right switching instruction for the high-risk driving state, and the remaining three intervals correspond to the medium-risk warning, low-risk warning and no warning, respectively.
After an ideal decision model is obtained, the effectiveness of reducing the driving risk after the control right is switched is further explored with respect to the switching time. The results corresponding to different switching times are traversed by global optimization, and a relatively ideal switching effect is shown in Fig. 11: the decision model outputs a reasonable decision value at each human-vehicle risk state stage and outputs a switching instruction in time when the vehicle is at high risk, and the vehicle risk after the system takes over is reduced promptly. Overall, the control right is effectively and timely switched to the automatic driving system when the human and vehicle states are at high risk, and the safety of the vehicle is improved.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (8)

1. A human-machine co-driving control right decision method based on the human-vehicle risk state, characterized by comprising the following steps carried out in sequence:
step S1, establishing a reinforcement learning reward and punishment mechanism based on the human-vehicle risk state game relationship;
step S1-1, on the basis that the intelligent vehicle has the capability of predicting the driver's risky driving behavior state and the vehicle driving risk state, processing the human-vehicle risk monitoring results with a Markov decision process so that they conform to the operation rules of the reinforcement learning algorithm;
step S1-2, for the reward function setting problem in the reinforcement learning algorithm framework, taking the expected maximum utility theorem as the criterion and the maximization of utility as the goal, proposing a human-vehicle risk state game method based on the complete-information static game;
step S1-3, using the relative distance to the ideal point calculated by the approximate-ideal-solution ranking method (TOPSIS) as the quantification of strategy payoff: based on the index weights calculated by the entropy weight method, taking the driving behavior characterization indexes and time margins of the risk-free driving state as the negative ideal point, extracting the driving behavior characterization indexes and time margin data of the other risk levels, and calculating the relative distances referenced to the negative ideal point, to obtain a utility matrix that accounts for the human-vehicle risk game relationship;
step S2, proposing a human-machine co-driving control right decision method based on the reinforcement learning algorithm framework;
step S2-1, describing the agent's environment interaction mode by means of the time-series characteristics of the Markov decision process, and embedding the human-vehicle risk monitoring results into the reinforcement learning algorithm framework;
step S2-2, taking the reinforcement learning reward and punishment mechanism of step S1 as the reward or punishment reference standard of the decision, proposing a human-machine co-driving control right decision method based on the reinforcement learning algorithm framework, and traversing the decision coefficients and switching time of the model in a global optimization manner so that the model obtains a relatively optimal decision output and the optimal automatic-system switching time is taken into account;
thereby completing the construction of the whole decision model.
2. The human-machine co-driving control right decision method based on the human-vehicle risk state as claimed in claim 1, wherein step S1 specifically comprises the following steps, performed in sequence:
the first step is to construct a standardized assessment matrix, wherein X is a raw data matrix, m is the dimensionality of the indexes, n is the number of the indexes, and X'ijIs a standardized data;
Figure FDA0003678570290000021
Figure FDA0003678570290000022
in the second step, the characteristic proportion of each index is calculated (formula published as an image in the original);
in the third step, the information entropy of each index is calculated (formulas published as images in the original), where p_ij is the characteristic proportion of the index and e_i is the information entropy of the index;
in the fourth step, the weights are calculated from the information entropy redundancy (equation (5), published as an image in the original), where w_j is the weight of the index;
in the fifth step, the intermediate-type transformation is applied to the indexes (formulas published as images in the original);
in the formulas, x_ij is the original datum and the transformed value is the datum after the intermediate regularization; the characterization index of the risk-free driving grade is taken as the control variable and x_L denotes the characterization index of the other risk grades;
in the sixth step, the initial matrix is normalized (formulas published as images in the original), where z_ij is the normalized value of the positively-oriented index, i.e., each element divided by the norm of its column vector;
in the seventh step, the relative distance between each risk level and the negative ideal point is calculated by equation (11) (published, together with its companion formula, as images in the original), where w_j is the entropy weight calculated by equation (5); the relative distances of the driving states of each risk level to the negative ideal point are used to construct the utility matrix of the human-vehicle risk game relationship.
3. The human-machine co-driving control right decision method based on the human-vehicle risk state as claimed in claim 2, wherein, when the utility matrix of the human-vehicle risk game relationship is constructed in the seventh step, ρ and σ are set as the utility values of the human risk state and the vehicle risk state respectively, u_human(σ_t) and u_vehicle(σ_t) are the expected utility functions of the driving behavior risk state and the vehicle driving risk state respectively, and q is the strategy probability;
when the driving behavior risk state is RP_i^human, the expected payoff of the vehicle risk state is u_vehicle(σ_i) = q·σ_i + (1-q)·σ_(i+1); when the driving behavior risk state is RP_j^human, the expected payoff of the vehicle risk state is u_vehicle(σ_j) = q·σ_j + (1-q)·σ_(j+1);
in order for the vehicle risk state to have a stable utility under any driving behavior risk state, the utility function U(σ) is calculated according to equation (12) (published as an image in the original), where the utility values σ_i, σ_(i+1), σ_j and σ_(j+1) are calculated by equation (11).
4. The human-machine co-driving control right decision method based on the human-vehicle risk state as claimed in claim 3, wherein step S2 specifically comprises the following steps, performed in sequence:
in the first step, in each time step t, the agent receives a vector s_t representing the environmental situation, formulates an action strategy according to s_t, and generates an action A_t under the guidance of that strategy; based on A_t, the reward function generates a reward r_(t+1), which is given to the agent at the next moment, and the agent iteratively updates to s_(t+1); this process is repeated to form the agent's interaction trajectory τ = [s_0, A_0, R_1, s_1, A_1, R_2, …, s_n, A_n, R_(n+1)]; at any time t, the goal of the agent is to maximize the expectation E(G_t) of the total reward G_t (equation (13), published as an image in the original), i.e., to maximize vehicle safety;
the discount coefficient γ (0 ≤ γ ≤ 1) distinguishes the importance of immediate and future rewards: when γ is close to 0, the agent pays more attention to the current immediate reward; when γ is close to 1, the agent focuses more on future rewards when making decisions; R_k is the reward at time k;
in the second step, the vehicle's overall safety information pool serves as the reinforcement learning environment, and the agent extracts data with time-series characteristics from the environment for iterative computation;
in the third step, taking the advantage actor-critic algorithm of reinforcement learning as an example, a decision model is established;
in the fourth step, the agent extracts the feature variables of the environment space into the action space and, according to the human-vehicle risk state, produces a risk-reducing decision result; the decision result is a continuous value between -1 and 1: the closer to 1, the more the agent supports maintaining the current driving state; otherwise, measures such as early warning or control right switching should be taken;
in the fifth step, the expected utility function is used to measure the effect of comprehensive driving safety, as in equation (21), where U(σ_t) is the utility function supported by the agent's reward function and is calculated by equation (12); this is converted into the reinforcement learning reward function R_reward, as in equation (22) (published as an image in the original);
E[U(σ_t)] = E[U(σ_0 + Σ δ·σ_t)]    (21)
where μ ∈ [0,1] is the coefficient of the reward function and one of the key parameters for adjusting the decision effect; the value of μ determines the magnitude of the reward R_reward; ΔRP_t^human and ΔRP_t^vehicle are, respectively, the differences between the current human and vehicle risk levels and those of the previous state.
5. The human-machine co-driving control right decision method based on the human-vehicle risk state as claimed in claim 4, wherein, in the second step, the agent needs to collect the following features:
the regularized risky driving behavior characterization indexes [p_1, p_2, p_3, …, p_n];
the risk assessment grade sequence [r_1, r_2, …, r_n], collected by the risky driving behavior prediction model;
the output of the risky driving behavior prediction model at a given moment is denoted RP_t^human, the vehicle risk grade is denoted RP_t^vehicle, and RP is the comprehensive risk of the vehicle; the relationship among the three is:
RP = α·RP^human + β·RP^vehicle    (14)
where α and β are the decision weights corresponding to the human and vehicle states, α ∈ [0,1], β ∈ [0,1]; since the decision model pays more attention to the vehicle risk during driving than to the driver behavior risk, α is set smaller than β, and the values of α and β determine the final decision effect.
6. The human-machine co-driving control right decision method based on the human-vehicle risk state, wherein the decision model in the third step comprises two modules:
module one: the output of the Actor network is the human-machine co-driving control right decision result, i.e., the action vector of the action space, which contains decisions on the driver behavior and the vehicle risk state respectively;
module two: the Critic network is used to judge the effect of the decision result in the given environment.
7. The human-machine co-driving control right decision method based on the human-vehicle risk state as claimed in claim 6, wherein modules one and two use an LSTM neural network to process the sequence features extracted by the agent, and the Leaky Rectified Linear Unit is used as the activation function; the activation functions of the Actor network and the Critic network are expressed as equations (15) and (16) respectively (published as images in the original).
8. The human-machine co-driving control right decision method based on the human-vehicle risk state, wherein the monitoring results of the risky driving behavior prediction model and the vehicle risk level prediction model are introduced into the Critic network to train the A2C model, and the goal of the Actor network is to maximize the objective function:
J(θ) = E[log π(A|s, θ)·A_adv(s, A)]    (17)
A_adv(s_t, A_t) = R_t + γ·V(s_(t+1)|w) - V(s_t|w)    (18)
where A_adv(s, A) is the advantage function; to calculate the advantage function, V(s|w) is computed by the Critic network, and the Critic network is optimized by minimizing the temporal-difference error (TD error); a positive TD error indicates that the action output by the Actor's strategy was good, and a negative one indicates it was bad; the Actor adjusts its strategy for the next round according to this information;
TD_error = r + γ·V(s_(t+1)) - V(s_t)    (19)
J(w) = [R_t + γ·V(s_(t+1)|w) - V(s_t|w)]²    (20).
CN202110848303.9A 2021-07-27 2021-07-27 Man-machine driving-sharing control right decision method based on man-vehicle risk state Active CN113335291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110848303.9A CN113335291B (en) 2021-07-27 2021-07-27 Man-machine driving-sharing control right decision method based on man-vehicle risk state

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110848303.9A CN113335291B (en) 2021-07-27 2021-07-27 Man-machine driving-sharing control right decision method based on man-vehicle risk state

Publications (2)

Publication Number Publication Date
CN113335291A CN113335291A (en) 2021-09-03
CN113335291B true CN113335291B (en) 2022-07-08

Family

ID=77480389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110848303.9A Active CN113335291B (en) 2021-07-27 2021-07-27 Man-machine driving-sharing control right decision method based on man-vehicle risk state

Country Status (1)

Country Link
CN (1) CN113335291B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763723B (en) * 2021-09-06 2023-01-17 武汉理工大学 Traffic signal lamp control system and method based on reinforcement learning and dynamic timing
CN113911140B (en) * 2021-11-24 2022-09-27 无锡物联网创新中心有限公司 Man-machine co-driving control method based on non-cooperative game and related device
CN114141040B (en) * 2021-11-30 2023-02-21 燕山大学 Signal lamp passing redundancy system used in intelligent networked vehicle cruise mode
CN115071758B (en) * 2022-06-29 2023-03-21 杭州电子科技大学 Man-machine common driving control right switching method based on reinforcement learning
CN115494849A (en) * 2022-10-27 2022-12-20 中国科学院电工研究所 Navigation control method and system for automatic driving vehicle
CN117227834B (en) * 2023-11-10 2024-01-30 中国矿业大学 Man-machine cooperative steering control method for special vehicle

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109976340A (en) * 2019-03-19 2019-07-05 中国人民解放军国防科技大学 Man-machine cooperation dynamic obstacle avoidance method and system based on deep reinforcement learning
KR20190109720A (en) * 2019-09-06 2019-09-26 엘지전자 주식회사 Method and apparatus for driving guide of vehicle
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning
CN112721943A (en) * 2021-01-20 2021-04-30 吉林大学 Man-machine co-driving transverse control method with conflict resolution function
CN113200056A (en) * 2021-06-22 2021-08-03 吉林大学 Incomplete information non-cooperative game man-machine co-driving control method
CN113602284A (en) * 2021-07-30 2021-11-05 东风柳州汽车有限公司 Man-machine common driving mode decision method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11460842B2 (en) * 2017-08-28 2022-10-04 Motional Ad Llc Mixed-mode driving of a vehicle having autonomous driving capabilities
US10845815B2 (en) * 2018-07-27 2020-11-24 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109976340A (en) * 2019-03-19 2019-07-05 中国人民解放军国防科技大学 Man-machine cooperation dynamic obstacle avoidance method and system based on deep reinforcement learning
KR20190109720A (en) * 2019-09-06 2019-09-26 엘지전자 주식회사 Method and apparatus for driving guide of vehicle
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning
CN112721943A (en) * 2021-01-20 2021-04-30 吉林大学 Man-machine co-driving transverse control method with conflict resolution function
CN113200056A (en) * 2021-06-22 2021-08-03 吉林大学 Incomplete information non-cooperative game man-machine co-driving control method
CN113602284A (en) * 2021-07-30 2021-11-05 东风柳州汽车有限公司 Man-machine common driving mode decision method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Human-machine co-driving control right decision method based on the human-vehicle risk state (基于人-车风险状态的人机共驾控制权决策方法); 郭柏苍 et al.; China Journal of Highway and Transport (中国公路学报); 2022-03-31; pp. 153-165 *

Also Published As

Publication number Publication date
CN113335291A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113335291B (en) Man-machine driving-sharing control right decision method based on man-vehicle risk state
CN111915059B (en) Attention mechanism-based Seq2Seq berth occupancy prediction method
CN111695731B (en) Load prediction method, system and equipment based on multi-source data and hybrid neural network
Lin et al. An ensemble learning velocity prediction-based energy management strategy for a plug-in hybrid electric vehicle considering driving pattern adaptive reference SOC
CN114015825B (en) Method for monitoring abnormal state of blast furnace heat load based on attention mechanism
CN116341901B (en) Integrated evaluation method for landslide surface domain-monomer hazard early warning
CN112101684A (en) Plug-in hybrid electric vehicle real-time energy management method and system
CN113255900A (en) Impulse load prediction method considering improved spectral clustering and Bi-LSTM neural network
CN112150304A (en) Power grid running state track stability prejudging method and system and storage medium
CN113902129A (en) Multi-mode unified intelligent learning diagnosis modeling method, system, medium and terminal
CN112949931A (en) Method and device for predicting charging station data with hybrid data drive and model
CN114548494B (en) Visual cost data prediction intelligent analysis system
CN115376103A (en) Pedestrian trajectory prediction method based on space-time diagram attention network
CN112036598A (en) Charging pile use information prediction method based on multi-information coupling
CN116542701A (en) Carbon price prediction method and system based on CNN-LSTM combination model
CN111080000A (en) Ultra-short term bus load prediction method based on PSR-DBN
CN114580262A (en) Lithium ion battery health state estimation method
CN116662815B (en) Training method of time prediction model and related equipment
CN117437507A (en) Prejudice evaluation method for evaluating image recognition model
CN117406100A (en) Lithium ion battery remaining life prediction method and system
CN116644562B (en) New energy power station operation and maintenance cost evaluation system
CN116946183A (en) Commercial vehicle driving behavior prediction method considering driving capability and vehicle equipment
Wu et al. Transformer-based traffic-aware predictive energy management of a fuel cell electric vehicle
CN115471009A (en) Predictive optimized power system planning method
Zhang et al. A Comparative Study of Vehicle Velocity Prediction for Hybrid Electric Vehicles Based on a Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant