CN109752952B - Method and device for acquiring multi-dimensional random distribution and strengthening controller


Info

Publication number
CN109752952B
CN109752952B
Authority
CN
China
Prior art keywords
control instruction
control
instruction
average
samples
Prior art date
Legal status
Active
Application number
CN201711091328.9A
Other languages
Chinese (zh)
Other versions
CN109752952A (en
Inventor
陈晨
钱俊
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201711091328.9A priority Critical patent/CN109752952B/en
Publication of CN109752952A publication Critical patent/CN109752952A/en
Application granted granted Critical
Publication of CN109752952B publication Critical patent/CN109752952B/en

Landscapes

  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)
  • Feedback Control In General (AREA)

Abstract

A method of obtaining a multi-dimensional random distribution for reinforcing a controller model, the method comprising: acquiring historical driving data, wherein the historical driving data comprises bottom layer control instructions for executing target basic actions; processing the historical driving data to obtain a plurality of control instruction samples, wherein each control instruction sample in the plurality of control instruction samples is a control instruction sequence which is formed by bottom layer control instructions with time sequence and is used for executing the target basic action; obtaining an average control instruction according to the plurality of control instruction samples, wherein the average control instruction is used for indicating a control instruction sequence which is composed of bottom layer control instructions with control values at an average level and with time sequence and is used for executing the target basic action; and obtaining multi-dimensional random distribution according to the average control instruction and the plurality of control instruction samples, wherein the multi-dimensional random distribution is an expected function distribution which is disturbed in a certain range around the average control instruction.

Description

Method and device for acquiring multi-dimensional random distribution and strengthening controller
Technical Field
The application relates to the field of automatic driving of automobiles, in particular to a method and a device for controlling unmanned driving.
Background
In the field of automatic driving, making an unmanned vehicle perform basic actions the way a human driver does, such as going straight along the road axis, changing lanes, overtaking, following and parking, poses a huge challenge. In unmanned vehicle planning and control, the task of controlling the unmanned vehicle to perform such basic actions is called the motion planner.
Because real road conditions are complex and changeable, and the bottom layer control instructions need to be very fine-grained, the technical difficulty lies in meeting safety requirements, responding flexibly to road conditions, and ensuring that the motion planner completes its task stably and comfortably. The prior art can be classified into three types: rule-based behavior decision, constraint-based schemes, and learning-based schemes; all three have limitations and currently cannot complete the motion planner task well. Rule-based behavior decision schemes generally design control instructions under the assumption that the other vehicles on the road travel at constant speed; such control instructions cannot respond flexibly to complex road conditions, so the applicable scenarios are relatively limited. Constraint-based schemes, such as state-space and optimization based approaches, typically require accurate models of the vehicle and the surrounding environment, which are difficult to obtain and very complex in practice; if an approximate model is used, errors and uncertainty of the model limit the effectiveness of the scheme.
Most learning-based schemes, especially reinforcement learning schemes, target high-level decisions, that is, decisions about which task to execute at which time. Schemes that generate bottom layer control instructions are very limited: lacking a random distribution that can improve the convergence speed of the model, free exploration makes training very difficult and robustness very poor.
Disclosure of Invention
In order to solve the above technical problems in the prior art, the present application provides a method for obtaining a multi-dimensional random distribution and a method for reinforcing a controller model through reinforcement learning.
In a first aspect, the present application provides a method for obtaining a multi-dimensional random distribution, the multi-dimensional random distribution being used for reinforcing a controller model, the method comprising: acquiring historical driving data, where the historical driving data comprises bottom layer control instructions for executing a target basic action, the target basic action is any one of lane changing, overtaking, car following, parking and going straight along the road axis, and a bottom layer control instruction comprises one or more of an acceleration parameter, a steering angle parameter and a brake parameter; processing the historical driving data to obtain a plurality of control instruction samples, where each of the plurality of control instruction samples is a time-ordered control instruction sequence composed of bottom layer control instructions for executing the target basic action; obtaining an average control instruction according to the plurality of control instruction samples, where the average control instruction indicates a time-ordered control instruction sequence for executing the target basic action whose bottom layer control instructions have control values at an average level, and where each control value of the average control instruction is the average of the control values of the corresponding bottom layer control instructions over all control instruction samples (for example, if the first-timestep control values of the control instruction samples average to 4, the first-timestep bottom layer control instruction of the average control instruction is 4); and obtaining the multi-dimensional random distribution according to the average control instruction and the plurality of control instruction samples, where the multi-dimensional random distribution is an expected function distribution that is perturbed within a certain range around the average control instruction. Optionally, the historical driving data may be obtained during real vehicle driving, during simulated driving, or from public data sources or third parties. Using a multi-dimensional random distribution obtained from historical driving data as the free exploration strategy for reinforcing the controller model matches the intuition of human learning and reduces the exploration space during training of the controller model, thereby improving the convergence speed of the controller model.
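A minimal sketch of how the average control instruction could be computed from equal-length samples is given below; the array shape and the function name are illustrative assumptions, not part of the patent.

import numpy as np

def average_control_instruction(samples: np.ndarray) -> np.ndarray:
    """Compute the average control instruction from equal-length samples.

    samples: array of shape (N, T), where N is the number of control
    instruction samples and T is the instruction length; each row is a
    time-ordered sequence of bottom-layer control values (e.g. steering
    angles) for one execution of the target basic action.
    Returns the (T,) average control instruction: the per-timestep mean.
    """
    return samples.mean(axis=0)

# Example: the first-timestep values of the samples are averaged into the
# first-timestep value of the average control instruction.
samples = np.array([
    [5.0, 6.0, 7.0],
    [3.0, 5.0, 6.0],
    [4.0, 4.0, 5.0],
])
print(average_control_instruction(samples))  # [4. 5. 6.]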
In a possible implementation manner of the first aspect, before obtaining the average control instruction according to the plurality of control instruction samples, the method further includes: counting the instruction length of the control instruction sequence of each of the plurality of control instruction samples, where the instruction length indicates the number of time-ordered bottom layer control instructions; calculating an average instruction length according to the instruction lengths of the control instruction sequences of the plurality of control instruction samples, where the average instruction length indicates an average value, a median or a maximum value of those instruction lengths; processing the instruction length of the control instruction sequence of each of the plurality of control instruction samples into the average instruction length; and obtaining the average control instruction according to the plurality of control instruction samples after the instruction length processing, where the instruction length of the average control instruction is the average instruction length. The obtaining of the multi-dimensional random distribution according to the average control instruction and the plurality of control instruction samples then includes: obtaining the multi-dimensional random distribution according to the average control instruction whose instruction length is the average instruction length and the plurality of control instruction samples after the instruction length processing. Because the collected driving data is diverse, the instruction lengths of the control instruction samples extracted for a given basic action differ; control instruction samples of different instruction lengths can be processed to the same length by growing or clipping, for example to the average value, the median, or the maximum value of their lengths. Taking the average value as an example: if the instruction lengths of 5 control instruction samples are 12, 15, 12, 10 and 11, the average value is 12, so the instruction lengths of all control instruction samples can be processed into 12 by growing or clipping, and the instruction length of the average control instruction obtained from those samples is then also 12. Processing the instruction lengths of the control instruction samples to the same length gives a better step-by-step correspondence for computing the average control instruction.
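A minimal sketch of the grow-or-clip length normalization described above; the assumption that growing repeats the last control value and clipping truncates the tail is illustrative, not mandated by the patent.

import numpy as np

def normalize_length(sample: np.ndarray, target_len: int) -> np.ndarray:
    """Grow or clip one control instruction sample to target_len steps."""
    cur_len = len(sample)
    if cur_len >= target_len:
        return sample[:target_len]                      # clip the tail
    pad = np.full(target_len - cur_len, sample[-1])     # grow by repeating the last value
    return np.concatenate([sample, pad])

# Example: lengths 12, 15, 12, 10, 11 -> average length 12, so every sample
# is processed into 12 time-ordered control values.
lengths = [12, 15, 12, 10, 11]
avg_len = int(round(np.mean(lengths)))                  # 12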
In a possible implementation manner of the first aspect, the control instruction sequence of each of the plurality of control instruction samples includes at least one control instruction string, each control instruction string in the at least one control instruction string is composed of bottom control instructions of the same type with time sequences, the number of the bottom control instructions of each control instruction string in the at least one control instruction string is equal and the time sequences correspond to each other, and the instruction length is used to indicate the number of the bottom control instructions of any control instruction string in the at least one control instruction string.
According to the method for obtaining the multi-dimensional random distribution, the obtained multi-dimensional random distribution is used for strengthening the controller model, and the exploration space in the process of designing the controller model can be reduced, so that the convergence speed of the controller model is improved.
In a second aspect, the present application provides a method for reinforcing a controller model, which is a reinforcement learning process, the method including: generating a first control instruction according to a multi-dimensional random distribution, where the multi-dimensional random distribution is obtained through the first aspect or any possible implementation manner of the first aspect (see the first aspect for the specific obtaining method); acquiring current road condition state data and inputting the current road condition state data into the controller model to generate a second control instruction; determining the first control instruction or the second control instruction as the actual control instruction, where the actual control instruction is used to control a target vehicle to execute a target basic action, and the target basic action is any one of lane changing, overtaking, car following, parking and going straight along the road axis; controlling the target vehicle to execute the target basic action according to the actual control instruction; obtaining a return parameter value according to the road condition data after the target vehicle executes the target basic action; and correcting the control parameter values of the controller model according to the return parameter value.
In a possible implementation manner of the second aspect, the determining that the first control instruction or the second control instruction is an actual control instruction includes: and randomly determining the first control instruction or the second control instruction as the actual control instruction according to a probability, wherein the first control instruction is determined as the actual control instruction according to a first probability, and the second control instruction is determined as the actual control instruction according to a second probability.
Optionally, the sum of the first probability and the second probability is equal to 1.
Further, the first probability becomes smaller as the number of repetitions increases, and the second probability becomes larger, so that the control instruction generated by the controller model is increasingly likely to be determined as the actual control instruction. In practice, when the control instruction generated by the controller model is determined to be the actual control instruction to a sufficient degree, for example when the second probability reaches 0.9 or 1, the reinforcement of the controller model may be considered complete and the reinforcement learning process need not be repeated further.
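One common way to realize the shrinking first probability is a decaying exploration schedule; the decay form and constants below are illustrative assumptions only.

def exploration_probability(iteration: int,
                            p_start: float = 1.0,
                            p_end: float = 0.1,
                            decay: float = 0.995) -> float:
    """First probability (choosing the randomly sampled instruction);
    it shrinks as the number of reinforcement iterations grows."""
    return max(p_end, p_start * (decay ** iteration))

# The second probability is 1 minus the first probability, so the
# controller-generated instruction is chosen more and more often.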
It should be noted that the process of reinforcing the controller model may also be a process of controlling the vehicle by using the controller model, and the control parameters of the controller model are continuously optimized and corrected in the using process; optionally, the controller model may be enhanced in the real vehicle driving process, or the controller model may be enhanced by simulating vehicle driving with a computer.
The method for reinforcing the controller model constructs an exploration strategy from the multi-dimensional random distribution and uses it to reinforce the controller model, which reduces the exploration space during the design/training of the controller model and thereby improves its convergence speed.
In a third aspect, the present application provides an apparatus for obtaining a multi-dimensional random distribution, the multi-dimensional random distribution being used for reinforcing a controller model, the apparatus comprising: a data acquisition module, configured to acquire historical driving data, where the historical driving data comprises bottom layer control instructions for executing a target basic action; a first processing module, configured to process the historical driving data acquired by the data acquisition module to obtain a plurality of control instruction samples, where each of the plurality of control instruction samples is a time-ordered control instruction sequence composed of bottom layer control instructions for executing the target basic action; a first calculation module, configured to calculate an average control instruction according to the plurality of control instruction samples obtained by the first processing module, where the average control instruction indicates a time-ordered control instruction sequence composed of bottom layer control instructions with control values at an average level; and a second calculation module, configured to calculate the multi-dimensional random distribution according to the average control instruction obtained by the first calculation module and the plurality of control instruction samples obtained by the first processing module, where the multi-dimensional random distribution is an expected function distribution that is perturbed within a certain range around the average control instruction. The bottom layer control instruction comprises one or more of an acceleration parameter, a steering angle parameter and a braking parameter.
A possible implementation manner of the third aspect further includes: a statistics module, configured to count the instruction length of the control instruction sequence of each of the plurality of control instruction samples obtained by the first processing module, where the instruction length indicates the number of time-ordered bottom layer control instructions; a third calculation module, configured to calculate an average instruction length according to the instruction lengths counted by the statistics module, where the average instruction length indicates an average value, a median, or a maximum value of the instruction lengths of the control instruction sequences of the plurality of control instruction samples; and a second processing module, configured to process the instruction length of the control instruction sequence of each of the plurality of control instruction samples obtained by the first processing module into the average instruction length obtained by the third calculation module. The first calculation module is specifically configured to calculate the average control instruction according to the plurality of control instruction samples after the instruction length processing obtained by the second processing module, where the instruction length of the average control instruction is the average instruction length. The second calculation module is specifically configured to obtain the multi-dimensional random distribution according to the average control instruction obtained by the first calculation module and the plurality of control instruction samples after the instruction length processing obtained by the second processing module.
In a possible implementation manner of the third aspect, the control instruction sequence of each control instruction sample in the plurality of control instruction samples includes at least one control instruction string, each control instruction string in the at least one control instruction string is composed of bottom layer control instructions of the same type with time sequences, and the bottom layer control instructions of each control instruction string in the at least one control instruction string are equal in number and correspond in time sequence; the instruction length is used for indicating the number of the bottom layer control instructions of any control instruction string in the at least one control instruction string.
In a fourth aspect, the present application provides an apparatus for reinforcing a controller model, the apparatus comprising: a first action generating unit, configured to generate a first control instruction according to the multi-dimensional random distribution; an environment unit, configured to acquire current road condition state data and input the current road condition state data into the controller unit; the controller unit, configured to receive the current road condition state data input by the environment unit and generate a second control instruction; a determining unit, configured to determine that the first control instruction generated by the first action generating unit or the second control instruction generated by the controller unit is the actual control instruction, where the actual control instruction is used to control a target vehicle to execute the target basic action, and the target basic action is any one of lane changing, overtaking, car following, parking, and going straight along the road axis; an execution unit, configured to control the target vehicle to execute the target basic action according to the actual control instruction determined by the determining unit; a return parameter calculating unit, configured to obtain a return parameter value according to the road condition data after the execution unit controls the target vehicle to execute the target basic action; and a correcting unit, configured to correct the control parameter values of the controller unit according to the return parameter value obtained by the return parameter calculating unit.
In a possible implementation manner of the fourth aspect, the determining unit is specifically configured to: randomly determine the first control instruction generated by the first action generating unit or the second control instruction generated by the controller unit as the actual control instruction according to a probability, where the first control instruction is determined as the actual control instruction with a first probability, and the second control instruction is determined as the actual control instruction with a second probability.
A possible implementation manner of the fourth aspect further includes: the probability updating unit is used for updating the first probability and the second probability, wherein the first probability is smaller and smaller as the correction times increase, and the second probability is larger and larger as the correction times increase.
In a fifth aspect, the present application provides an automobile comprising: the apparatus for obtaining a multi-dimensional random distribution described in the third aspect or any possible implementation manner of the third aspect, and/or the apparatus for reinforcing a controller model described in the fourth aspect or any possible implementation manner of the fourth aspect.
In a sixth aspect, the present application provides a storage medium having stored therein programmable instructions that, when run on a computer, cause the computer to perform the method for obtaining a multi-dimensional random distribution described in the first aspect or any possible implementation manner of the first aspect, and/or the method for reinforcing a controller model described in the second aspect or any possible implementation manner of the second aspect.
In a seventh aspect, the present application provides a computing device comprising at least one processor and at least one memory, where the at least one memory stores programmable instructions, and the at least one processor invokes the programmable instructions to perform the method for obtaining a multi-dimensional random distribution described in the first aspect or any possible implementation manner of the first aspect, and/or the method for reinforcing a controller model described in the second aspect or any possible implementation manner of the second aspect.
Reinforcing the controller model with the multi-dimensional random distribution obtained by the above method for obtaining a multi-dimensional random distribution reduces the exploration space during the design/training of the controller model, thereby improving the convergence speed of training the controller model.
Drawings
FIG. 1 is a schematic diagram of a sequence of control instructions for performing a specified basic action according to the present application;
FIG. 2 is a system architecture diagram presented herein;
FIG. 3 is a flow chart of a method for obtaining a multi-dimensional random distribution according to the present application;
FIG. 4 is a flow chart of a method for obtaining a multi-dimensional random distribution according to the present application;
FIG. 5 is a flow chart of a method of reinforcing a controller model according to the present application;
FIG. 6 is a schematic diagram of a computing device presented herein;
FIG. 7 is a schematic diagram of an apparatus for obtaining a multi-dimensional random distribution according to the present application;
FIG. 8 is a schematic diagram of a computing device presented herein;
FIG. 9 is a schematic diagram of an apparatus for enhancing a controller model according to the present application;
FIG. 10 is a schematic diagram of a reinforcement learning system according to the present application;
FIG. 11 is a flow chart of a method of reinforcing a controller according to the present application;
FIG. 12 is a block diagram of an actor-critic model presented in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
In the technical field of motion planners, a controller generates bottom layer control instructions to control an unmanned vehicle to execute basic actions. For example, the controller may generate a bottom layer control instruction specifying a right steering angle of 15 degrees; inputting this instruction into the steering actuator controls the unmanned vehicle to steer right. How to obtain a controller that generates bottom layer control instructions more accurately is therefore a key problem in this technical field, and it is this problem with obtaining controllers in the prior art that the present application addresses.
The present application provides a statistical-analysis-based method for reinforcing a controller model, used to train/design a controller model for an unmanned vehicle. A controller obtained with the method provided in this application can generate bottom layer control instructions in real time according to the state of the unmanned vehicle and its surrounding environment, where the bottom layer control instructions include acceleration, steering angle, braking and the like, and thereby control the vehicle to execute corresponding basic actions, including lane changing, overtaking, car following, automatic parking and the like. The method mainly involves three parts: data acquisition, statistical data analysis, and reinforcement learning.
In the data acquisition part, a batch of previously generated driving data is collected offline or online. The driving data contains control instruction data for executing a certain specific basic action, which may be called the target basic action; that is, the driving data contains bottom layer control instruction data for executing the target basic action. The driving data may come from a public data source, from data recorded while driving a real vehicle, or from data recorded in a driving simulation; the specific acquisition route is not limited and is described in detail below. In general, the collected raw data covers a long period of driving and may mix multiple basic actions (for example, a lane change occurs after an hour of going straight, or left and right lane changes alternate), so the driving data for executing the target basic action must be extracted to form a set of finite-length, time-ordered control instruction sequences for executing the target basic action.
The sample collection part is a process of collecting effective control instruction empirical data for a specified basic action. Three possible acquisition methods are described in detail below:
Collecting driving data from public data sets: existing public unmanned driving data sets include the Oxford dataset, the Udacity dataset, the KITTI dataset, and so on. The driving data provided by such public data sets generally includes pictures taken by the vehicle-mounted camera, data from multiple sensors, the corresponding control instructions, and other useful information. First, each frame of image data is converted by image processing into state information, such as affordance information, that can be used to judge the state of the unmanned vehicle; then, specific rules are applied to the state information over multiple consecutive steps to locate the start and end timestamps of the specified basic action; finally, several control instruction sequence segments that execute the specified basic action are intercepted from the long continuous control instruction sequence according to those timestamps.
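A minimal sketch of the final interception step, assuming the control instructions are stored with per-step timestamps; the record layout is an assumption for illustration.

def extract_segments(timestamps, controls, events):
    """Cut control instruction segments for the specified basic action.

    timestamps: per-step timestamps of a long driving log.
    controls:   bottom-layer control instructions aligned with timestamps.
    events:     list of (start_ts, end_ts) pairs located by the rules that
                inspect consecutive state information.
    Returns one time-ordered control instruction sequence per event.
    """
    segments = []
    for start_ts, end_ts in events:
        seg = [c for t, c in zip(timestamps, controls) if start_ts <= t <= end_ts]
        segments.append(seg)
    return segments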
Collecting valid data by repeatedly executing the specified action with a real vehicle and recording the corresponding control instructions: when executing the specified action, the real vehicle may be driven by a human driver or driven autonomously by an existing rule-based algorithm. A recording module is added to the computer system of the moving vehicle; when the driver, or the unmanned vehicle with embedded rules, executes the specified basic action, the recording module records the corresponding start timestamp, so that several control instruction sequence segments for executing the specified basic action can be extracted from the real vehicle's driving data. Because the real vehicle can be controlled manually, the specified action can be executed repeatedly and a large number of samples can be obtained.
Collecting data in a simulation environment: select a simulator that closely resembles the human driving environment, such as the Scaner platform or the GTAV platform, have the agent repeatedly execute the specified basic action, and collect the corresponding control instruction sequences. The algorithm that executes the specified action can be the built-in player algorithm of the simulation platform, or a hand-written rule algorithm controlled by the developer through the simulation platform interface.
Fig. 1 is a schematic diagram of a control instruction sequence for executing a specific basic action.
In the statistical data analysis part, the multi-dimensional random distribution used to reinforce the controller model is obtained from the collected driving data. Specifically, the set of control instruction sequences for executing the target basic action obtained by the data acquisition part is first processed, growing and clipping the sequences according to certain rules so that the processed control instruction sequences all have the same length; then a multi-dimensional random distribution perturbed around the average control instruction sequence is constructed statistically from the equal-length control instruction sequences, where the average control instruction indicates the time-ordered control instruction sequence for executing the target basic action whose bottom layer control instructions have control values at an average level. How the average control instruction is obtained is described further below and is not repeated here.
In the reinforcement learning part, the state space, action space and return function required by the reinforcement learning algorithm are designed first. The quantities characterizing the state may include first-person-view images, road network data, sensor information, abstracted intermediate state quantities, and the like, and correspond to the road condition state data. The quantities characterizing the actions correspond to the bottom layer control instructions of the unmanned vehicle and include one or more of parameters such as steering angle, acceleration and throttle; they may be discrete or continuous. The training algorithm is chosen according to the actual situation; for example, the DQN algorithm can be selected for a discrete action space, and the DDPG algorithm for a continuous control strategy. It should be noted that the driving involved in the reinforcement learning system may be an agent driving in a simulator or a real vehicle driving on an actual road. After reinforcement learning training, a controller model is obtained that takes the features of the driving road conditions as input and outputs bottom layer control instructions.
FIG. 2 shows a possible system architecture provided by an embodiment of the present application. The vehicle-mounted computer system is used to collect driving data and may include a communication component, a storage system and a CPU; these components cooperate to collect the vehicle control data generated while the moving vehicle is driving. Recording the control data of a moving vehicle is one possible way of acquiring the data; the vehicle-mounted computer system may also run on a computing device that acquires data by simulating vehicle driving. The vehicle-mounted computer system may further cooperate with the sensing devices and cameras mounted on the moving vehicle to acquire the required road condition state; note that acquiring the road condition state may also be completed by other vehicle-mounted computer systems or devices on the moving vehicle in cooperation with the sensing devices and cameras, which is not described in detail here. The data analysis server is used to calculate the multi-dimensional random distribution from the collected driving data; it includes a communication component, a storage system and a CPU, where the CPU and the storage system cooperate during operation and the generated data can be stored in the storage system. The method for calculating the multi-dimensional random distribution from the collected driving data is described in the embodiments below. The training server is used to support and implement reinforcement learning; it includes a communication component, a storage system, a CPU and a GPU, where the CPU and the storage system cooperate during operation, the process data and the controller model are stored in the storage system of the training server, and the GPU is used to analyze the input image data while training the controller model. Likewise, the work of the training server may also be performed by the vehicle-mounted computer system used to collect the data or by other computer systems on the moving vehicle. The data acquisition computer system, the data analysis server and the training server communicate through their respective communication components.
The system architecture shown in fig. 2 is one possible system architecture given in the present application, and in practice, the data acquisition computer system, the data analysis server, and the training server in fig. 2 may be replaced by a computer system that runs on a vehicle.
The method provided in this application aims to solve the problems of existing unmanned vehicle motion planner design schemes. It does not require manually written rules: the unmanned vehicle adapts to various complex road conditions through autonomous exploration and learning, and learns strategic behaviors for complex road conditions whose rules cannot be clearly defined. The method does not require any dynamics model or model of the surrounding environment; it only needs easily acquired observations (such as first-person-view images or the distance of the unmanned vehicle to the lane lines) and can learn a reasonable action strategy through interaction between the unmanned vehicle and the environment. By statistically analyzing a large amount of real data, the method calculates an average control instruction that can execute the target basic action and then explores freely around that average control instruction, improving and relearning on the basis of existing driving experience; this matches the intuition of human learning, greatly reduces the exploration space, and improves the convergence speed of the model. Moreover, because the model is learned from real data, the learned model is easy to transfer to a real vehicle.
An embodiment of the present application provides a method for obtaining a multi-dimensional random distribution, where the obtained multi-dimensional random distribution is used for reinforcing a controller model. As shown in FIG. 3, the method includes:
S101. Acquire historical driving data, where the historical driving data includes bottom layer control instructions for executing a target basic action; optionally, the target basic action is any one of lane changing, overtaking, car following, parking and going straight along the road axis, and a bottom layer control instruction includes one or more of an acceleration parameter, a steering angle parameter and a braking parameter.
S102. Process the historical driving data to obtain a plurality of control instruction samples, where each of the plurality of control instruction samples is a time-ordered control instruction sequence composed of bottom layer control instructions for executing the target basic action. Furthermore, the control instruction sequence of each control instruction sample includes at least one control instruction string, each control instruction string is composed of time-ordered bottom layer control instructions of the same type, the control instruction strings have equal numbers of bottom layer control instructions with corresponding time ordering, and the instruction length indicates the number of bottom layer control instructions of any one of the control instruction strings.
S103. Obtain an average control instruction according to the plurality of control instruction samples, where the average control instruction indicates a time-ordered control instruction sequence for executing the target basic action whose bottom layer control instructions have control values at an average level.
S104. Obtain the multi-dimensional random distribution according to the average control instruction and the plurality of control instruction samples, where the multi-dimensional random distribution is an expected function distribution that is perturbed within a certain range around the average control instruction; a vector sampled from this distribution is, in a statistical sense, not far from the values of the average control instruction, which ensures that the subsequent free exploration for reinforcing the controller is a fine adjustment on the basis of average human behavior and does not degenerate into meaningless exploration that deviates from basic human behavior.
Optionally, before S103, as shown in FIG. 4, the method further includes:
S105. Count the instruction length of the control instruction sequence of each of the plurality of control instruction samples, where the instruction length indicates the number of time-ordered bottom layer control instructions.
S106. Calculate an average instruction length according to the instruction lengths of the control instruction sequences of the plurality of control instruction samples, where the average instruction length indicates an average value, a median or a maximum value of those instruction lengths.
S107. Process the instruction length of the control instruction sequence of each of the plurality of control instruction samples into the average instruction length.
S103 then specifically is: obtain the average control instruction according to the plurality of control instruction samples after the instruction length processing, where the instruction length of the average control instruction is the average instruction length.
S104 then specifically is: obtain the multi-dimensional random distribution according to the average control instruction whose instruction length is the average instruction length and the plurality of control instruction samples after the instruction length processing.
Optionally, the control instruction sequence of each control instruction sample in the plurality of control instruction samples includes at least one control instruction string, each control instruction string in the at least one control instruction string is formed by bottom layer control instructions of the same type with time sequences, and the bottom layer control instructions of each control instruction string in the at least one control instruction string are equal in number and correspond in time sequence, so that the instruction length is used to indicate the number of bottom layer control instructions of any control instruction string in the at least one control instruction string.
In a specific example, assume that the steering angle needs to be controlled to execute a lane change. Statistics over the acquired driving data show that, on average, 15 time-ordered steps are required to execute a lane change, so each lane change control instruction sample is grown or truncated to 15 dimensions; that is, the instruction length is 15 and each lane change control instruction sample is a 15-dimensional vector. From the lane change control instruction samples obtained in the previous step, the average steering angle θ_i of each step can be calculated, giving a 15-dimensional average steering angle sequence {θ_1, ..., θ_15}. Taking {θ_1, ..., θ_15} as the mean, the differences between the per-step steering angle values of each lane change control instruction sample and the corresponding entries of the average sequence are combined into a 15 × N matrix X, where N is the number of lane change control instruction samples, and the covariance matrix is the 15 × 15 matrix XX^T. Using the mean vector θ = {θ_1, ..., θ_15} and the covariance matrix XX^T, a 15-dimensional Gaussian distribution can be constructed:

P_h = N(θ, XX^T),

i.e. a multivariate normal distribution with mean θ and covariance XX^T. It should be noted that this is only one exemplary multi-dimensional random distribution; other multi-dimensional random distributions may be generated according to different requirements.
From the above example, the method for obtaining the multi-dimensional random distribution can be summarized in the following steps:
1) Calculate the average control instruction sequence length (denoted T here) from the instruction lengths of all control instruction samples according to a specific rule; the specific rule may be to take the average, the median or the maximum of all the length values.
2) Grow or clip all control instruction samples so that every processed instruction length equals T; each control instruction sample is then a T-dimensional sequence.
3) Generate a T-dimensional average control instruction from all the control instruction samples processed in 2).
4) For each control dimension, generate a T-dimensional random distribution P_h from the control instruction samples processed in 2) and the average control instruction generated in 3); P_h applies a random perturbation to the average control instruction sequence in that control dimension.
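A minimal numpy sketch of steps 1) to 4) for a single control dimension, following the 15-dimensional Gaussian example above; using the raw (unnormalized) XX^T as covariance mirrors the text and is an assumption rather than a requirement.

import numpy as np

def build_multidim_distribution(samples: np.ndarray):
    """samples: (N, T) matrix of length-normalized control instruction samples
    for one control dimension (e.g. steering angle over T = 15 steps).
    Returns the average control instruction and a sampler for the T-dimensional
    Gaussian perturbed around it."""
    mean = samples.mean(axis=0)            # average control instruction, shape (T,)
    X = (samples - mean).T                 # T x N matrix of per-step deviations
    cov = X @ X.T                          # T x T covariance, as in the example above
    def sample():
        return np.random.multivariate_normal(mean, cov)
    return mean, sample

# Usage: mean, sample = build_multidim_distribution(lane_change_samples)
# sample() yields an exploration control sequence statistically close to the mean.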
An embodiment of the present application provides a method for reinforcing a controller model, which is a reinforcement learning process; the controller obtained by this method can be used for unmanned vehicle control and can also be used to optimize an existing controller. As shown in FIG. 5, the method specifically includes:
S201. Generate a first control instruction according to the multi-dimensional random distribution; optionally, the multi-dimensional random distribution is obtained using the method described in the embodiments corresponding to FIG. 3 and/or FIG. 4, which is not repeated here.
S202. Acquire current road condition state data and input the current road condition state data into the controller to generate a second control instruction.
S203. Determine the first control instruction or the second control instruction as the actual control instruction, where the determined actual control instruction is used to control the target vehicle to execute the target basic action.
S204. Control the target vehicle to execute the target basic action according to the actual control instruction.
S205. Obtain a return parameter value according to the road condition data after the target vehicle executes the target basic action.
S206. Correct the control parameter values of the controller according to the return parameter value.
S201-S206 are a strengthening process, and repeating the steps S201-S206 can continuously optimize the control parameters of the controller to obtain a controller with better performance.
Further, S203 specifically is: randomly determine the first control instruction or the second control instruction as the actual control instruction according to a probability, where the first control instruction is determined as the actual control instruction with a first probability and the second control instruction with a second probability; optionally, the sum of the first probability and the second probability equals 1. Further, S201-S206 are executed repeatedly, where the first probability becomes smaller as the number of repetitions of S201-S206 increases and the second probability becomes larger, so that as S201-S206 are repeated, the control instruction generated by the controller is increasingly likely to be determined as the actual control instruction.
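A minimal sketch of one S201-S206 iteration; the environment and controller interfaces are assumptions used only to illustrate the control flow.

import numpy as np

def reinforce_step(controller, env, sample_exploration, p_explore):
    """One iteration of S201-S206 under assumed interfaces:
    sample_exploration() draws a first control instruction from the
    multi-dimensional random distribution; controller.act(state) returns a
    second control instruction; env supplies road-condition state and reward."""
    first_cmd = sample_exploration()                     # S201
    state = env.get_state()                              # S202
    second_cmd = controller.act(state)
    if np.random.rand() < p_explore:                     # S203: choose by probability
        actual_cmd = first_cmd
    else:
        actual_cmd = second_cmd
    next_state = env.execute(actual_cmd)                 # S204
    reward = env.reward(next_state)                      # S205: return parameter value
    controller.update(state, actual_cmd, reward)         # S206: correct control parameters
    return reward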
An embodiment of the present application provides a computing device 1000 for implementing the method for obtaining a multidimensional random distribution described in the corresponding embodiment of fig. 3 and/or fig. 4, as shown in fig. 6, the computing device 1000 includes: a processor 1001, a memory 1002, a communication component 1003 for communicating with the outside world to receive or output data, such as to obtain driving data; the memory 1002 is used for storing programmable instructions and data executed by the computing device 1000, and the processor 1001 is used for executing the programmable instructions stored in the memory 1002 to implement the method described in the corresponding embodiment of fig. 3 and/or fig. 4; the processor 1001, memory 1002, and communications component 1003 are communicatively coupled, such as by a bus 1004.
It is noted that, in practical applications, a computing device may include one or more processors, and the structure of the computing device 1000 is not limited in this application.
An embodiment of the present application provides an apparatus 100 for obtaining a multidimensional random distribution, as shown in fig. 7, the apparatus 100 includes: a data acquisition module 110, a first processing module 120, a first calculation module 130, and a second calculation module 140. The data acquisition module 110 is used for acquiring historical driving data; the first processing module 120 is configured to process the historical driving data acquired by the data acquisition module to obtain a plurality of control instruction samples; the first calculating module 130 is configured to calculate an average control command according to the multiple control command samples obtained by the first processing module 120; the second calculating module 140 is configured to calculate a multidimensional random distribution according to the average control instruction obtained by the first calculating module 130 and the plurality of control instruction samples obtained by the first processing module 120.
Optionally, the apparatus 100 further includes a statistics module 150, a third calculation module 160, and a second processing module 170; the counting module 150 is configured to count an instruction length of a control instruction sequence of each control instruction sample in the plurality of control instruction samples obtained by the first processing module 120; the third calculating module 160 is configured to calculate an average instruction length according to the instruction length of the control instruction sequence of each control instruction sample in the plurality of control instruction samples obtained by the counting module 150, where the average instruction length is used to indicate an average value, a median, or a maximum value obtained according to the instruction length of the control instruction sequence in the plurality of control instruction samples; the second processing module 170 is configured to process the instruction length of the control instruction sequence of each control instruction sample in the plurality of control instruction samples obtained by the first processing module 120 into the average instruction length obtained by the third calculating module 160; the first calculating module 130 is specifically configured to calculate an average control instruction according to the multiple control instruction samples processed by the instruction length obtained by the second processing module 170; the second calculating module 140 is specifically configured to obtain a multi-dimensional random distribution according to the multiple control instruction samples processed by the instruction length obtained by the second processing module 170 and the average control instruction obtained by the first calculating module 130.
The specific implementation of the modules described in the above embodiment may be implemented by the processor 1001 calling the programmable instructions stored in the memory 1002 in the corresponding embodiment of fig. 6.
The embodiment of the present application provides a computing device 2000 for implementing the method for enhancing the controller model described in the corresponding embodiment of fig. 5, as shown in fig. 8, where the computing device 2000 includes: a processor 2001, a memory 2002, a communication component 2003 for communicating with the outside world to receive or output data, such as the multi-dimensional random distribution obtained by the computer device 1000 or the apparatus 100 described in the above embodiments; the memory 2002 is used for storing programmable instructions and data for execution by the computing device 2000, and the processor 2001 is used for executing the programmable instructions stored in the memory 2002 to implement the method described in the corresponding embodiment of fig. 5; the processor 2001, memory 2002, and communication component 2003 are communicatively coupled, such as via a bus 2004.
It is noted that, in practical applications, a computing device may include one or more processors, and the structure of the computing device 2000 is not limited in this application.
An embodiment of the present application provides an apparatus 200 for reinforcing a controller model. As shown in FIG. 9, the apparatus 200 includes: a first action generation module 210, an environment module 220, a controller module 230, a determination module 240, an execution module 250, a return parameter calculation module 260, and a correction module 270. The first action generation module 210 is configured to generate a first control instruction according to the multi-dimensional random distribution; further, the apparatus 200 also includes a receiving module configured to obtain the multi-dimensional random distribution from the apparatus 100 or the computing device 1000 described in the above embodiments; optionally, the apparatus 200 and the apparatus 100 are the same apparatus, and the multi-dimensional random distribution is sent to the first action generation module 210 of the apparatus 200 by the second calculation module 140 of the apparatus 100. In this embodiment the receiving module is embedded in the first action generation module 210 and is not marked separately in FIG. 9. The environment module 220 is configured to obtain the current road condition state data of the vehicle and input the current road condition state data into the controller module 230; it should be noted that the controller module may be a component of the apparatus 200 or a controller independent of the apparatus 200, and in this embodiment the controller module is described as a component of the apparatus 200. The controller module 230 is configured to generate a second control instruction according to the current road condition state data input by the environment module 220. The determination module 240 is configured to determine that the first control instruction generated by the first action generation module 210 or the second control instruction generated by the controller module 230 is the actual control instruction; further, the determination module 240 randomly determines, according to a probability, the first control instruction or the second control instruction as the actual control instruction, where the first control instruction is determined as the actual control instruction with a first probability and the second control instruction with a second probability. The execution module 250 is configured to control the target vehicle to execute the target basic action according to the actual control instruction determined by the determination module 240. The return parameter calculation module 260 is configured to obtain a return parameter value according to the road condition data after the execution module 250 controls the target vehicle to execute the target basic action. The correction module 270 is configured to correct the control parameter values of the controller module 230 according to the return parameter value obtained by the return parameter calculation module 260.
Further, the apparatus 200 further comprises: the probability update module 280 is configured to update the first probability and the second probability, wherein the first probability is smaller as the number of corrections increases, and the second probability is larger as the number of corrections increases.
The modules described in the above embodiment may be implemented by the processor 2001 calling the programmable instructions stored in the memory 2002 in the embodiment corresponding to fig. 8.
In practical applications, the apparatus 100 or the computing device 1000 described in the above embodiments may obtain the multidimensional random distribution offline, and send the multidimensional random distribution to a vehicle or other product that reinforces its controller online and on which the apparatus 200 and the computing device 2000 described in the above embodiments are installed, or send it to a device that can simulate vehicle driving and run the apparatus 200 and the computing device 2000, such as a training server. The computing device 1000 and the computing device 2000 described in the above embodiments may also be the same computing device, such as the same vehicle-mounted computer system; similarly, the apparatus 100 and the apparatus 200 may be the same apparatus.
An example of a particular reinforced controller comprises the following steps:
1) First, define the necessary elements of the reinforcement learning algorithm according to the actual problem: the state space $S_t$, the action space $A_t$, and the return function $R_t$. $S_t$ may be composed of images captured by a front camera of the unmanned vehicle, sensor information, or other valid information; this information must be acquired in real time, or obtained through processing, while the unmanned vehicle is driving. $A_t$ may be designed, according to the design requirements of the controller, as an m-dimensional continuous control instruction, i.e. $A_t \in \mathbb{R}^m$; alternatively, each control dimension may be discretized into several intervals, so that, if the k-th dimension control instruction is discretized into $n_k$ intervals, $A_t$ takes values in the resulting finite set of interval combinations. The design of $R_t$ may take into account factors such as the angle between the heading of the unmanned vehicle and the road axis, a collision penalty, and a reward for successfully executing the specified action.
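To make the two action-space designs concrete, the following sketch shows a continuous m-dimensional action clipped to a valid range and a simple per-dimension discretization into $n_k$ intervals; the ranges and interval counts are illustrative assumptions, not values fixed by this example.

```python
import numpy as np

# A sketch of the two A_t designs mentioned above (assumed ranges and n_k values).
def continuous_action(raw, low=-1.0, high=1.0):
    """m-dimensional continuous control instruction, A_t in R^m."""
    return np.clip(np.asarray(raw, dtype=float), low, high)

def discretize_dimension(value, low, high, n_k):
    """Map one continuous control value to one of n_k evenly spaced intervals."""
    edges = np.linspace(low, high, n_k + 1)
    idx = int(np.searchsorted(edges, value, side="right")) - 1
    return int(np.clip(idx, 0, n_k - 1))
```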
2) Using the $S_t$, $A_t$ and $R_t$ defined in 1) above, construct a reinforcement learning system. As shown in fig. 10, the system includes a self-body module, an environment module, a communication module, and a training module, where the self-body module includes an action generation module and an information calculation module. If the model is trained in a simulator environment, the self-body module is the AI vehicle in the simulator; if the model is trained in a real-vehicle environment, the self-body module is the unmanned vehicle.
The data generated by the reinforcement learning system is exchanged as follows. At step t, the self-body module acquires the state $S_t$ from the environment module, and the information calculation module calculates the reward value $R_{t-1}$ from $S_t$; the action generation module generates the corresponding exploration action instruction $A_t$ from $S_t$, the current controller model and the random distribution constructed earlier, and feeds it back to the self-body module. After the self-body executes $A_t$, the state $S_{t+1}$ and the reward $R_t$ of the next step are obtained from the environment. Meanwhile, the self-body module sends $S_t$ and $R_{t-1}$ to the training module through the communication module; after receiving $S_t$ and $R_{t-1}$, the training module adds them to an experience pool and performs model training. The training module periodically pushes the updated controller model to the action generation module of the unmanned vehicle through the communication module.
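The per-step data flow above can be summarized by the following schematic loop; the environment, action-generation, information-calculation and trainer interfaces are assumptions introduced only for illustration, not components defined by this application.

```python
# A schematic sketch of one training episode under assumed interfaces.
def run_episode(env, action_gen, info_calc, trainer, max_steps=500):
    s = env.reset()
    for t in range(max_steps):
        r_prev = info_calc.compute_reward(s)           # R_{t-1} computed from S_t
        a = action_gen.generate(s)                     # uses controller model + random distribution
        trainer.send(s, r_prev)                        # S_t, R_{t-1} go to the training module
        s, done = env.step(a)                          # execute A_t, observe S_{t+1}
        action_gen.set_model(trainer.latest_model())   # periodic model push (simplified)
        if done:
            break
```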
3) Perform reinforcement learning training using the online reinforcement learning system constructed in 2) above, obtaining a controller model whose input is $S_t$ and whose output is $A_t$. The number of episodes and the stopping rule required for training can be set according to the actual situation, and the reinforcement learning algorithm applied can be chosen, according to the design of $S_t$ and $A_t$, from the Q-learning algorithm, the DQN algorithm, the DDPG algorithm, and the like.
One specific embodiment provided by the present application, as shown in fig. 11, includes the following:
S1. A human driver drives a real vehicle and repeatedly executes the lane-changing action, and the real lane-change control instruction sequences are recorded. This comprises the following steps:
S11. A recording module is added to the computer software system of the real vehicle. While the driver drives, the recording module records the three-dimensional control instruction (steering angle, acceleration, accelerator) corresponding to each timestamp during driving, and the start timestamp corresponding to each lane-changing action executed by the driver. The human driver repeatedly executes the lane-changing action, and the recording module records a large amount of driving data related to the lane-changing action.
S12: by using the driving data acquired in S11, the action sequence extraction module extracts a three-dimensional control instruction sequence corresponding to the lane change action according to the start time stamp of the lane change action, to obtain a control instruction sequence set of the lane change action.
Another alternative of S1 is as follows:
S1'. Valid data is acquired from a public data set to obtain real lane-change control instruction sequences. Publicly available driving data sets include the Oxford dataset, the Udacity dataset, the KITTI dataset, and the like. The driving data provided by such public data sets generally includes valid information such as pictures taken by a vehicle-mounted camera, data from multiple sensors, and the corresponding control instructions. S1' includes the following steps:

S11'. Each frame of image data is converted, by means of image processing, into state information that can be used to determine the state of the unmanned vehicle. For example, a trained deep learning model is used to build a bird's-eye-view road grid for each frame of the continuous driving video; each cell in the grid corresponds to an integer between 0 and 100 representing the confidence that an obstacle is present in that cell, with a larger value indicating higher confidence. The lateral relative distance between the host vehicle and the road edge can then be calculated from the road grid information of each frame.

S12'. The state information of multiple consecutive steps is evaluated with a judgment rule consistent with human driving habits, to locate the start timestamp and end timestamp of each executed lane-change action.

S13'. According to these start and end timestamps, multiple control instruction sequence segments that execute the specified action are cut out of the continuous long-duration control instruction sequence. One possible judgment rule is sketched below.
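A minimal sketch of one such judgment rule, under the simplifying assumption that a lane change shows up as a sustained period of significant lateral motion in the per-frame lateral distance; the thresholds are illustrative.

```python
import numpy as np

# A sketch, assuming lateral_dist holds the per-frame lateral distance to the road edge.
def locate_lane_changes(lateral_dist, vel_thresh=0.05, min_len=10):
    """Group consecutive frames with significant lateral motion into (start, end) windows."""
    lateral_vel = np.abs(np.diff(np.asarray(lateral_dist, dtype=float)))
    active = lateral_vel > vel_thresh
    windows, start = [], None
    for t, moving in enumerate(active):
        if moving and start is None:
            start = t
        elif not moving and start is not None:
            if t - start >= min_len:
                windows.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        windows.append((start, len(active)))
    return windows
```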
S2. The control instruction sequences collected in S1 are processed to generate the multidimensional random distribution. This comprises the following steps:
S21. Calculate the average length of all the control instruction sequences collected in S1:
$$\bar{L} = \frac{1}{N}\sum_{i=1}^{N} L_i ,$$
where $N$ is the number of collected sequences and $L_i$ is the length of the i-th sequence.

S22. Pad or truncate all control instruction sequences so that every processed sequence has length $\bar{L}$. Let each control instruction sequence be $\{T_i(t),\ t = 1, \dots, L_i\}$, where $L_i$ is the length of the sequence, and denote the length-processed control instruction sequence by $\{\tilde{T}_i(t),\ t = 1, \dots, \bar{L}\}$. One possible padding/truncation method is: when $L_i \ge \bar{L}$, the sequence is truncated to length $\bar{L}$; when $L_i < \bar{L}$, the sequence is padded up to length $\bar{L}$.
S23. For each of the three control dimensions, from all of the $\bar{L}$-length control instruction sequences generated in S22, generate the corresponding $\bar{L}$-dimensional average control instruction sequence and $\bar{L}\times\bar{L}$-dimensional covariance matrix:
$$\mu_d = \frac{1}{N}\sum_{i=1}^{N}\tilde{T}^{(d)}_i, \qquad \Sigma_d = \frac{1}{N}\sum_{i=1}^{N}\big(\tilde{T}^{(d)}_i - \mu_d\big)\big(\tilde{T}^{(d)}_i - \mu_d\big)^{\top},$$
where $\tilde{T}^{(d)}_i \in \mathbb{R}^{\bar{L}}$ is the $d$-th dimension (steering angle, acceleration or brake) of the i-th length-processed control instruction sequence.

S24. From each pair $\mu_d$, $\Sigma_d$, construct an $\bar{L}$-dimensional Gaussian distribution, giving the distributions $P_{\text{steering angle}}$, $P_{\text{acceleration}}$ and $P_{\text{brake}}$, whose probability density functions are, respectively,
$$P_d(x) = \frac{1}{(2\pi)^{\bar{L}/2}\,\lvert\Sigma_d\rvert^{1/2}}\exp\!\Big(-\tfrac{1}{2}(x-\mu_d)^{\top}\Sigma_d^{-1}(x-\mu_d)\Big), \qquad d \in \{\text{steering angle}, \text{acceleration}, \text{brake}\}.$$
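A compact sketch of steps S21-S24, under the simplifying assumptions that over-long sequences are truncated, short sequences are padded by repeating their last instruction, and the sample mean and covariance are computed with numpy; the variable names are illustrative.

```python
import numpy as np

def build_distributions(sequences):
    """sequences: list of arrays of shape (L_i, 3) with columns
    (steering angle, acceleration, brake)."""
    L_bar = int(round(np.mean([len(s) for s in sequences])))      # S21: average length

    processed = []                                                # S22: pad / truncate to L_bar
    for s in sequences:
        s = np.asarray(s, dtype=float)
        if len(s) >= L_bar:
            s = s[:L_bar]
        else:
            s = np.vstack([s, np.repeat(s[-1:], L_bar - len(s), axis=0)])
        processed.append(s)
    data = np.stack(processed)                                    # shape (N, L_bar, 3)

    dists = {}
    for d, name in enumerate(("steering_angle", "acceleration", "brake")):
        samples = data[:, :, d]                                   # shape (N, L_bar)
        mu = samples.mean(axis=0)                                 # S23: average control sequence
        sigma = np.cov(samples, rowvar=False)                     # S23: L_bar x L_bar covariance
        dists[name] = (mu, sigma)                                 # S24: Gaussian N(mu, sigma)
    return L_bar, dists

# Example: drawing one exploration sequence per dimension from the fitted Gaussians.
# rng = np.random.default_rng()
# sample = {k: rng.multivariate_normal(mu, sigma) for k, (mu, sigma) in dists.items()}
```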
S3. An online reinforcement learning system is constructed, and online training is performed using the random distribution generated in S2 as a free exploration strategy, to generate the lane-change controller model. The steps involved include:
S31. According to the sensing capability of the unmanned vehicle and the control dimensions of the controller to be obtained, define the necessary elements of the reinforcement learning algorithm: the state space $S_t$, the action space $A_t$, and the return function $R_t$.
a) One possible design of $S_t$ is a set of state variables describing the vehicle and the road, such as the longitudinal and lateral speeds $V_x$ and $V_y$, the angle $\theta$ between the vehicle heading and the road axis, and the lateral position trackPos used in the return function below.
b) One possible design of $A_t$ is:

Steering angle: range [-1, 1]; an output of -1 means maximum right turn, +1 means maximum left turn.
Acceleration: range [-1, 1]; an output of 0 means no acceleration, 1 means full acceleration.
Brake: range [-1, 1]; an output of 0 means no braking, 1 means emergency braking.
c) One possible design of $R_t$ is: if no collision occurs,
$$R_t = V_x\cos\theta - V_y\sin\theta - V_x\,\lvert \mathrm{trackPos} \rvert ;$$
if a collision occurs,
$$R_t = -500.$$
The intuitive explanation is that the axial (longitudinal) speed of the unmanned vehicle is rewarded, while the lateral speed, collisions, and deviation of the vehicle position from the road center axis are penalized. A sketch of this return function is given below.
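A minimal sketch of the return function defined in c); the variable names and the collision flag are illustrative assumptions.

```python
import math

COLLISION_PENALTY = -500.0

def compute_return(vx: float, vy: float, theta: float,
                   track_pos: float, collided: bool) -> float:
    """R_t = Vx*cos(theta) - Vy*sin(theta) - Vx*|trackPos|, or -500 on collision."""
    if collided:
        return COLLISION_PENALTY
    return vx * math.cos(theta) - vy * math.sin(theta) - vx * abs(track_pos)
```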
S32. For the state space $S_t$, action space $A_t$ and return function $R_t$ defined in S31, an online reinforcement learning system is constructed, with reference to fig. 10.
The self-body module corresponds to the unmanned vehicle, and the environment module corresponds to a multi-lane real road with lane lines. Other moving vehicles travelling on the road give the unmanned vehicle opportunities to change lanes, and the arrangement of these other moving vehicles has a certain randomness, so that the model is trained on diverse data and its generalization capability is enhanced.
The information acquisition module and the perception fusion module of the unmanned vehicle receive various external information and process it into $S_t$. At step t, the information calculation module receives the $S_t$ output by the perception fusion module and calculates the corresponding $R_{t-1}$ from $S_t$; the action generation module uses the probability distributions generated in S2 and the current controller model to generate the corresponding action instruction $A_t$ from $S_t$. After the unmanned vehicle executes $A_t$, the state $S_{t+1}$ and the reward $R_t$ of the next step are obtained from the environment. Meanwhile, the unmanned vehicle sends $S_t$ and $R_{t-1}$ through the communication module to the training module running on the server side; after receiving $S_t$ and $R_{t-1}$, the training module adds them to the experience pool and performs model training. The training module periodically pushes the updated model to the action generation module of the unmanned vehicle through the communication module.
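On the server side, the training module's role can be sketched as follows; the pool size, batch size, push interval and the train_step/push_model placeholders are illustrative assumptions rather than parameters fixed by this embodiment.

```python
import random
from collections import deque

class TrainingModule:
    """Sketch: collect (S_t, R_{t-1}) transitions, train, and periodically push the model."""
    def __init__(self, model, pool_size=100_000, batch_size=64, push_every=1000):
        self.model = model
        self.pool = deque(maxlen=pool_size)      # experience pool
        self.batch_size = batch_size
        self.push_every = push_every
        self.updates = 0

    def receive(self, transition):
        self.pool.append(transition)
        if len(self.pool) >= self.batch_size:
            batch = random.sample(self.pool, self.batch_size)
            self.train_step(batch)               # e.g. one DDPG or DQN update
            self.updates += 1
            if self.updates % self.push_every == 0:
                self.push_model()                # push the updated model to the vehicle

    def train_step(self, batch):
        pass                                     # placeholder for the chosen RL update

    def push_model(self):
        pass                                     # placeholder for the communication module
```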
S33. Reinforcement learning training is performed using the online reinforcement learning system constructed in S32, obtaining a controller model whose input is $S_t$ and whose output is $A_t$.
In the reinforcement learning training process, one episode is defined as the number of steps used by the self-body to complete the specified action; if the self-body cannot complete the specified action within a preset maximum number of steps, the episode is terminated.
At each episode of the training process, the action generation module samples one point from each of the probability distributions $P_{\text{steering angle}}$, $P_{\text{acceleration}}$ and $P_{\text{brake}}$ generated in S2, obtaining $A_{\text{steering angle}}$, $A_{\text{acceleration}}$ and $A_{\text{brake}}$, where each sampled point is a control instruction sequence of length $\bar{L}$ in the corresponding dimension. The three-dimensional exploration control instruction sequence generated for the i-th episode is denoted
$$\hat{A}^{(i)} = \big(A^{(i)}_{\text{steering angle}},\ A^{(i)}_{\text{acceleration}},\ A^{(i)}_{\text{brake}}\big), \qquad A^{(i)}_{d} \in \mathbb{R}^{\bar{L}}.$$
At the j-th step of this episode, the action generation module generates, with probability $\varepsilon_{ij}$, the control instruction $\hat{A}^{(i)}_j$ taken from the sampled exploration sequences, and, with probability $1-\varepsilon_{ij}$, generates $A_t$ according to the controller model currently pushed to the action generation module; the probabilities $\{\varepsilon_{ij}\}$ are required to decrease as training proceeds. A sketch of this sampling and mixing is given below.
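A minimal sketch of the per-step mixing, assuming the exploration sequences have already been drawn from the fitted Gaussians (see the sketch after S24) and assuming an exponential decay schedule for $\varepsilon_{ij}$; both assumptions are illustrative.

```python
import numpy as np

def choose_action(j, episode_idx, sampled, controller_action,
                  rng, eps0=0.9, decay=0.99):
    """sampled: dict of length-L_bar sequences per control dimension for this episode."""
    eps_ij = eps0 * (decay ** episode_idx)       # decreases as training proceeds
    if rng.random() < eps_ij:                    # exploration: use the sampled sequences
        return np.array([sampled[k][j]
                         for k in ("steering_angle", "acceleration", "brake")])
    return controller_action                     # exploitation: model-generated A_t
```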
The training process can adopt the DDPG algorithm, which uses an actor-critic framework comprising a policy function A(s) and an evaluation function Q(s, a); the policy function is called the actor and the value function is called the critic. Essentially, the actor produces an action a in the current environment state s, and the critic produces a signal that criticizes the action made by the actor. The DDPG algorithm uses a SARSA-style update as the critic model and a policy gradient algorithm as the actor model; the structure of the actor-critic model is shown in fig. 12. At step t, the parameter $w$ of the Q function is updated by
$$w \leftarrow w + \alpha_w\Big(R_t + \gamma\, Q_w\big(S_{t+1}, A_\theta(S_{t+1})\big) - Q_w(S_t, A_t)\Big)\,\nabla_w Q_w(S_t, A_t),$$
and the parameters $\theta$ of the A function are updated by
$$\theta \leftarrow \theta + \alpha_\theta\, \nabla_a Q_w(S_t, a)\big|_{a = A_\theta(S_t)}\, \nabla_\theta A_\theta(S_t).$$
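A simplified, self-contained sketch of one DDPG-style update step consistent with the two formulas above, omitting target networks, replay details and exploration noise; the network sizes, learning rates and the state dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 29, 3, 0.99        # assumed dimensions

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """s, a, s_next: batched tensors; r: tensor of shape (batch, 1)."""
    # Critic: minimise the TD error (R_t + gamma*Q(S_{t+1}, A(S_{t+1})) - Q(S_t, A_t))^2
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = (target - critic(torch.cat([s, a], dim=-1))).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: follow grad_a Q(s, a) * grad_theta A(s) by maximising Q(s, A(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```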
if the continuous strategy space is discretized, the training process can also adopt the DQN algorithm. The DQN algorithm is mainly applied to the discrete decision problem, and includes a policy function a(s) and an estimation function Q (s, a), and in the tth step, the parameter w corresponding to the Q function is updated by the following formula:
the parameter w corresponding to the Q function is updated by the following formula
Figure BDA0001461273550000141
Policy function A is updated as follows
Figure BDA0001461273550000142
Through steps S1-S3, baseline data of successful lane changes is collected from real lane-change data, and an average lane-change control instruction is obtained; free exploration is then carried out around this average control instruction, so the exploration process is confined to the vicinity of a control sequence that successfully executes the lane change, the exploration space is reduced in a targeted manner, and the convergence of the algorithm is greatly accelerated. At the same time, because the average control sequence is computed from real data of a real vehicle driven by a human driver, it conforms to the dynamic model of the real vehicle, agent collisions during training are avoided to the greatest extent, the method is better suited to real-vehicle training, and the trained model is easier to deploy in actual scenarios. The unmanned vehicle can continuously perform reinforcement learning training under complex road conditions, continuously optimize while satisfying the requirements of the basic lane-change action, enhance its adaptability and driving comfort in the face of complex, sudden road conditions, and achieve fine-tuning of the control strategy.
In another embodiment provided herein, S2, S31 and S33 are the same as in the above embodiment; the steps that differ are as follows:
S1. The automatic driving simulator torcs is selected to collect experience data. High-level 'change to left lane' and 'change to right lane' instructions are continuously sent to an AI vehicle in the simulator through the torcs development interface, so that the AI vehicle continuously executes lane-change actions; the start and end timestamps ts and tf corresponding to each lane-change action are recorded, and the control instructions in the time period [ts, tf] form one lane-change control instruction sequence. To give the AI vehicle a simple, basic lane-change capability, a rule-based algorithm may be applied to change lanes for the AI vehicle. In this way, a set of lane-change control instruction sequences can be collected.
S32. For the inputs, outputs and return function defined in S31, an online reinforcement learning system is constructed. The self-body of the whole online reinforcement learning system is an AI vehicle controlled by a program, the environment is the torcs simulation environment, and a track with multiple lanes is selected on the map; multiple other AI vehicles need to be deployed in torcs to constantly create lane-change opportunities for the self-body. The whole software architecture comprises the perception fusion module of torcs, the model training module, the communication module, and the information calculation module and action generation module of the AI vehicle.
At step t, the perception fusion module of the AI vehicle receives various information from torcs and converts it into the state $S_t$; the information calculation module calculates the reward value $R_{t-1}$ from $S_t$; the action generation module generates the corresponding action instruction $A_t$ from $S_t$. After the AI vehicle executes $A_t$, the state $S_{t+1}$ and the reward $R_t$ of the next step are obtained from the environment. Meanwhile, the AI vehicle sends $S_t$ and $R_{t-1}$ through the communication module to the training module running on the server side; after receiving $S_t$ and $R_{t-1}$, the training module adds them to the experience pool and performs model training. The training module periodically pushes the updated model to the action generation module of the AI vehicle through the communication module.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (21)

1. A method for obtaining a multi-dimensional random distribution for reinforcing a controller model, comprising:
acquiring historical driving data, wherein the historical driving data comprises bottom layer control instructions for executing target basic actions;
processing the historical driving data to obtain a plurality of control instruction samples, wherein each control instruction sample in the plurality of control instruction samples is a control instruction sequence which is formed by bottom layer control instructions with time sequence and is used for executing the target basic action;
obtaining an average control instruction according to the plurality of control instruction samples, wherein the average control instruction is used for indicating a control instruction sequence which is composed of bottom layer control instructions with control values at an average level and with time sequence and is used for executing the target basic action;
and obtaining multi-dimensional random distribution according to the average control instruction and the plurality of control instruction samples, wherein the multi-dimensional random distribution is an expected function distribution which is disturbed in a certain range around the average control instruction.
2. The method of claim 1, wherein prior to said deriving an average control command from said plurality of control command samples, further comprising:
counting an instruction length of a control instruction sequence of each control instruction sample in the plurality of control instruction samples, wherein the instruction length is used for indicating the number of bottom-layer control instructions with time sequence;
calculating an average instruction length according to the instruction length of the control instruction sequence of each control instruction sample in the plurality of control instruction samples, wherein the average instruction length is used for indicating an average value, a median or a maximum value obtained according to the instruction length of the control instruction sequence in each control instruction sample in the plurality of control instruction samples;
processing an instruction length of a sequence of control instructions for each control instruction sample of the plurality of control instruction samples into the average instruction length;
the obtaining an average control instruction according to the plurality of control instruction samples comprises:
obtaining the average control instruction according to a plurality of control instruction samples after instruction length processing, wherein the instruction length of the average control instruction is the average instruction length;
the obtaining of the multi-dimensional random distribution according to the average control instruction and the plurality of control instruction samples includes:
and obtaining the multi-dimensional random distribution according to the average control instruction and a plurality of control instruction samples processed by the instruction length.
3. The method as claimed in claim 1 or 2, wherein the control instruction sequence of each of the plurality of control instruction samples comprises at least one control instruction string, each control instruction string in the at least one control instruction string is composed of bottom control instructions of the same type with timing, and the bottom control instructions of each control instruction string in the at least one control instruction string are equal in number and corresponding in timing.
4. The method of claim 3, wherein the instruction length is to indicate a number of underlying control instructions of any of the at least one control instruction string.
5. The method of any one of claims 1-2, wherein the floor control commands include one or more of an acceleration parameter, a steering angle parameter, and a braking parameter.
6. The method of any one of claims 1-2, further comprising reinforcement learning;
the reinforcement learning includes:
generating a first control command according to the multi-dimensional random distribution;
acquiring current road condition state data and inputting the current road condition data into the controller model to generate a second control instruction;
determining that the first control instruction or the second control instruction is an actual control instruction, wherein the actual control instruction is used for controlling a target vehicle to execute the target basic action;
controlling the target vehicle to execute the target basic action according to the actual control instruction;
obtaining a return parameter value according to the road condition data after the target vehicle executes the target basic action;
and correcting the control parameter value of the controller model according to the return parameter value.
7. The method of claim 6, wherein the determining that the first control directive or the second control directive is an actual control directive comprises:
and randomly determining the first control instruction or the second control instruction as the actual control instruction according to a probability, wherein the first control instruction is determined as the actual control instruction according to a first probability, and the second control instruction is determined as the actual control instruction according to a second probability.
8. The method of claim 7, wherein the reinforcement learning is performed repeatedly, wherein the first probability is smaller and smaller as a number of repetitions increases, and wherein the second probability is larger and larger as a number of repetitions increases.
9. The method of claim 7 or 8, wherein the sum of the first probability and the second probability is equal to 1.
10. The method of any one of claims 1-2, wherein the target base action is any one of lane change, passing, following, parking, straight going along a road axis.
11. An apparatus for obtaining a multi-dimensional stochastic distribution for augmenting a controller model, comprising:
the data acquisition module is used for acquiring historical driving data, and the historical driving data comprises a bottom layer control instruction for executing a target basic action;
the first processing module is used for processing the historical driving data acquired by the data acquisition module to obtain a plurality of control instruction samples, and each control instruction sample in the plurality of control instruction samples is a control instruction sequence which is formed by bottom layer control instructions with time sequences and is used for executing the target basic action;
the first calculation module is used for calculating an average control instruction according to a plurality of control instruction samples obtained by the first processing module, and the average control instruction is used for indicating a control instruction sequence which is formed by bottom layer control instructions with control values at an average level and time sequence;
and the second calculation module is used for calculating multi-dimensional random distribution according to the average control instruction obtained by the first calculation module and the plurality of control instruction samples obtained by the first processing module, wherein the multi-dimensional random distribution is an expected function distribution which is disturbed in a certain range around the average control instruction.
12. The apparatus as recited in claim 11, further comprising:
the counting module is used for counting the instruction length of the control instruction sequence of each control instruction sample in the plurality of control instruction samples obtained by the first processing module, and the instruction length is used for indicating the number of bottom layer control instructions with time sequence;
a third calculating module, configured to calculate an average instruction length according to the instruction length of the control instruction sequence of each control instruction sample in the plurality of control instruction samples obtained by the counting module, where the average instruction length is used to indicate an average value, a median, or a maximum value obtained according to the instruction length of the control instruction sequence in each control instruction sample in the plurality of control instruction samples;
a second processing module, configured to process the instruction length of the control instruction sequence of each control instruction sample in the multiple control instruction samples obtained by the first processing module into an average instruction length obtained by the third calculating module;
the first calculation module is specifically configured to:
calculating the average control instruction according to a plurality of control instruction samples obtained by the second processing module after instruction length processing, wherein the instruction length of the average control instruction is the average instruction length;
the second calculation module is specifically configured to:
and obtaining the multi-dimensional random distribution according to the average control instruction obtained by the first calculating module and a plurality of control instruction samples processed by the instruction length obtained by the second processing module.
13. The apparatus according to claim 11 or 12, wherein the control instruction sequence of each of the plurality of control instruction samples comprises at least one control instruction string, each of the at least one control instruction string is composed of bottom layer control instructions of a same type with a timing sequence, and the bottom layer control instructions of each of the at least one control instruction string are equal in number and corresponding in timing sequence.
14. The apparatus of claim 13, wherein the instruction length is to indicate a number of underlying control instructions of any of the at least one control instruction string.
15. The apparatus of any one of claims 11-12, wherein the floor control commands include one or more of an acceleration parameter, a rotation angle parameter, and a braking parameter.
16. The apparatus of any one of claims 11-12, further comprising a reinforcement learning module;
the reinforcement learning module includes:
the first action generating unit is used for generating a first control instruction according to the multi-dimensional random distribution;
the environment unit is used for acquiring current road condition state data and inputting the current road condition data into the controller unit;
the controller unit is used for receiving the current road condition state data input by the environment unit and generating a second control instruction;
a determination unit configured to determine that the first control instruction generated by the first action generation unit or the second control instruction generated by the controller unit is an actual control instruction, where the actual control instruction is used to control a target vehicle to execute the target basic action;
the execution unit is used for controlling the target vehicle to execute the target basic action according to the actual control instruction determined by the determination unit;
the return parameter calculating unit is used for obtaining a return parameter value according to the road condition data after the executing unit controls the target vehicle to execute the target basic action;
and the correcting unit is used for correcting the control parameter value of the controller unit according to the return parameter value obtained by the return parameter calculating unit.
17. The apparatus as claimed in claim 16, wherein said determining unit is specifically configured to:
and randomly determining, according to a probability, that the first control command generated by the first action generation means or the second control command generated by the controller means is the actual control command, wherein the first control command is determined to be the actual control command with a first probability and the second control command is determined to be the actual control command with a second probability.
18. The apparatus of claim 17, further comprising:
the probability updating unit is used for updating the first probability and the second probability, wherein the first probability is smaller and smaller as the correction times increase, and the second probability is larger and larger as the correction times increase.
19. The device according to any one of claims 11-12, wherein the target basic action is any one of lane change, passing, following, parking, straight going along a road axis.
20. An automobile, comprising: the device of any one of claims 11-19.
21. A storage medium having stored thereon programmable instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-10.
CN201711091328.9A 2017-11-08 2017-11-08 Method and device for acquiring multi-dimensional random distribution and strengthening controller Active CN109752952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711091328.9A CN109752952B (en) 2017-11-08 2017-11-08 Method and device for acquiring multi-dimensional random distribution and strengthening controller

Publications (2)

Publication Number Publication Date
CN109752952A CN109752952A (en) 2019-05-14
CN109752952B true CN109752952B (en) 2022-05-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant