CN113879339A - Decision planning method for automatic driving, electronic device and computer storage medium - Google Patents

Decision planning method for automatic driving, electronic device and computer storage medium

Info

Publication number
CN113879339A
CN113879339A
Authority
CN
China
Prior art keywords
planning
strategy
information
driving
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111481018.4A
Other languages
Chinese (zh)
Inventor
陈俊波
雷岚馨
敬巍
王刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202111481018.4A
Publication of CN113879339A
Priority to PCT/CN2022/130733 (published as WO2023103692A1)
Legal status: Pending

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00Input parameters relating to data
    • B60W2556/10Historical data
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00Input parameters relating to data
    • B60W2556/40High definition maps

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the present application provide a decision planning method for automatic driving, an electronic device and a computer storage medium. The decision planning method for automatic driving comprises the following steps: acquiring driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises geometric information, historical driving track information and map information related to the object to be decided; obtaining, according to the driving perception information and driving target information, a plurality of planning strategies that conform to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and performing decision planning on the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy. With the embodiments of the present application, decision planning can be performed effectively in strong interaction scenarios in automatic driving, and the decision effect is improved.

Description

Decision planning method for automatic driving, electronic device and computer storage medium
Technical Field
The embodiment of the application relates to the technical field of automatic driving, in particular to a decision planning method for automatic driving, electronic equipment and a computer storage medium.
Background
Automatic driving technology controls corresponding equipment (such as autonomous vehicles, unmanned aerial vehicles, robots and the like) in real time and continuously by means of communication, computer, network and control technologies.
With the development of automatic driving technology, driving decision planning is increasingly applied within it. Taking an autonomous vehicle as an example, current automatic driving decision planning can give specific driving suggestions according to changes in road conditions, such as encountering pedestrians, encountering other vehicles or traffic congestion, and control the autonomous vehicle to carry out reasonable driving operations. However, in some scenarios, such as strong interaction scenarios in automatic driving, an effective decision plan cannot be given due to problems such as insufficient data granularity.
A strong interaction scenario in automatic driving is one in which the ego side's decision plan needs to be frequently adjusted based on the decisions of the other party. Such scenarios often arise in low-speed, congested situations, for example meeting an oncoming vehicle in a narrow lane, passing in a narrow lane, or negotiating a roundabout. In these scenarios, traditional decision planning struggles to work. Therefore, how to perform effective decision planning in strong interaction scenarios in automatic driving has become an urgent problem to be solved.
Disclosure of Invention
In view of the above, embodiments of the present application provide a decision planning scheme for automatic driving to at least partially solve the above problems.
According to a first aspect of embodiments of the present application, there is provided a decision planning method for automatic driving, including: acquiring driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises: geometric information, historical driving track information and map information related to the object to be decided; obtaining a plurality of planning strategies which accord with Gaussian mixture distribution and strategy evaluation corresponding to each planning strategy according to the driving perception information and the driving target information; and performing decision planning on the object to be decided according to the planning strategies and the strategy evaluation corresponding to each planning strategy.
According to a second aspect of embodiments of the present application, there is provided an electronic apparatus, including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the corresponding operation of the method according to the first aspect.
According to a third aspect of embodiments herein, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the method according to the first aspect.
According to the decision planning scheme for automatic driving provided by the embodiments of the present application, for strong interaction scenarios in automatic driving, on the one hand, driving perception information of a continuous behavior space is used; because of its continuity, this information has a finer data granularity, so the planning strategies determined based on it also have a finer granularity, which is better suited to automatic driving decision processing in strong interaction scenarios. On the other hand, multiple planning strategies are obtained for the strong interaction scenario, and these planning strategies conform to a Gaussian mixture distribution; they therefore have higher executability and rationality and can effectively cope with the different possible operations of the other party, which better meets the requirements of strong interaction. Therefore, with the scheme of the embodiments of the present application, decision planning can be performed effectively in strong interaction scenarios in automatic driving, and the decision effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some of the embodiments described in the embodiments of the present application, and that other drawings can be obtained by those skilled in the art based on these drawings.
FIG. 1 is a schematic diagram of a reinforcement learning system;
FIG. 2 is a schematic structural diagram of a reinforcement learning network model according to an embodiment of the present application;
FIG. 3A is a flowchart illustrating steps of a method for decision planning for autonomous driving according to an embodiment of the present disclosure;
FIG. 3B is a diagram illustrating an example of a scenario in the embodiment shown in FIG. 3A;
FIG. 4A is a flowchart illustrating steps of a decision-making planning method for automatic driving according to a second embodiment of the present application;
FIG. 4B is a diagram of an MCTS-based reinforcement learning network model in the embodiment of FIG. 4A;
fig. 5 is a block diagram of an automatic driving decision-making planning apparatus according to a third embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the scope of the protection of the embodiments in the present application.
To facilitate understanding of aspects of embodiments of the present application, a brief schematic description of a reinforcement learning system is provided below, as shown in fig. 1.
Reinforcement learning is the process by which an agent interacts with the environment, thereby continuously reinforcing the agent's decision-making ability. The reinforcement learning system shown in fig. 1 includes an environment (Env) and an Agent (Agent). First, the environment gives the agent an observation (also called state); the agent will make an action (action) after receiving the observed value given by the environment; the environment receives the action given by the agent and then makes a series of reactions, such as giving a reward (reward) to the action and giving a new observation value; the agent updates its policy (policy) according to the reward value given by the environment, so as to finally obtain the optimal policy by continuously interacting with the environment.
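A minimal sketch of this interaction loop, assuming a simple Env/Agent interface (the class and method names below are illustrative assumptions, not part of the patent):

```python
# Illustrative sketch of the agent-environment loop described above.
# Env, Agent and their methods are hypothetical placeholders.

class Agent:
    def act(self, observation):
        """Select an action given the current observation (state)."""
        raise NotImplementedError

    def update_policy(self, observation, action, reward, next_observation):
        """Improve the policy using the reward fed back by the environment."""
        raise NotImplementedError

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()            # the environment gives an initial observation
    for _ in range(max_steps):
        action = agent.act(observation)  # the agent acts on the observation
        next_observation, reward, done = env.step(action)  # the environment reacts
        agent.update_policy(observation, action, reward, next_observation)
        observation = next_observation
        if done:
            break
```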
In practical applications, the reinforcement learning system can be realized by a policy value model, which comprises a policy branch and a value branch. The policy branch is used by the agent to select the next action based on the state, and can be realized in various ways, for example through the behavior function of the agent, or, in MCTS-based reinforcement learning, through MCTS (Monte Carlo Tree Search). The value branch is used to obtain the expectation of the cumulative reward when, starting from a state, the policy selected by the policy branch is followed. The reward is a feedback signal, usually a numerical value, indicating how good the action selected by the agent in that state is.
Specifically, an MCTS-based reinforcement learning system is adopted in the embodiment of the present application, and this system is collectively referred to as the reinforcement learning network model in the embodiment of the present application. As shown in fig. 2, the reinforcement learning network model includes a GNN (graph neural network) portion and a policy value model portion. The policy value model portion includes a policy branch and a value branch: the policy branch generates corresponding planning strategies based on MCTS, and the value branch performs value evaluation on the planning strategies generated based on MCTS and the results corresponding to those planning strategies, so as to obtain the evaluation value of the corresponding strategy evaluation.
Note that the GNN part of the reinforcement learning network model shown in fig. 2 is optional and is used to perform feature extraction on the input information. In practical applications, the corresponding input information may instead be input directly to the policy value model portion together with the target information shown in fig. 2. Alternatively, another network model, such as a CNN, may be used instead of the GNN to perform feature extraction on the input information. However, using a GNN allows the features of the input information, especially image features in strong interaction scenarios of automatic driving, to be extracted better and more effectively.
Based on the above structure, the following describes a decision planning scheme for automatic driving according to embodiments of the present application with reference to the accompanying drawings.
Example one
Referring to fig. 3A, a flow chart illustrating steps of a method for decision planning for automatic driving according to an embodiment of the present application is shown.
The decision planning method for automatic driving of the embodiment comprises the following steps:
step S302: and acquiring the driving perception information of the object to be decided in the continuous behavior space.
The object to be decided may be a device that carries an agent device (such as a processor or a chip) and performs corresponding operations according to instructions of the agent device, for example an apparatus that executes the operation instructions corresponding to the decision plan; it may also be a device that uploads corresponding information to a remote agent device (such as a server) and receives instructions from that agent device to perform corresponding operations. In the embodiment of the present application, realizable forms of the object to be decided include, but are not limited to: vehicles, unmanned aerial vehicles, robots and the like that can drive automatically; the embodiment of the present application does not limit the specific implementation form of the object to be decided.
The driving behavior of an object to be decided is generally continuous, but the processing of the corresponding data can be based either on a discrete behavior space or on a continuous behavior space. Data in a continuous behavior space preserves this continuity, and can therefore more accurately reflect the state, operation and other information of the object to be decided at each moment. In the embodiment of the present application, the driving perception information of the object to be decided in the continuous behavior space is obtained. The driving perception information includes at least: geometric information, historical driving track information and map information related to the object to be decided.
The geometric information related to the object to be decided includes the geometric information of the object to be decided itself (such as its contour and shape) and the geometric information of physical objects in the environment where it is located (such as the contours and shapes of surrounding vehicles, obstacles, road facilities and the like). The historical driving track information related to the object to be decided includes information on its driving track during a preset time period before the current time, where the preset time period can be set by a person skilled in the art according to actual requirements, for example the 3 to 5 seconds before the current time. The map information related to the object to be decided is usually map information of the geographic area around its current position, and usually includes data on the road where the object to be decided is located and the topological structure of the surrounding roads. The driving perception information can be acquired and processed by information acquisition equipment in the object to be decided, such as cameras, radars and various sensors, and the current driving state of the object to be decided can be comprehensively and accurately described by this driving perception information.
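A minimal sketch of how these three kinds of driving perception information might be organized, assuming simple container types (all field names below are illustrative assumptions):

```python
# Illustrative sketch; field names are assumptions, not definitions from the patent.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DrivingPerception:
    # Geometric information: outline polygons of the ego object and of
    # surrounding physical objects (vehicles, obstacles, road facilities, ...).
    ego_geometry: List[Tuple[float, float]]
    surrounding_geometry: List[List[Tuple[float, float]]]
    # Historical driving track: (x, y, heading) samples over the last few seconds.
    history_track: List[Tuple[float, float, float]]
    # Map information: road topology around the current position,
    # e.g. polylines describing road boundaries.
    road_boundaries: List[List[Tuple[float, float]]]
```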
Step S304: and obtaining a plurality of planning strategies according with Gaussian mixture distribution and strategy evaluation corresponding to each planning strategy according to the driving perception information and the driving target information.
The driving target information characterizes information related to the driving target. In a strong interaction scenario it generally includes, but is not limited to: information on a target point (or target area), the position of the target point (or target area), the distance from the target point (or target area) to the current position of the object to be decided, and the current speed, state and heading angle of the object to be decided. The target point (or target area) in a strong interaction scenario may be one that is close to the current position of the object to be decided and can be reached with only a few operations (e.g., 1 to 3). For example, in a vehicle-meeting scenario, the target point (or target area) may be a nearby target location ahead of the current position, at an angle to the other party's vehicle so as to avoid a collision, and so on. Through the driving target information, the driving target of the object to be decided can be effectively determined, providing an effective basis for subsequently determining the planning strategies.
Given the driving perception information and the driving target information, planning strategies for operating and controlling the object to be decided (including but not limited to navigation, braking, acceleration, following, lane changing, etc.) can be obtained through a suitable algorithm or model, such as a reinforcement learning network model. In the embodiment of the present application, a plurality of planning strategies is obtained, and these planning strategies conform to a Gaussian mixture distribution. In the embodiment of the present application, a Gaussian mixture distribution means a probability distribution output by a Gaussian Mixture Model (GMM), which is a linear combination of multiple Gaussian distribution functions and therefore has multiple peaks corresponding to the multiple planning strategies. In addition, the effectiveness of each planning strategy also needs to be evaluated; thus, for the plurality of planning strategies, strategy evaluations can be obtained, which may take the form of a rating, a score, an assessment of quality and the like, so as to determine how effective the object to be decided is likely to be if it executes the planning strategy.
In one possible approach, the plurality of planning strategies and the evaluation of each planning strategy may be obtained through a policy value model. For example, a policy value model generally includes a policy network part and a value network part. The policy network part may take the form of a Mixture Density Network (MDN) and is used to output a plurality of planning strategy instructions (such as probability distribution information) conforming to a Gaussian mixture distribution; the value network part can evaluate the plurality of planning strategies generated according to the planning strategy instructions output by the policy network part and output the strategy evaluation corresponding to each planning strategy. In this way, a plurality of planning strategies and their corresponding strategy evaluations can be generated efficiently and quickly, providing an effective basis for the subsequent decision planning of the object to be decided. Generating the plurality of planning strategies from the planning strategy instructions output by the policy network part can be implemented by those skilled in the art using an appropriate algorithm according to the actual situation; in one feasible way, they may be generated from the Gaussian-mixture-based probability distribution in combination with MCTS.
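A minimal sketch of such a policy value model head, assuming a PyTorch implementation in which the policy network part is a mixture density network outputting GMM parameters and the value network part outputs a scalar evaluation (layer sizes, the two-dimensional action and the single linear layers are illustrative assumptions):

```python
# Illustrative sketch; architecture details are assumptions, not the patent's model.
import torch
import torch.nn as nn

class MixtureDensityPolicyHead(nn.Module):
    """Outputs GMM parameters (weights, means, std devs) over a continuous action."""
    def __init__(self, feature_dim: int, n_components: int = 3, action_dim: int = 2):
        super().__init__()
        self.n_components = n_components
        self.action_dim = action_dim
        self.net = nn.Linear(feature_dim, n_components * (1 + 2 * action_dim))

    def forward(self, features):
        out = self.net(features)
        logits, mu, log_sigma = torch.split(
            out,
            [self.n_components,
             self.n_components * self.action_dim,
             self.n_components * self.action_dim],
            dim=-1)
        weights = torch.softmax(logits, dim=-1)                 # mixture weights
        mu = mu.view(-1, self.n_components, self.action_dim)    # component means
        sigma = log_sigma.view(-1, self.n_components, self.action_dim).exp()
        return weights, mu, sigma

class ValueHead(nn.Module):
    """Scalar evaluation of a state/strategy, used as the strategy evaluation."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Linear(feature_dim, 1)

    def forward(self, features):
        return self.net(features)
```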
In addition, in one feasible manner of the embodiment of the application, the plurality of planning strategies conforming to the Gaussian mixture distribution can be obtained directly based on the driving perception information and the driving target information. In another feasible manner, they may be obtained based on feature data of the driving perception information together with the driving target information; the feature data may be obtained by performing feature extraction and fusion on the multiple items of driving perception information, so that it comprehensively represents the driving perception situation of the object to be decided while emphasizing its characteristic features. Optionally, the extraction and generation of the feature data of the driving perception information may be implemented by a Graph Neural Network (GNN): the driving perception information is input into the GNN, which performs feature extraction and feature fusion based on a multi-head self-attention mechanism and outputs a fusion feature vector corresponding to the driving perception information.
A conventional GNN processes the input data as a whole, which involves a large amount of data processing. When the output of the GNN is combined with a reinforcement learning network model, and especially with an MCTS-based reinforcement learning network model, the processing burden of the GNN increases further, because MCTS needs to repeatedly execute simulation and inference processes. In order to reduce the data processing burden of the GNN and improve processing efficiency, in an embodiment of the present application the GNN is set to include a geometry sub-graph layer, a driving track sub-graph layer, a map sub-graph layer, a pooling layer and a global layer.
The geometric sub-graph layer is used for extracting the characteristics of geometric information, the driving track sub-graph layer is used for extracting the characteristics of historical driving track information, and the map sub-graph layer is used for extracting the characteristics of map information; the pooling layer is used for respectively carrying out feature aggregation on features extracted from the geometric sub-graph layer, the driving track sub-graph layer and the map sub-graph layer; and the global layer is used for carrying out multi-head self-attention processing on the aggregated features respectively obtained by the geometric sub-layer, the driving track sub-layer and the map sub-layer to obtain a fusion feature vector.
Further, for the geometric information, vectors corresponding to the coordinates of the four corners and the center of the object to be decided are adopted; for the historical driving track information, time-series coding vectors of the position and orientation of the object to be decided over the five time steps nearest to the current time are adopted; for the map information, road topology data are adopted: the road boundary is first discretized and divided at intervals of 5 m to form different sub-graphs, and the road boundary points are then connected to construct road information vectors. This further facilitates GNN processing and improves GNN processing efficiency.
As described above, different driving perception information is processed by different sub-graph layers. Within each sub-graph layer, the corresponding input vectors first undergo feature extraction through a fully connected layer, and all the feature data from the different nodes processed in this pass are then aggregated through a max-pooling layer to obtain aggregated features. For example, the geometry sub-graph layer performs feature extraction on the multiple items of geometric information processed in this pass and then aggregates the features, obtaining aggregated features corresponding to the multiple items of geometric information; the driving track sub-graph layer performs feature extraction on the multiple items of historical driving track information and then aggregates them, obtaining the corresponding aggregated features; and the map sub-graph layer performs feature extraction on the multiple items of road topology information and then aggregates them, obtaining the corresponding aggregated features. The aggregated features output by the sub-graph layers may have a fixed vector length and are then input into the global graph layer. The global graph layer can be realized based on a multi-head self-attention mechanism; after it performs multi-head self-attention processing on the aggregated features input by the three sub-graph layers, the fusion feature vector is obtained.
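A minimal sketch of one sub-graph layer and the global graph layer as described above, assuming a PyTorch implementation (the hidden dimension, number of attention heads and final pooling choice are illustrative assumptions):

```python
# Illustrative sketch; dimensions and pooling choice are assumptions.
import torch
import torch.nn as nn

class SubGraphLayer(nn.Module):
    """Per-node feature extraction followed by max-pooling aggregation,
    as described for the geometry, driving-track and map sub-graph layers."""
    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, node_vectors):          # (num_nodes, in_dim)
        feats = self.fc(node_vectors)         # per-node feature extraction
        agg, _ = feats.max(dim=0)             # max pooling -> fixed-length vector
        return agg                            # (hidden_dim,)

class GlobalGraphLayer(nn.Module):
    """Multi-head self-attention over the aggregated sub-graph features."""
    def __init__(self, hidden_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, subgraph_feats):        # (num_subgraphs, hidden_dim)
        x = subgraph_feats.unsqueeze(0)       # add batch dimension
        fused, _ = self.attn(x, x, x)
        # Pool to a single fusion feature vector (pooling choice is an assumption).
        return fused.mean(dim=1).squeeze(0)
```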
From the above description, in practical applications the above improvements to the GNN and to the policy value model can be used individually, but can also be used together so as to combine the advantages of both and achieve a better decision planning effect. Based on this, in one feasible manner, this step can be implemented as follows: the driving perception information is input into the GNN, which performs feature extraction and feature fusion based on a multi-head self-attention mechanism to obtain the fusion feature vector corresponding to the driving perception information; the fusion feature vector and the vector corresponding to the driving target information of the object to be decided are then input into the policy value model, which outputs a plurality of planning strategy instructions conforming to a Gaussian mixture distribution, and the strategy evaluation corresponding to each planning strategy generated according to those planning strategy instructions.
Through the process, a plurality of planning strategies and corresponding strategy evaluation are obtained, and further decision planning can be carried out based on the planning strategies.
Step S306: and performing decision planning on the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
Since each planning strategy has a corresponding strategy evaluation, a more preferable planning strategy can be selected according to the evaluation value, grade or quality indicated by the strategy evaluation, and decision planning is then carried out based on the selected planning strategy. For example, operation instructions are sent to the object to be decided, such as the degree of throttle to apply, the use of the brake, the rotation angle of the steering wheel and the like, so as to direct the operation of the object to be decided in the strong interaction scenario and make an effective decision.
In the embodiments of the present application, terms such as "a plurality of" and "multiple" mean two or more unless otherwise specified.
Hereinafter, the above-described process is exemplified by a specific example, as shown in fig. 3B.
In this example, the object to be decided is an autonomous vehicle X, and it is assumed that vehicle X meets a manually driven vehicle Y on a narrow road. As shown in fig. 3B, the driving perception information of vehicle X in the continuous behavior space is first acquired, including the contour information of vehicle X, the contour information of vehicle Y, the contour information of the road edge, and the like. Then the driving target information of vehicle X is obtained; since vehicle X needs to meet vehicle Y, the driving target can be a position 2 meters in front of the vehicle head, on the side of vehicle X close to vehicle Y and at 30 degrees to the vehicle body direction, as shown by the solid dot in fig. 3B. Both the driving perception information and the driving target information are input to an agent carried by vehicle X, such as the controller of vehicle X. A policy value model is set in the controller, and a plurality of planning strategies are output according to the input driving perception information and driving target information; in this example, three planning strategies are assumed, namely planning strategies 1, 2 and 3. Suppose that planning strategy 1 is to go straight for 1 meter and then 1 meter to the left front, planning strategy 2 is to go 1 meter to the left front and then straight for 1 meter, and planning strategy 3 is to go 2 meters to the left front. It is further assumed that, through the policy value model, the controller predicts the strategy evaluations as strategy scores: the score of planning strategy 1 is 0.6, the score of planning strategy 2 is 0.8, and the score of planning strategy 3 is 0.5. Based on the above assumptions, the controller determines to use planning strategy 2, takes planning strategy 2 as the decision, and generates the commands corresponding to the decision plan for vehicle X, such as turning the steering wheel 30 degrees to the left and reducing the throttle by 30%, and, after driving 1 meter in this state, returning the steering wheel to 0 degrees (straight ahead) and keeping the throttle to drive another 1 meter. At this point vehicle X will meet vehicle Y, with the new position of vehicle X as shown in fig. 3B.
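The final selection step in this example reduces to picking the strategy with the highest score, e.g. (the scores repeat the assumed values above):

```python
# Illustrative only: pick the planning strategy with the highest score.
strategy_scores = {"strategy_1": 0.6, "strategy_2": 0.8, "strategy_3": 0.5}
best = max(strategy_scores, key=strategy_scores.get)
print(best)  # strategy_2 -> used as the basis of the decision plan
```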
It can thus be seen that, according to this embodiment, for strong interaction scenarios of automatic driving, on the one hand, driving perception information of a continuous behavior space is used; because of its continuity, this information has a finer data granularity, so the planning strategies determined based on it also have a finer granularity, making them better suited to decision processing in strong interaction scenarios. On the other hand, multiple planning strategies are obtained for the strong interaction scenario, and these planning strategies conform to a Gaussian mixture distribution; they therefore have higher executability and rationality, can effectively cope with the different possible operations of the other party, and better meet the requirements of strong interaction. Therefore, with the scheme of this embodiment, decision planning can be performed effectively in strong interaction scenarios in automatic driving, and the decision effect is improved.
Example two
Referring to fig. 4A, a flowchart illustrating steps of a decision planning method for automatic driving according to a second embodiment of the present application is shown.
This embodiment mainly describes training the reinforcement learning network model in combination with MCTS. The trained reinforcement learning network model can be applied to the decision planning scheme for automatic driving of the foregoing embodiment, so as to perform effective decision planning on the object to be decided.
The decision planning method for automatic driving of the embodiment comprises the following steps:
step S402: and training a strategy value model in the reinforcement learning network model.
As shown in fig. 2, the reinforcement learning network model in the embodiment of the present application includes a GNN model part and a strategic value model part, where the GNN part may be a model trained in advance, and therefore, the training of the strategic value model is described in this embodiment.
Specifically, the policy value model may be trained based on decision planning supervision information generated by MCTS. An MCTS-based reinforcement learning network model is shown in fig. 4B. As can be seen from fig. 4B, the policy branch P and the value branch V of the policy value model in the reinforcement learning network model are both realized based on MCTS. In the embodiment of the present application, the input of the policy value model is the driving perception information and the driving target information, and its output is the probability and evaluation of each feasible action (planning strategy, in this embodiment) given that input. Training aims to make the action probabilities output by the policy value model closer to the probabilities output by MCTS, so the output of MCTS can be regarded as the supervision information of the policy value model.
Traditional MCTS-based policy value models are mostly designed for discrete behavior spaces; besides insufficient action granularity, they are prone to getting stuck in certain scenarios and cannot be applied to the scheme of the embodiment of the present application. Therefore, when training the policy value model, in each iterative training the embodiment of the present application obtains the information (such as probabilities) of a plurality of planning strategy samples output by MCTS based on driving perception sample data of the continuous behavior space, driving target sample information, and KR-AUCB (a kernel-regression-based asymptotic upper confidence bound); the information (such as the probability) of the planning strategy sample with the highest strategy evaluation among the plurality of planning strategy samples is then used as supervision information to train the policy value model.
MCTS is a planning algorithm that constructs a tree structure based on the Monte Carlo method for inference and exploration, and it is usually combined with neural network models and reinforcement learning. MCTS typically includes several processes: select, expand, evaluate and backup. The select process recursively selects a child node, starting from the root node R of the Monte Carlo tree, until a leaf node L is reached. This process involves choosing the next node based on the current node; a commonly used approach is UCB (Upper Confidence Bound), but UCB is inefficient. For this reason, the embodiment of the present application provides an efficient node selection approach, namely KR-AUCB, which can be used in both the select process and the expand process and is described in detail below. The expand process creates a new child node C under the leaf node L if the episode at leaf node L is not finished (i.e., if driving is to continue). In the conventional approach, usually only one new child node C is created, which limits the newly created node data and the node paths (strategies) generated from it; this is also improved in the embodiment of the present application, so that multiple child nodes can be created based on the GMM probability distribution output by the policy branch, expanding the generated strategies and improving strategy generation efficiency, as described in detail below. The evaluate process simulates, according to the generated strategy, the corresponding actions from the position of the newly expanded child node to a final result, so as to calculate how good the newly created node is. The backup process propagates this value backwards along the traversal path and updates the value of each ancestor of the newly created node.
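For orientation, a generic sketch of the four MCTS phases follows; it uses a plain UCB-style score and does not reproduce the patent's KR-AUCB formula or its multi-child expansion (the node structure and constants are illustrative assumptions):

```python
# Illustrative, generic MCTS skeleton; not the patent's KR-AUCB variant.
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value_sum = [], 0, 0.0

def select(node, c=1.4):
    """Recursively descend using a UCB-style score until a leaf is reached."""
    while node.children:
        node = max(node.children,
                   key=lambda ch: (ch.value_sum / (ch.visits + 1e-9)
                                   + c * math.sqrt(math.log(node.visits + 1)
                                                   / (ch.visits + 1e-9))))
    return node

def expand(node, candidate_actions, transition):
    """Create child nodes for a leaf (one per candidate action)."""
    for a in candidate_actions:
        node.children.append(Node(transition(node.state, a), parent=node))
    return random.choice(node.children)

def backup(node, value):
    """Propagate the evaluation (from a rollout or a value branch) back up."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent
```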
Hereinafter, the scheme of the embodiment of the present application will be described in detail with respect to the modification of the select process and the expand process.
In one feasible manner provided by the embodiment of the present application, obtaining the information of the plurality of planning strategy samples output by MCTS based on the driving perception sample data of the continuous behavior space, the driving target sample information, and KR-AUCB (a kernel-regression-based asymptotic upper confidence bound) may include: based on the driving perception sample data of the continuous behavior space and the driving target sample information, selecting nodes from the corresponding MCT (Monte Carlo tree) using KR-AUCB to form an initial planning strategy path; creating a plurality of child nodes for the leaf node of the initial planning strategy path according to a plurality of action samples, conforming to a Gaussian mixture distribution, output by the reinforcement learning network model; obtaining a plurality of extended planning strategy paths based on the plurality of created child nodes and the initial planning strategy path; performing planning strategy simulation on the plurality of extended planning strategy paths to obtain the strategy evaluation corresponding to each extended planning strategy path; and outputting the information of the plurality of planning strategy samples according to each extended planning strategy path and its corresponding strategy evaluation.
In each iterative training, selecting nodes from the corresponding MCT using KR-AUCB to form an initial planning strategy path may include: first, selecting a node from the MCT, usually the node with the largest KR-AUCB value (initially this is an unvisited node; after the policy value model has been iteratively trained multiple times, the node with the largest KR-AUCB value may be either an unvisited or a visited node); for each level of the at least one level of non-leaf nodes corresponding to that node, selecting a non-leaf node whose KR-AUCB value is higher than, or whose visit count is lower than, that of its sibling nodes; selecting a leaf node (which may be the leaf node with the largest value or a randomly selected leaf node) from the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes; and forming the initial planning strategy path from the nodes selected at each level. The KR-AUCB value can be calculated according to formula one below.
As shown in fig. 4B, in the selection process of the MCTS, a node is selected through KR-AUCB based on the driving perception sample data and the driving target sample information of the continuous behavior space.
In the KR-AUCB approach, an unvisited node is first selected; for each level of the at least one level of non-leaf nodes corresponding to that unvisited node, a non-leaf node whose prior probability is higher than, or whose visit count is lower than, that of its sibling nodes is selected; and the leaf node with the largest value is selected from the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes.
Optionally, KR-AUCB may be expressed in the form of formula one below, with auxiliary definitions given by formulas two to five. (Formulas one to five are rendered as images in the original publication and are not reproduced here.) In these formulas: a denotes the selected action (e.g., an action node in the MCT); a' denotes an existing sibling action; K(·, ·) denotes the Gaussian kernel (probability) density; E[V | a] denotes the expectation of V given a, where V is the output of the value branch of the policy value model; λ denotes the expansion parameter used when expanding a node; the prior strategy term is used to asymptotically control the exploration attenuation; W(·) denotes the visit count of a node and N denotes the number of nodes; U(A) denotes the uniform distribution over the action space A; and p denotes the prior action probability distribution output by the policy branch of the policy value model.
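Since formulas one to five are only available as images, the exact KR-AUCB expression is not reproduced above; the sketch below merely illustrates the general idea of a kernel-regression-smoothed value estimate with an exploration bonus over continuous sibling actions, and its kernel choice, weighting and constants are assumptions rather than the patent's formula:

```python
# Illustrative kernel-smoothed UCB-style score; not the patent's KR-AUCB formula.
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel density between two continuous actions (illustrative)."""
    d = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d / (2.0 * bandwidth ** 2))

def kernel_ucb_score(action, siblings, c=1.0):
    """Kernel-regression-smoothed value estimate plus an exploration bonus.
    `siblings` is a list of (sibling_action, visit_count, value_estimate)."""
    weights = [gaussian_kernel(action, a) for a, _, _ in siblings]
    w_visits = sum(w * n for w, (_, n, _) in zip(weights, siblings))
    w_value = sum(w * n * v for w, (_, n, v) in zip(weights, siblings))
    total_visits = sum(n for _, n, _ in siblings) + 1
    mean_value = w_value / (w_visits + 1e-9)      # kernel-smoothed value expectation
    exploration = c * math.sqrt(math.log(total_visits) / (w_visits + 1e-9))
    return mean_value + exploration
```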
When making the initial node selection, nodes in the MCT may be selected based on formula one above to form the initial planning strategy path, such as the path formed by the light gray nodes indicated by the solid line on the left in fig. 4B, where the leaf node is the node indicated by the solid circle at the bottom left, denoted L.
It should be noted that the above formula also applies to the expand process of MCTS, where it allows nodes to be selected from the newly created nodes more efficiently.
After the initial planning strategy path has been formed by selecting nodes from the MCT through the above process, the select process of MCTS can be considered complete. The expand process can then be performed.
In the expand process of the embodiment of the present application, nodes also need to be expanded, that is, lower-level child nodes are newly created under the leaf node. Unlike traditional MCTS, which creates one new child node at a time, in the embodiment of the present application a plurality of child nodes can be created at once based on the Gaussian mixture distribution output by the policy branch of the policy value model.
To facilitate the description of this process, the policy branch in the embodiment of the present application is described first. The policy branch is implemented as a Mixture Density Network (MDN), which models the probability distribution over actions, i.e., the Gaussian mixture distribution, by outputting the parameters of a Gaussian Mixture Model (GMM). As can be seen from the MCT in the middle of fig. 4B, applying this probability distribution output by the policy branch to the expand process of MCTS creates multiple child nodes for node L at the same time, two in the example of fig. 4B.
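A minimal sketch of drawing several candidate child actions at once from the GMM parameters output by the policy branch (the sampling scheme is an illustrative assumption):

```python
# Illustrative sampling of child actions from GMM parameters; an assumption, not the patent's procedure.
import numpy as np

def sample_child_actions(weights, means, sigmas, n_children=2, rng=None):
    """Draw several candidate actions from the Gaussian mixture output by the policy branch.
    weights: (K,), means: (K, action_dim), sigmas: (K, action_dim)."""
    rng = rng or np.random.default_rng()
    actions = []
    for _ in range(n_children):
        k = rng.choice(len(weights), p=weights)          # pick a mixture component
        actions.append(rng.normal(means[k], sigmas[k]))  # sample an action from it
    return actions
```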
In the embodiment of the application, when selecting nodes from the newly created child nodes, the traditional Bayesian inference approach is improved so that effective nodes can be selected more quickly. Based on this, in one feasible manner, obtaining a plurality of extended planning strategy paths based on the created child nodes and the initial planning strategy path may include: fitting the information of each of the created child nodes with a Gaussian process function, and obtaining the candidate degree of each child node from the fitted Gaussian-process mean and standard deviation and from the distance between that child node and the other child nodes; selecting candidate child nodes from the plurality of child nodes according to the candidate degree of each child node; and obtaining the plurality of extended planning strategy paths from the selected candidate child nodes and the initial planning strategy path. Obtaining the candidate degree of a child node from the fitted Gaussian-process mean and standard deviation and the distances to the other child nodes can also be regarded as constructing a potential function in the Bayesian inference manner, with the result used as the candidate degree of the child node.
One specific example process is as follows.
Taking node b as an example, node b is expanded and new child nodes are created for it. First, a candidate action node is acquired from the existing action branches Ch(b) of node b (the existing actions are denoted a in formula six), according to formula six. (Formulas six to eight are rendered as images in the original publication and are not reproduced here.)
Here a(·) denotes an acquisition function, used to direct acquisition towards regions where the probability of finding the best node may increase. Based on this, the acquisition function is defined by formula seven, with an auxiliary definition given by formula eight. In formula seven, a denotes the candidate action node (e.g., an action node in the MCT); the two Gaussian-process terms denote the GP posterior mean and standard deviation of the candidate a; and the remaining term measures the distance between a and the actions in the other branches. The first two terms of formula seven can be regarded as favouring the selection of a, while the last term is a penalty term that penalizes candidate actions that lie too close to existing actions. The two adjustable coefficients in the formula are used to balance node expansion.
After a candidate action has been acquired, it is added to the set of visited action nodes. An action node to be taken is then selected based on KR-AUCB. If the acquired candidate is not the one selected, it is deleted from the set. Next, the action corresponding to the selected action node is executed, and a new state is generated at the next node level. After one iteration, the expected value of the leaf node is given by the value branch of the policy value model. Finally, the value of every node along the whole traversal path is updated by back-propagation.
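Formulas six to eight are likewise only available as images, so the sketch below only illustrates the behaviour described for the acquisition function (favouring a high Gaussian-process posterior mean and high uncertainty while penalizing candidates that are too close to existing actions); its exact form and coefficients are assumptions:

```python
# Illustrative acquisition score; not the patent's formula seven.
import math

def acquisition_score(mu, sigma, candidate, existing_actions, k1=1.0, k2=1.0):
    """Favour high GP posterior mean (mu) and high uncertainty (sigma),
    penalize closeness to actions already present in the other branches."""
    if existing_actions:
        min_dist = min(math.dist(candidate, a) for a in existing_actions)
        closeness_penalty = 1.0 / (min_dist + 1e-6)
    else:
        closeness_penalty = 0.0
    return mu + k1 * sigma - k2 * closeness_penalty
```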
Therefore, based on the above process, within the select, expand, evaluate and backup framework of MCTS, a Gaussian process function is used to fit the information of the child nodes, and the most promising new child nodes are inferred using the improved Bayesian inference scheme, which effectively improves the information utilization and efficiency of the expand process. In addition, if a newly created node is not selected in the select process, it is deleted; this avoids dependence on preset hyperparameters in the expansion process. Overall, the MCTS process described above improves the accuracy of both the evaluate process and the select process.
Based on the above process, relatively good continuous decision trajectories, i.e. planning strategies, are generated through MCTS inference and used as supervision information to supervise the reinforcement learning of the policy value model.
During model training, each MCTS step infers a plurality of planning strategies (e.g., 100), of which the most-visited one is used to train the policy value model.
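A minimal sketch of one supervised update of the policy value model from MCTS-generated labels, assuming an AlphaZero-style loss (KL divergence to the MCTS action distribution plus value regression); the model interface and the loss composition are assumptions, not the patent's exact training objective:

```python
# Illustrative training step; model interface and loss are assumptions.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, features, target, mcts_action_probs, mcts_value):
    """One supervised update of the policy value model from MCTS-generated labels."""
    pred_log_probs, pred_value = model(features, target)   # hypothetical model interface
    policy_loss = F.kl_div(pred_log_probs, mcts_action_probs, reduction="batchmean")
    value_loss = F.mse_loss(pred_value.squeeze(-1), mcts_value)
    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```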
After the model training is complete, it can be applied to the actual decision plan, as described in the following steps.
Step S404: and acquiring the driving perception information of the object to be decided in the continuous behavior space.
Wherein the driving perception information includes: geometric information, historical travel track information and map information related to the object to be decided.
Step S406: and inputting the driving perception information into GNN to perform feature extraction and feature fusion based on a multi-head self-attention mechanism through the GNN so as to obtain a fusion feature vector corresponding to the driving perception information.
In this embodiment, a mode of processing the driving perception information by using the GNN is adopted to extract the relevant features more efficiently.
Step S408: and inputting the vector corresponding to the driving target information of the fusion characteristic vector and the object to be decided into a strategy value model, obtaining a plurality of planning strategy instructions according with mixed Gaussian distribution through the strategy value model, and generating strategy evaluation corresponding to each planning strategy according to the planning strategy instructions.
In this step, the strategy branch in the strategy value model outputs GMM parameters to model probability distribution, i.e. gaussian mixture distribution, output by action, and a plurality of planning strategies are generated based on the distribution and the MCTS process. And evaluating the plurality of planning strategies through the value branches in the strategy value model to obtain strategy evaluation of each planning strategy. The specific implementation of the value branch may refer to an implementation manner of the value branch in the current policy value model, which is not limited in the embodiment of the present application.
Step S410: and performing decision planning on the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
For example, the planning strategy with the highest strategy evaluation may be selected from the plurality of planning strategies to generate a decision plan, and the object to be decided is then instructed to operate according to the decision plan.
The descriptions of steps S404 to S410 above are brief; for their specific implementation, reference may be made to the descriptions of the relevant steps in the first embodiment and to the description of step S402 above, which will not be repeated here.
According to this embodiment, for strong interaction scenarios of automatic driving, on the one hand, driving perception information of a continuous behavior space is used; because of its continuity, this information has a finer data granularity, so the planning strategies determined based on it also have a finer granularity, making them better suited to decision processing in strong interaction scenarios. On the other hand, multiple planning strategies are obtained for the strong interaction scenario, and these planning strategies conform to a Gaussian mixture distribution; they therefore have higher executability and rationality, can effectively cope with the different possible operations of the other party, and better meet the requirements of strong interaction. Therefore, with the scheme of this embodiment, decision planning can be performed effectively in strong interaction scenarios in automatic driving, and the decision effect is improved.
EXAMPLE III
Referring to fig. 5, a block diagram of a decision planning apparatus for automatic driving according to a third embodiment of the present application is shown.
The decision planning device for automatic driving of the embodiment comprises: a first obtaining module 502, configured to obtain driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes: geometric information, historical driving track information and map information related to the object to be decided; a second obtaining module 504, configured to obtain, according to the driving awareness information and the driving target information, a plurality of planning strategies conforming to gaussian mixture distribution and a policy evaluation corresponding to each planning strategy; and the planning module 506 is configured to perform decision planning on the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
Optionally, the second obtaining module 504 is configured to input the driving perception information into a graph neural network model, so as to perform feature extraction and feature fusion based on a multi-head self-attention mechanism through the graph neural network model and obtain a fusion feature vector corresponding to the driving perception information; and to input the fusion feature vector and the vector corresponding to the driving target information of the object to be decided into a policy value model, so as to obtain, through the policy value model, a plurality of planning strategy instructions conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy generated according to the planning strategy instructions.
Optionally, the policy value model includes a policy network part and a value network part; the policy network part is a mixture density network configured to output a plurality of planning strategy instructions conforming to a Gaussian mixture distribution; the value network part is configured to evaluate a plurality of planning strategies generated according to the planning strategy instructions output by the policy network part and to output the strategy evaluation corresponding to each planning strategy.
Optionally, the graph neural network model includes a geometry sub-graph layer, a driving track sub-graph layer, a map sub-graph layer, a pooling layer and a global layer, wherein: the geometry sub-graph layer is configured to extract features of the geometric information, the driving track sub-graph layer is configured to extract features of the historical driving track information, and the map sub-graph layer is configured to extract features of the map information; the pooling layer is configured to perform feature aggregation separately on the features extracted by the geometry sub-graph layer, the driving track sub-graph layer and the map sub-graph layer; and the global layer is configured to perform multi-head self-attention processing on the aggregated features respectively obtained from the three sub-graph layers to obtain the fusion feature vector.
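For illustration, the following sketch shows one way such a model could be assembled: three sub-graph encoders, max pooling within each sub-graph, and a multi-head self-attention layer fusing the pooled features. The input dimensions, the choice of max pooling and all names are assumptions rather than the structure mandated by this embodiment.

import torch
import torch.nn as nn

class SubGraphLayer(nn.Module):
    # Encodes one information type (geometry / trajectory / map) given as a set of element vectors.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, x):                                    # x: (batch, n_elements, in_dim)
        return self.mlp(x)

class FusionGNN(nn.Module):
    def __init__(self, dims=(4, 6, 8), d_model=64, n_heads=4):
        super().__init__()
        self.geo = SubGraphLayer(dims[0], d_model)           # geometry sub-graph layer
        self.traj = SubGraphLayer(dims[1], d_model)          # driving track sub-graph layer
        self.map = SubGraphLayer(dims[2], d_model)           # map sub-graph layer
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # global layer

    def forward(self, geo_x, traj_x, map_x):
        # Pooling layer: aggregate each sub-graph's element features into one token.
        tokens = torch.stack([self.geo(geo_x).max(dim=1).values,
                              self.traj(traj_x).max(dim=1).values,
                              self.map(map_x).max(dim=1).values], dim=1)   # (batch, 3, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)         # multi-head self-attention over the tokens
        return fused.mean(dim=1)                             # fusion feature vector (batch, d_model)

# Usage with toy inputs.
gnn = FusionGNN()
vec = gnn(torch.randn(1, 10, 4), torch.randn(1, 20, 6), torch.randn(1, 30, 8))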
Optionally, the decision planning apparatus for automatic driving of this embodiment further includes: a training module 508, configured to train the policy value model based on decision planning supervision information generated by the MCTS.
Optionally, the training module 508 is configured to, in each iterative training, obtain information of a plurality of planning strategy samples output by the MCTS based on driving perception sample data of the continuous behavior space, driving target sample information and KR-AUCB; and to train the policy value model by taking, among the plurality of planning strategy samples, the information of the planning strategy sample with the highest strategy evaluation as supervision information.
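A minimal sketch of one such iterative training step is given below. It assumes a PyTorch policy value model of the kind sketched earlier (returning a Gaussian mixture and a scalar value); the particular loss composition (mixture negative log-likelihood plus value regression) is an assumption, not a detail disclosed here.

def train_step(model, optimizer, features, mcts_samples):
    # mcts_samples: (action_tensor, evaluation) pairs output by the MCTS for this state.
    # Assumed supervision rule: the sample with the highest strategy evaluation supervises both branches.
    best_action, best_eval = max(mcts_samples, key=lambda s: s[1])
    gmm, value = model(features)                      # a policy value model as sketched above
    policy_loss = -gmm.log_prob(best_action).mean()   # fit the Gaussian mixture to the selected action
    value_loss = (value - best_eval).pow(2).mean()    # regress the value branch toward its evaluation
    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()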
Optionally, the training module 508 obtaining the information of the plurality of planning strategy samples output by the MCTS based on the driving perception sample data of the continuous behavior space, the driving target sample information and KR-AUCB includes: selecting, based on the driving perception sample data and the driving target sample information of the continuous behavior space, nodes from the corresponding MCT using KR-AUCB to form an initial planning strategy path; creating a plurality of child nodes for the leaf node of the initial planning strategy path according to a plurality of action samples conforming to a Gaussian mixture distribution output by a reinforcement network model; obtaining a plurality of extended planning strategy paths based on the plurality of created child nodes and the initial planning strategy path; performing planning strategy simulation on the plurality of extended planning strategy paths to obtain the strategy evaluation corresponding to each extended planning strategy path; and outputting the plurality of planning strategy samples according to each extended planning strategy path and its corresponding strategy evaluation.
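The following sketch illustrates the overall flow of generating planning strategy samples (selection, expansion with Gaussian-mixture action samples, simulation, output). The select_path and simulate callables, the node structure and all parameters are placeholders; the KR-AUCB selection and the Gaussian-process candidate filtering described in the following paragraphs are not reproduced here.

import numpy as np

class Node:
    def __init__(self, action=None, parent=None, prior=1.0):
        self.action, self.parent, self.prior = action, parent, prior
        self.children, self.visits, self.value_sum = [], 0, 0.0

def sample_gmm_actions(weights, means, stds, k, rng):
    # Draw k action samples from the Gaussian mixture output by the (assumed) reinforcement network model.
    comps = rng.choice(len(weights), size=k, p=weights)
    return means[comps] + stds[comps] * rng.standard_normal((k, means.shape[1]))

def generate_planning_samples(root, gmm_params, select_path, simulate, k_children=8, seed=0):
    rng = np.random.default_rng(seed)
    path = select_path(root)                                    # selection: initial planning strategy path
    leaf = path[-1]
    for a in sample_gmm_actions(*gmm_params, k_children, rng):  # expansion: children from GMM action samples
        leaf.children.append(Node(action=a, parent=leaf))
    samples = []
    for child in leaf.children:                                 # simulation: evaluate each extended path
        evaluation = simulate(path + [child])
        child.visits += 1
        child.value_sum += evaluation
        samples.append(([n.action for n in path[1:]] + [child.action], evaluation))
    return samples                                              # planning strategy samples with evaluations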
Optionally, the training module 508 obtaining the plurality of extended planning strategy paths based on the plurality of created child nodes and the initial planning strategy path includes: fitting the information of each of the created child nodes using a Gaussian process function, and obtaining the candidate degree of each child node according to the mean and standard deviation of the fitted Gaussian process and the distance between the child node and the other child nodes; selecting candidate child nodes from the plurality of child nodes according to the candidate degree of each child node; and obtaining the plurality of extended planning strategy paths according to the selected candidate child nodes and the initial planning strategy path.
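As an illustrative sketch of this candidacy computation, one could fit a Gaussian process to the child nodes and combine its mean, its standard deviation and the distance to the nearest sibling. The specific weighting below, and the use of scikit-learn's GaussianProcessRegressor, are assumptions, since the exact formula is not given in this paragraph.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def child_candidacy(child_actions, child_values, beta=1.0, gamma=0.1):
    # child_actions: (n, d) actions of the created child nodes; child_values: (n,) their evaluations.
    gp = GaussianProcessRegressor().fit(child_actions, child_values)   # fit a Gaussian process to the children
    mean, std = gp.predict(child_actions, return_std=True)
    # Diversity term: distance of each child to its nearest sibling.
    dists = np.linalg.norm(child_actions[:, None, :] - child_actions[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.min(axis=1)
    # Assumed combination: exploit the GP mean, explore where the GP is uncertain, prefer spread-out children.
    return mean + beta * std + gamma * nearest

def select_candidates(child_actions, child_values, top_k=3):
    scores = child_candidacy(np.asarray(child_actions, float), np.asarray(child_values, float))
    return np.argsort(scores)[::-1][:top_k]        # indices of the candidate child nodes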
Optionally, the training module 508 selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning strategy path includes: first selecting an unvisited node from the MCT; for each level of at least one level of non-leaf nodes corresponding to the unvisited node, selecting a non-leaf node whose prior probability is higher than that of the other sibling nodes or whose number of visits is lower than that of the other sibling nodes; selecting a maximum leaf node based on the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes; and forming the initial planning strategy path from the selected nodes of each level.
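A sketch of walking the tree to form the initial path is shown below. Since the KR-AUCB score itself is not spelled out here, a standard PUCT-style score favouring a high prior probability and a low visit count is used as a stand-in, and the node fields assumed are those of the Node sketch above.

import math

def select_initial_path(root, c=1.4):
    # Walk down from the root, at each level preferring a child with a high prior probability
    # or a low visit count; the PUCT-style score below is only a placeholder for KR-AUCB.
    path = [root]
    node = root
    while node.children:
        def score(ch, parent=node):
            q = ch.value_sum / ch.visits if ch.visits else 0.0
            u = c * ch.prior * math.sqrt(parent.visits + 1) / (1 + ch.visits)
            return q + u
        node = max(node.children, key=score)
        path.append(node)
    return path                                    # initial planning strategy path ending at a leaf node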
The decision planning apparatus for automatic driving of this embodiment is used to implement the corresponding decision planning method for automatic driving in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the decision planning apparatus for automatic driving of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not repeated here.
EXAMPLE IV
Referring to fig. 6, a schematic structural diagram of an electronic device according to a fourth embodiment of the present application is shown, and the specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 6, the electronic device may include: a processor (processor)602, a communication Interface 604, a memory 606, and a communication bus 608.
Wherein:
the processor 602, communication interface 604, and memory 606 communicate with one another via a communication bus 608.
A communication interface 604 for communicating with other electronic devices or servers.
The processor 602 is configured to execute the program 610, and may specifically execute the relevant steps in the above-described decision planning method for automatic driving.
In particular, program 610 may include program code comprising computer operating instructions.
The processor 602 may be a CPU, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device may include one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is configured to store the program 610. The memory 606 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
The program 610 may be specifically configured to cause the processor 602 to execute the decision-making planning method for automatic driving according to any one of the first or second embodiments.
For the specific implementation of each step in the program 610, reference may be made to the corresponding steps and the corresponding descriptions of the units in the foregoing embodiments of the decision planning method for automatic driving, which are not described herein again. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments and achieve the corresponding effects, which are not described herein again.
The embodiment of the present application further provides a computer program product, which includes computer instructions for instructing a computing device to execute an operation corresponding to any one of the above-mentioned decision planning methods for automatic driving in the multiple method embodiments.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the decision planning method for automatic driving described herein. Further, when a general-purpose computer accesses code for implementing the decision planning method for automatic driving shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the decision planning method for automatic driving shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (12)

1. A decision-making planning method for autonomous driving, comprising:
acquiring driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises: geometric information, historical driving track information and map information related to the object to be decided;
obtaining a plurality of planning strategies which accord with Gaussian mixture distribution and strategy evaluation corresponding to each planning strategy according to the driving perception information and the driving target information;
and performing decision planning on the object to be decided according to the planning strategies and the strategy evaluation corresponding to each planning strategy.
2. The method according to claim 1, wherein the obtaining a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy according to the driving perception information and the driving target information comprises:
inputting the driving perception information into a graph neural network model to perform feature extraction and feature fusion based on a multi-head self-attention mechanism through the graph neural network model to obtain fusion feature vectors corresponding to the driving perception information;
and inputting the vector corresponding to the fusion characteristic vector and the driving target information of the object to be decided into a strategy value model, and obtaining a plurality of planning strategy instructions conforming to Gaussian mixture distribution and strategy evaluation corresponding to each planning strategy generated according to the planning strategy instructions through the strategy value model.
3. The method of claim 2, wherein the policy value model comprises a policy network portion and a value network portion; the strategy network part is a mixed density network and is used for outputting a plurality of planning strategy instructions which accord with mixed Gaussian distribution; the value network part is used for evaluating a plurality of planning strategies generated according to the planning strategy instructions output by the strategy network part and outputting strategy evaluation corresponding to each planning strategy.
4. The method of claim 2 or 3, wherein the graph neural network model comprises a geometry sub-layer, a travel trajectory sub-layer, a map sub-layer, a pooling layer, and a global layer;
wherein:
the geometric sub-graph layer is used for extracting the characteristics of the geometric information, the driving track sub-graph layer is used for extracting the characteristics of the historical driving track information, and the map sub-graph layer is used for extracting the characteristics of the map information;
the pooling layer is used for respectively performing feature aggregation on the features extracted from the geometry sub-layer, the driving track sub-layer and the map sub-layer; and the global layer is used for carrying out multi-head self-attention processing on the aggregated features respectively obtained by the geometric sub-layer, the driving track sub-layer and the map sub-layer to obtain a fusion feature vector.
5. The method of claim 2 or 3, wherein the method further comprises:
and training the strategy value model based on decision planning supervision information generated by the MCTS.
6. The method of claim 5, wherein the training of the policy value model based on the MCTS-generated decision-making planning supervision information comprises:
in each iterative training, obtaining information of a plurality of planning strategy samples output by the MCTS based on driving perception sample data, driving target sample information and KR-AUCB of a continuous behavior space;
and training the strategy value model by taking the information of the planning strategy sample with the highest evaluation value of strategy evaluation as supervision information in the planning strategy samples.
7. The method of claim 6, wherein the obtaining the MCTS outputs information of a plurality of planning strategy samples based on driving perception sample data of a continuous behavior space, driving target sample information, and KR-AUCB, comprising:
based on the driving perception sample data and the driving target sample information of the continuous behavior space, selecting nodes from corresponding MCTs by using KR-AUCB to form an initial planning strategy;
creating a plurality of child nodes for leaf nodes of the initial planning strategy according to a plurality of action samples which are output by a reinforced network model and accord with mixed Gaussian distribution;
obtaining a plurality of extended planning strategies based on the plurality of created child nodes and the initial planning strategy;
performing strategy simulation on a plurality of extension planning strategies to obtain strategy evaluation corresponding to each extension planning strategy;
and outputting a plurality of planning strategy samples according to each expansion planning strategy and the corresponding strategy evaluation.
8. The method of claim 7, wherein the obtaining a plurality of extended planning strategies based on the created plurality of child nodes and the initial planning strategy comprises:
fitting the information of each child node in the created child nodes by using a Gaussian process function, and obtaining the candidate degree of the child node according to the mean value and the standard deviation of the Gaussian process after fitting and the distance between the child node and other child nodes;
selecting candidate child nodes from the plurality of child nodes according to the candidate degree of each child node;
and obtaining a plurality of expansion planning strategies according to the selected candidate child nodes and the initial planning strategy.
9. The method of claim 7 or 8, wherein the selecting nodes from corresponding MCTs using KR-AUCB to form an initial planning strategy comprises:
firstly, selecting a node with the maximum KR-AUCB value from MCT;
aiming at each level of non-leaf nodes of at least one level of non-leaf nodes corresponding to the node, selecting non-leaf nodes with KR-AUCB values higher than other same-level child nodes or with access times lower than other same-level child nodes;
selecting a leaf child node based on the leaf node corresponding to the last level non-leaf node in the at least one level non-leaf node;
and forming an initial planning strategy according to the selected nodes of each level.
10. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the decision-making planning method for automated driving according to any of claims 1-9.
11. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the automated driving decision planning method of any of claims 1-9.
12. A computer program product comprising computer instructions that instruct a computing device to perform operations corresponding to the decision-making planning method for autonomous driving according to any of claims 1-9.
CN202111481018.4A 2021-12-07 2021-12-07 Decision planning method for automatic driving, electronic device and computer storage medium Pending CN113879339A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111481018.4A CN113879339A (en) 2021-12-07 2021-12-07 Decision planning method for automatic driving, electronic device and computer storage medium
PCT/CN2022/130733 WO2023103692A1 (en) 2021-12-07 2022-11-08 Decision planning method for autonomous driving, electronic device, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111481018.4A CN113879339A (en) 2021-12-07 2021-12-07 Decision planning method for automatic driving, electronic device and computer storage medium

Publications (1)

Publication Number Publication Date
CN113879339A true CN113879339A (en) 2022-01-04

Family

ID=79015785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111481018.4A Pending CN113879339A (en) 2021-12-07 2021-12-07 Decision planning method for automatic driving, electronic device and computer storage medium

Country Status (2)

Country Link
CN (1) CN113879339A (en)
WO (1) WO2023103692A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694123A (en) * 2022-05-30 2022-07-01 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium
CN115731690A (en) * 2022-11-18 2023-03-03 北京理工大学 Unmanned public transportation cluster decision method based on graph neural network reinforcement learning
CN115762169A (en) * 2023-01-06 2023-03-07 中通新能源汽车有限公司 Intelligent control system and method for unmanned driving of sanitation vehicle
WO2023103692A1 (en) * 2021-12-07 2023-06-15 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for autonomous driving, electronic device, and computer storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114524B (en) * 2023-10-23 2024-01-26 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111123957A (en) * 2020-03-31 2020-05-08 北京三快在线科技有限公司 Method and device for planning track
US20200250486A1 (en) * 2019-01-31 2020-08-06 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
WO2020164237A1 (en) * 2019-02-13 2020-08-20 北京百度网讯科技有限公司 Method and apparatus for driving control, device, medium, and system
EP3699835A1 (en) * 2019-02-19 2020-08-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and server for real-time learning of travelling strategy of driverless vehicle
CN112257872A (en) * 2020-10-30 2021-01-22 周世海 Target planning method for reinforcement learning
CN113496347A (en) * 2020-04-03 2021-10-12 罗伯特·博世有限公司 Apparatus and method for scheduling a set of jobs for a plurality of machines

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020060901A (en) * 2018-10-09 2020-04-16 アルパイン株式会社 Operation information management device
CN109557912B (en) * 2018-10-11 2020-07-28 同济大学 Decision planning method for automatically driving special operation vehicle
CN110471411A (en) * 2019-07-26 2019-11-19 华为技术有限公司 Automatic Pilot method and servomechanism
US11537127B2 (en) * 2019-09-12 2022-12-27 Uatc, Llc Systems and methods for vehicle motion planning based on uncertainty
CN113741412B (en) * 2020-05-29 2023-09-01 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200250486A1 (en) * 2019-01-31 2020-08-06 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
WO2020164237A1 (en) * 2019-02-13 2020-08-20 北京百度网讯科技有限公司 Method and apparatus for driving control, device, medium, and system
EP3699835A1 (en) * 2019-02-19 2020-08-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and server for real-time learning of travelling strategy of driverless vehicle
CN111123957A (en) * 2020-03-31 2020-05-08 北京三快在线科技有限公司 Method and device for planning track
CN113496347A (en) * 2020-04-03 2021-10-12 罗伯特·博世有限公司 Apparatus and method for scheduling a set of jobs for a plurality of machines
CN112257872A (en) * 2020-10-30 2021-01-22 周世海 Target planning method for reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LANXIN LEI等: ""KB-Tree:Learnable and Continuous Monte-Carlo Tree Search for Autonomous Driving Planning"", 《IEEE》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023103692A1 (en) * 2021-12-07 2023-06-15 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for autonomous driving, electronic device, and computer storage medium
CN114694123A (en) * 2022-05-30 2022-07-01 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium
CN115731690A (en) * 2022-11-18 2023-03-03 北京理工大学 Unmanned public transportation cluster decision method based on graph neural network reinforcement learning
CN115731690B (en) * 2022-11-18 2023-11-28 北京理工大学 Unmanned public transportation cluster decision-making method based on graphic neural network reinforcement learning
CN115762169A (en) * 2023-01-06 2023-03-07 中通新能源汽车有限公司 Intelligent control system and method for unmanned driving of sanitation vehicle

Also Published As

Publication number Publication date
WO2023103692A1 (en) 2023-06-15

Similar Documents

Publication Publication Date Title
CN113879339A (en) Decision planning method for automatic driving, electronic device and computer storage medium
US11900797B2 (en) Autonomous vehicle planning
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
Koren et al. Efficient autonomy validation in simulation with adaptive stress testing
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
CN111982143A (en) Vehicle and vehicle path planning method and device
JP2022546283A (en) Occupancy prediction neural network
US11194327B2 (en) Systems and methods to control autonomous vehicle motion
CN109556609B (en) Artificial intelligence-based collision avoidance method and device
Li et al. An explicit decision tree approach for automated driving
CN112382165A (en) Driving strategy generation method, device, medium, equipment and simulation system
CN116331264A (en) Obstacle avoidance path robust planning method and system for unknown obstacle distribution
CN115204455A (en) Long-time-domain driving behavior decision method suitable for high-speed and loop traffic scene
Reda et al. Path planning algorithms in the autonomous driving system: A comprehensive review
CN114519931B (en) Method and device for predicting behavior of target vehicle in intersection environment
CN116448134B (en) Vehicle path planning method and device based on risk field and uncertain analysis
CN111310919A (en) Driving control strategy training method based on scene segmentation and local path planning
CN114104005B (en) Decision-making method, device and equipment of automatic driving equipment and readable storage medium
CN110887503B (en) Moving track simulation method, device, equipment and medium
Vaněk et al. Multi-goal trajectory planning with motion primitives for hexapod walking robot
JP2023531927A (en) Driving decision-making method, driving decision-making device, and chip
Xu et al. TrafficEKF: A learning based traffic aware extended Kalman filter
Ha et al. Vehicle control with prediction model based Monte-Carlo tree search
CN116674562A (en) Vehicle control method, device, computer equipment and storage medium
Wu et al. Smooth path planning method of agricultural vehicles based on improved Hybrid A

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220104