CN116957458A - Vehicle path determining method, device, equipment and storage medium - Google Patents

Vehicle path determining method, device, equipment and storage medium

Info

Publication number
CN116957458A
Authority
CN
China
Prior art keywords
training
vehicle
state information
path
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310955048.7A
Other languages
Chinese (zh)
Inventor
谢海琴
吴佳霖
何梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Digital Technology Shanghai Co Ltd
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Original Assignee
Haier Digital Technology Shanghai Co Ltd
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Digital Technology Shanghai Co Ltd, Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd filed Critical Haier Digital Technology Shanghai Co Ltd
Priority to CN202310955048.7A priority Critical patent/CN116957458A/en
Publication of CN116957458A publication Critical patent/CN116957458A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083Shipping
    • G06Q10/0835Relationships between shipper or supplier and carriers
    • G06Q10/08355Routing methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/087Inventory or stock management, e.g. order filling, procurement or balancing against orders


Abstract

The invention discloses a vehicle path determining method, device, equipment and storage medium. The method includes: determining the information of each customer, the capacity of each vehicle and the position information of the warehouse, and inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the model. The path information of each vehicle includes the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded. The vehicle path determination model is a model trained with the relevant statistics of the training path of each training vehicle as state information. The method determines vehicle paths efficiently.

Description

Vehicle path determining method, device, equipment and storage medium
Technical Field
The present invention relates to the field of logistics technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a vehicle path.
Background
The vehicle routing problem has long been a research hotspot in the field of logistics scheduling: proper vehicle path planning distributes logistics resources effectively and thus reduces logistics transportation cost. The vehicle routing problem (Vehicle Routing Problem, VRP) was first proposed in the 1960s; its basic form is the capacitated vehicle routing problem (Capacitated Vehicle Routing Problem, CVRP). The goal of this problem is to minimize the total distance traveled by the vehicles serving all customer points while satisfying the vehicle capacity constraints, thereby reducing logistics distribution cost.
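As an illustrative aside (not part of the patent text), the CVRP objective just described can be sketched in a few lines of Python. All coordinates, demands and the capacity below are made-up example values; the depot is node 0 and customers are nodes 1 to 3:

```python
import math

# Hypothetical instance: depot at index 0, customers at indices 1..3.
coords = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (6.0, 8.0), 3: (-3.0, 4.0)}
demands = {1: 2, 2: 3, 3: 4}
vehicle_capacity = 5

def dist(a, b):
    (x1, y1), (x2, y2) = coords[a], coords[b]
    return math.hypot(x1 - x2, y1 - y2)

def route_distance(route):
    """Distance of one route: depot -> customers in order -> depot."""
    stops = [0] + list(route) + [0]
    return sum(dist(stops[i], stops[i + 1]) for i in range(len(stops) - 1))

def total_distance(routes):
    """CVRP objective: sum of route distances over all vehicle routes."""
    return sum(route_distance(r) for r in routes)

def capacity_ok(route):
    """CVRP constraint: the demand on a route must fit the vehicle."""
    return sum(demands[c] for c in route) <= vehicle_capacity

routes = [[1, 2], [3]]  # vehicle 1 serves customers 1 and 2; vehicle 2 serves 3
```

Minimizing `total_distance` over all capacity-feasible assignments of customers to routes is exactly the CVRP stated above.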
At present, algorithms such as the ant colony algorithm, the simulated annealing algorithm and the particle swarm algorithm can be adopted to solve the vehicle routing problem. All of these algorithms require the warehouse and customer sites during the training process.
However, because the training process uses specific warehouse and customer sites, any change of warehouse nodes or customer nodes makes a new iterative search necessary. The trained models are therefore poorly replicable, resulting in low efficiency in determining vehicle paths.
Disclosure of Invention
The invention provides a vehicle path determining method, device, equipment and storage medium, which are used to solve the technical problem in the related art that the poor replicability of a model leads to low efficiency in determining vehicle paths.
According to an aspect of the present invention, there is provided a vehicle path determining method including:
determining information of each customer, capacity of each vehicle and position information of a warehouse; wherein the information of each customer includes: position information of each customer and a service time window of each customer;
inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the vehicle path determination model; wherein the path information of each vehicle includes: the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded; the vehicle path determination model is a model trained by a reinforcement learning algorithm with reducing the total distance of the training paths of the training vehicles as the objective, wherein during training the relevant statistics of the training path of each training vehicle are used as state information, and the reinforcement learning algorithm adjusts the state information according to heuristic operators.
According to another aspect of the present invention, there is provided a vehicle path determining apparatus including:
the first determining module is used for determining information of each customer, capacity of each vehicle and position information of the warehouse; wherein the information of each customer includes: position information of each customer and a service time window of each customer;
the second determining module is used for inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the vehicle path determination model; wherein the path information of each vehicle includes: the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded; the vehicle path determination model is a model trained by a reinforcement learning algorithm with reducing the total distance of the training paths of the training vehicles as the objective, wherein during training the relevant statistics of the training path of each training vehicle are used as state information, and the reinforcement learning algorithm adjusts the state information according to heuristic operators.
According to still another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the vehicle path determination method according to any one of the embodiments of the present invention.
According to yet another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a vehicle path determining method according to any one of the embodiments of the present invention.
The technical scheme of the embodiment of the invention includes: determining information of each customer, capacity of each vehicle and position information of a warehouse, wherein the information of each customer includes the position information of each customer and the service time window of each customer; and inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the model. The path information of each vehicle includes the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded. The vehicle path determination model is trained by a reinforcement learning algorithm with reducing the total distance of the training paths of the training vehicles as the objective; during training, the relevant statistics of the training path of each training vehicle serve as state information, and the reinforcement learning algorithm adjusts the state information according to heuristic operators.
The method has the following technical effects. First, the vehicle path determining method provided by this embodiment solves the CVRPTW: the determined path information exceeds neither the capacity of the vehicle nor the service time window of any target customer, so the planned paths meet actual requirements. Second, because the training process of the vehicle path determination model does not depend strongly on the position information of the warehouse and of the customers, and because the relevant statistics of the training path of each training vehicle are used as state information, the trained model is decoupled from those positions; even when the actual warehouse and customer positions differ from those used during training, the model can still be used for path planning. The generalization capability and replicability of the model are therefore strong, which improves the efficiency of determining vehicle paths. Third, reinforcement learning algorithms handle sequence optimization problems efficiently, so the vehicle path determination model trained by a reinforcement learning algorithm further improves the efficiency of outputting the path information of the vehicles.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of a vehicle path determining method according to an embodiment of the invention;
fig. 2 is a schematic diagram of an application scenario of a vehicle path determining method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a vehicle path output by a vehicle path determination method according to a first embodiment of the present invention;
fig. 4 is a flowchart of a vehicle path determining method according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of a training vehicle path determination model in the embodiment of FIG. 4;
Fig. 6 is a schematic structural view of a vehicle path determining apparatus provided according to an embodiment of the present invention;
fig. 7 is a schematic structural view of another vehicle path determining apparatus provided according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device implementing a vehicle path determination method of an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
It should be noted that the terms "target," "training," and the like in the description and claims of the present invention and in the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of a vehicle path determining method according to an embodiment of the present invention. The present embodiment is applicable to the case of determining a vehicle path in the logistical field, and the method may be performed by a vehicle path determining device, which may be implemented in hardware and/or software, and which may be configured in an electronic device, such as a computer device. As shown in fig. 1, the method comprises the steps of:
step 101: information of each customer, capacity of each vehicle, and location information of a warehouse are determined.
Wherein the information of each client includes: location information of each customer and service time window of each customer.
The information of each customer, the capacity of each vehicle, and the location information of the warehouse in this embodiment may be determined according to an actual transportation scenario.
A customer in this embodiment refers to a network point that requires a vehicle to transport items. Illustratively, in the field of logistics, a customer may be a transfer station, a distribution point, etc. Each customer in this embodiment corresponds to a service time window, which may be characterized by an earliest service start time and a latest service end time. There may be a plurality of customers in this embodiment.
The warehouse in this embodiment refers to a place where the article is stored. Illustratively, in the field of logistics, the warehouse herein may be a provincial distribution site or the like.
The position information of a customer and the position information of the warehouse in this embodiment may be characterized by longitude and latitude, or by an established coordinate system. For example, a coordinate system is established with the position of the warehouse as the origin, an arbitrary direction as the X-axis, and the direction perpendicular to the X-axis as the Y-axis. Alternatively, the coordinate system may be established in other ways, as long as the position information of the warehouse and the position information of the customers can be characterized.
The capacity of a vehicle in this embodiment refers to the maximum load or the maximum volume of the vehicle. The type of vehicle in this embodiment is not limited; it may be an electric vehicle, a fuel vehicle, etc. There may be a plurality of vehicles, and the capacities of the plurality of vehicles may be identical, partially identical, or entirely different. This embodiment does not limit this.
The application scenario of the vehicle path determining method in this embodiment is to transport the items in the warehouse to each customer by using the vehicle, and not exceed the capacity of the vehicle and the service time window of each customer during the transportation.
Fig. 2 is a schematic diagram of an application scenario of a vehicle path determining method according to an embodiment of the present invention. As shown in fig. 2, point 21 represents the location of the warehouse and points 22 represent the locations of the customers. Fig. 2 shows a capacitated vehicle routing problem with time windows (Capacitated Vehicle Routing Problem with Time Windows, CVRPTW) involving one warehouse and 30 customer points. In this problem, a fleet of capacity-limited vehicles located at a warehouse must serve a group of customers with known demands, each with a specific time window. The goal is to determine a set of vehicle trips with the lowest total cost, i.e. the lowest sum of travel distances or travel times, such that each vehicle starts and ends at the warehouse, each customer is visited exactly once within its time window, and the total demand handled by any vehicle does not exceed its capacity. Assume the number of vehicles is 5. The vehicle path determining method provided in this embodiment then needs to transport the items in the warehouse to each customer using the 5 vehicles, without exceeding the capacity of any vehicle or the service time window of any customer during transportation.
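The time-window and capacity constraints of the CVRPTW described above can be sketched as a simple route feasibility check. This is an illustrative sketch, not the patent's implementation; the travel times, demands, service times and windows below are made-up example values (depot is node 0):

```python
# Hypothetical instance data for a two-customer route check.
travel_time = {(0, 1): 10, (1, 2): 15, (2, 0): 20}
demand = {1: 3, 2: 4}
service_time = {1: 5, 2: 5}
window = {1: (10, 30), 2: (35, 50)}  # (earliest, latest) service start per customer
capacity = 10

def route_feasible(route, depart=0):
    """A route is feasible if the load fits the vehicle and every customer
    can be served within its time window; arriving early means waiting."""
    if sum(demand[c] for c in route) > capacity:
        return False
    t, prev = depart, 0
    for c in route:
        t += travel_time[(prev, c)]
        earliest, latest = window[c]
        t = max(t, earliest)   # wait until the window opens if early
        if t > latest:         # window missed: route infeasible
            return False
        t += service_time[c]
        prev = c
    return True
```

For example, `route_feasible([1, 2])` holds with the values above, while departing too late makes the first window unreachable.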
Step 102: and inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the vehicle path determination model.
The path information of each vehicle includes the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded. The vehicle path determination model is a model trained by a reinforcement learning algorithm with reducing the total distance of the training paths of the training vehicles as the objective, with the relevant statistics of the training path of each training vehicle used as state information during training. During training, the reinforcement learning algorithm adjusts the state information according to heuristic operators.
In this embodiment, the vehicle path determination model is trained in advance using a reinforcement learning algorithm; for example, the Deep Q-Network (DQN) algorithm may be employed. Besides using the reduction of the total distance of the training paths as the objective, the model training process in this embodiment uses the relevant statistics of the training path of each training vehicle as state information, and during training the reinforcement learning algorithm adjusts the state information according to heuristic operators. A training vehicle in this embodiment refers to a vehicle used in the training process, and a training path refers to a path determined for the training vehicle during training.
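As an illustrative sketch only: DQN approximates the Q-learning update with a neural network, but the underlying update rule can be shown in tabular form, with states standing for (discretized) path statistics, actions for heuristic operators, and the reward for the reduction in total distance achieved by an operator. The operator names, state labels and reward value below are assumptions for illustration, not taken from the patent:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9          # learning rate and discount factor (illustrative)
Q = defaultdict(float)           # Q[(state, action)] value table

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

operators = ["swap", "relocate", "two_opt"]  # hypothetical heuristic operators
# One step: applying "two_opt" in state "s0" reduced the total distance by 12.
q_update("s0", "two_opt", reward=12.0, next_state="s1", actions=operators)
```

A DQN replaces the `Q` table with a network mapping the state vector to one value per operator, trained on the same target.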
The relevant statistics of the training path of each training vehicle may be, for example, the total distance of the training path, the average demand of the customers served by the training vehicle, the average service duration of those customers, the average time window length of the training vehicle, etc. The important difference between the model training process in this embodiment and that in the related art is: during training, the dependence on the position information of the warehouse and of the customers is weak; that position information is only used as input, while the relevant statistics of the training path of each training vehicle are used as state information. The trained vehicle path determination model is thus decoupled from the position information of the warehouse and of the customers, and the model can be used for path planning even when the actual warehouse and customer positions differ from those used during training. The generalization capability and replicability of the model are high, so the efficiency of determining vehicle paths is improved.
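A minimal sketch of assembling such a statistics-based state vector (illustrative values and function name; the patent lists the statistics but does not fix a representation):

```python
def state_vector(route_distances, demands, service_times, windows):
    """State = solution statistics, decoupled from warehouse/customer coordinates:
    total path distance, average demand, average service duration,
    and average time-window length."""
    total_distance = sum(route_distances)
    avg_demand = sum(demands) / len(demands)
    avg_service = sum(service_times) / len(service_times)
    avg_window = sum(end - start for start, end in windows) / len(windows)
    return (total_distance, avg_demand, avg_service, avg_window)

# Illustrative two-route solution serving four customers.
state = state_vector(
    route_distances=[30.0, 20.0],
    demands=[2, 4, 3, 3],
    service_times=[5, 5, 10, 10],
    windows=[(10, 30), (20, 50), (0, 40), (15, 35)],
)
```

Because none of these statistics mention coordinates, the same state space works for any warehouse or customer layout, which is the decoupling the embodiment relies on.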
The path information of each vehicle determined in this embodiment includes the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded. One possible representation of the path information is: the load of vehicle A is A1, and the transportation sequence is (customer X1, 9:00) - (customer X2, 9:20) - (customer X3, 10:00), where A1 is less than the capacity of vehicle A, 9:00 is the arrival time at customer X1, 9:20 is the arrival time at customer X2, and 10:00 is the arrival time at customer X3. The service time window of customer X1 is (8:50-9:10), that of customer X2 is (9:10-9:30), and that of customer X3 is (9:50-10:10). In this embodiment, not exceeding the service time window of a target customer means arriving no earlier than the earliest service time and no later than the latest service time of that customer. A target customer refers to a customer corresponding to a vehicle; it will be appreciated that the target customers are a subset of all customers.
Fig. 3 is a schematic diagram of a vehicle path output by a vehicle path determining method according to an embodiment of the invention. As shown in FIG. 3, the warehouse 31 and the locations of the customers are shown. The number of customers is 14; for ease of description, each customer is numbered as shown, and the service time window of each customer is also shown. The number of vehicles is 4. One possible set of path information is as follows. The path of vehicle B1 is: depart from the warehouse - customer 1 - customer 2 - customer 3 - customer 4 - return to the warehouse. The path of vehicle B2 is: depart from the warehouse - customer 5 - customer 6 - customer 7 - return to the warehouse. The path of vehicle B3 is: depart from the warehouse - customer 9 - customer 8 - customer 10 - customer 11 - return to the warehouse. The path of vehicle B4 is: depart from the warehouse - customer 12 - customer 13 - customer 14 - return to the warehouse. Naturally, the load of each vehicle does not exceed its capacity, and no service time window of any target customer is exceeded.
Optionally, the vehicle path determination model in this embodiment is also trained with reducing the total waiting time of each training vehicle as an objective. The vehicle path information output by a model obtained in this way not only achieves the shortest total distance but also minimizes the waiting time of the vehicles, thereby achieving vehicle load balance.
Optionally, the information of a customer in this embodiment further includes: the demand of the customer and the service duration of the customer. The demand of a customer refers to the volume, weight or quantity of the items the customer needs. The service duration of a customer refers to the time the vehicle needs to spend serving the customer, i.e., the dwell time of the vehicle at the customer. Vehicle path information obtained in this way also satisfies the demand and service duration of each customer, and can meet logistics scenarios with finer practical requirements.
Because reinforcement learning algorithms can effectively overcome various disturbances in the system, the vehicle path determination model in this embodiment has strong anti-interference capability, and the accuracy of the output vehicle path information is high. In addition, since reinforcement learning algorithms more easily achieve transfer optimization between similar optimization problems, the vehicle path determination model in this embodiment also more easily achieves transfer optimization of similar optimization problems.
The vehicle path determining method provided in this embodiment includes: determining information of each customer, capacity of each vehicle and position information of a warehouse, wherein the information of each customer includes the position information of each customer and the service time window of each customer; and inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the model. The path information of each vehicle includes the transportation sequence of the target customers corresponding to that vehicle, determined on the premise that the capacity of the vehicle and the service time window of each target customer are not exceeded. The vehicle path determination model is trained by a reinforcement learning algorithm with reducing the total distance of the training paths of the training vehicles as the objective; during training, the relevant statistics of the training path of each training vehicle serve as state information, and the reinforcement learning algorithm adjusts the state information according to heuristic operators.
The method has the following technical effects. First, the vehicle path determining method provided by this embodiment solves the CVRPTW: the determined path information exceeds neither the capacity of the vehicle nor the service time window of any target customer, so the planned paths meet actual requirements. Second, because the training process of the vehicle path determination model does not depend strongly on the position information of the warehouse and of the customers, and because the relevant statistics of the training path of each training vehicle are used as state information, the trained model is decoupled from those positions; even when the actual warehouse and customer positions differ from those used during training, the model can still be used for path planning. The generalization capability and replicability of the model are therefore strong, which improves the efficiency of determining vehicle paths. Third, reinforcement learning algorithms handle sequence optimization problems efficiently, so the vehicle path determination model trained by a reinforcement learning algorithm further improves the efficiency of outputting the path information of the vehicles.
Fig. 4 is a flowchart of a vehicle path determining method according to a second embodiment of the present invention. The vehicle path determining method provided in this embodiment describes in detail how to obtain the vehicle path determining model based on the embodiment shown in fig. 1 and various alternative implementations. As shown in fig. 4, the method comprises the steps of:
step 401: and training to obtain a vehicle path determination model by using a reinforcement learning algorithm.
FIG. 5 is a schematic diagram of a training vehicle path determination model in the embodiment of FIG. 4. As shown in fig. 5, the process of training the vehicle path determination model using the reinforcement learning algorithm includes the following steps.
Step 4011: training data is determined.
The training data includes: information of each training customer, the capacity of each training vehicle, and the position information of the training warehouse. The information of each training customer includes: the position information, the service time window, the demand, and the service duration of each training customer.
The training clients in this embodiment refer to clients used in the training process, and the training vehicles refer to vehicles used in the training process.
Alternatively, the training data in this embodiment may be instance data related to the Solomon (1987) benchmark. Instance-related features support the algorithm design through the basic information of the instance data. The values of these features are deterministic and, once an instance is given, do not change during the algorithm design. The definition of the training data is shown in Table 1 below.
Table 1 training data
Step 4012: and acquiring initial state information according to the training data.
Wherein the initial state information includes: relevant statistics of the initial path of each training vehicle.
Alternatively, the initial state information may be generated randomly based on the training data, or generated by applying heuristic operators to the training data.
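By way of illustration only, the random generation described above may be sketched as follows; the function name, the first-fit assignment rule, and the data layout are assumptions for this sketch and not part of the embodiment:

```python
import random

def random_initial_routes(customers, demands, num_vehicles, capacity, seed=0):
    """Randomly assign customers to training vehicles without exceeding capacity.

    customers: list of customer ids; demands: dict id -> demand.
    Returns a list of routes (one list of customer ids per vehicle).
    """
    rng = random.Random(seed)
    order = customers[:]
    rng.shuffle(order)
    routes = [[] for _ in range(num_vehicles)]
    loads = [0] * num_vehicles
    for c in order:
        # place the customer in the first vehicle with spare capacity
        for k in range(num_vehicles):
            if loads[k] + demands[c] <= capacity:
                routes[k].append(c)
                loads[k] += demands[c]
                break
        else:
            raise ValueError("total demand exceeds fleet capacity")
    return routes
```

The resulting routes can then be summarized into the per-vehicle statistics that form the initial state information.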
Step 4013: and inputting the initial state information into the initial vehicle path determining model to obtain new state information and an initial heuristic operator which are output in the initial vehicle path determining model.
Wherein the new state information includes: relevant statistics of the updated path for each training vehicle. The initial heuristic operator is used for adjusting the initial state information to obtain new state information.
The initial vehicle path determination model in the present embodiment is an initial model established using the DQN (Deep Q-Network) algorithm.
The heuristic operator in this embodiment refers to a manner of adjusting the state information.
After the initial state information is input into the initial vehicle path determination model, the initial vehicle path determination model may adjust the initial state information based on the initial heuristic operator, thereby obtaining new state information.
In one implementation, the initial vehicle path determining model in this embodiment may obtain updated paths of all training vehicles according to the initial state information and the initial heuristic operator, and then determine the new state information from the updated paths of all vehicles. The updated paths and the new state information therefore correspond to each other: they are two representations of the same solution. In another implementation, the initial vehicle path determining model may determine the new state information from the initial state information and the initial heuristic operator, and then obtain the updated paths of all training vehicles from the new state information.
The state information (State) in the present embodiment comprises all relevant statistics of the paths of the vehicles.
State represents the state of an object (i.e., a path); it is designed to reflect, as much as possible, the influence of an operator (Action) on the object, so that it can help predict and select the next action. Since the problem solved in this embodiment is the CVRPTW, it is difficult to fix a confirmed range for the State values in advance: the State values depend entirely on the effect of each Action on the route distance of the solution. Meanwhile, to preserve the diversity of solutions when selecting operators, the State values need to keep changing within a certain interval over a certain period (especially while the optimal solution has not yet been obtained). State is therefore designed as follows (all of the following variables are statistics over the customers assigned to each vehicle).
TABLE 2 State information
Based on Table 2 above, it can be seen that the new state information includes: the total distance of the updated path of training vehicle k (i.e., the k-th vehicle in Table 2), the load of training vehicle k, the average demand of the customers of training vehicle k, the average service duration of the customers of training vehicle k, the average time window length of training vehicle k, the average time window overlap between the customers of training vehicle k, and the proportion of the customers of training vehicle k that are time-limited. Here k is an integer greater than 0 and not greater than K, where K is the total number of training vehicles. The content of the initial state information is similar and is not repeated here. It can be appreciated that, with K training vehicles, the state information may be a K×7 or 7×K matrix. The distance of a path refers to the total travel distance of training vehicle k when transporting along that path.
In Table 2 above, n_k denotes the total number of customers of training vehicle k. d_{i,k} denotes the end service time of the i-th customer of training vehicle k, and r_{i,k} denotes the start service time of the i-th customer of training vehicle k, where i is an integer greater than or equal to 1 and n_k is an integer greater than 0 and not greater than N, N being the total number of training customers. Likewise, d_{j,k} denotes the end service time and r_{j,k} the start service time of the j-th customer of training vehicle k, where j is an integer greater than or equal to 1. It can be appreciated that inter(i, j) computes the intersection of the service time windows of the corresponding customers of training vehicle k, and union(i, j) computes the union of the service time windows of the corresponding customers of training vehicle k.
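As an illustrative sketch (not part of the embodiment), the seven statistics of Table 2 could be computed per vehicle as follows. The exact overlap measure (here a pairwise Jaccard ratio built from inter and union) and the definition of "time-limited" (window length below a threshold) are assumptions, since Table 2 itself is not reproduced in this excerpt:

```python
from itertools import combinations

def route_state(route, dist, demand, service, window, tw_limit=1e9):
    """Seven per-vehicle statistics in the spirit of Table 2 (assumed forms).

    route: depot -> c1 -> ... -> cn -> depot, given as [0, c1, ..., cn, 0]
    dist: dict (i, j) -> travel distance; demand/service: dicts per customer
    window: dict customer -> (r, d) service time window
    """
    customers = route[1:-1]
    n = len(customers)                      # n_k in the text (assumed > 0)
    total = sum(dist[a, b] for a, b in zip(route, route[1:]))
    load = sum(demand[c] for c in customers)
    avg_dem = load / n
    avg_srv = sum(service[c] for c in customers) / n
    avg_twl = sum(window[c][1] - window[c][0] for c in customers) / n

    def jac(a, b):
        # inter(i, j) / union(i, j) for two time windows (assumed definition)
        (ra, da), (rb, db) = window[a], window[b]
        inter = max(0.0, min(da, db) - max(ra, rb))
        union = max(da, db) - min(ra, rb)
        return inter / union if union else 0.0

    pairs = list(combinations(customers, 2))
    avg_ovl = sum(jac(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    tight = sum(1 for c in customers
                if window[c][1] - window[c][0] <= tw_limit) / n
    return [total, load, avg_dem, avg_srv, avg_twl, avg_ovl, tight]
```

Stacking this vector for each of the K vehicles yields the K×7 state matrix mentioned above.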
Step 4014: and determining the rewarding value according to the new state information and the reference state information.
Wherein the reward value is proportional to the degree of decrease of the total distance of the updated paths of all training vehicles corresponding to the new state information relative to the total distance of the reference paths of all training vehicles corresponding to the reference state information. The reference state information is the state information corresponding to the first decrease in total distance during the training process, or the initial state information.
In practical applications, which of the two is used as the reference state information can be set according to requirements.
That the reward value is proportional to the degree of decrease of the total distance of the updated paths of all training vehicles corresponding to the new state information, relative to the total distance of the reference paths of all training vehicles corresponding to the reference state information, means that the greater the decrease, the greater the reward value.
The total distance of the update paths of all the training vehicles in this embodiment refers to the sum of the distances of the update paths of all the training vehicles. For example, assume that there are 4 training vehicles: c1, C2, C3 and C4. The distance of the update path corresponding to C1 is L1, the distance of the update path corresponding to C2 is L2, the distance of the update path corresponding to C3 is L3, and the distance of the update path corresponding to C4 is L4. The total distance of the update paths of all the training vehicles in the present embodiment is l1+l2+l3+l4.
In one implementation, when the total distance of the reference paths of all training vehicles corresponding to the reference state information is greater than the total distance of the updated paths of all training vehicles corresponding to the new state information, the reward value is the difference between the two; when the former is smaller than or equal to the latter, the reward value is a preset value smaller than or equal to zero.
More specifically, the reward value is also proportional to the degree of decrease of the total waiting time of all training vehicles corresponding to the new state information relative to the total waiting time of all training vehicles corresponding to the reference state information, and to the degree of decrease of the standard deviation of the distance sequence formed by the distances of the updated paths of all training vehicles corresponding to the new state information relative to the standard deviation of the distance sequence formed by the distances of the reference paths of all training vehicles corresponding to the reference state information. That is, the greater the decrease, the greater the reward value.
The waiting time of the training vehicle may be determined based on the time the training vehicle arrives at the customer and the earliest start-up service time of the customer. For example, the wait time may be the difference between the earliest start of service time of the customer and the time the training vehicle arrives at the customer. The total waiting time of all the training vehicles refers to the sum of the waiting times of all the training vehicles. The distance sequence formed by the distances of the updated paths of all the training vehicles refers to a distance sequence formed by arranging the distances of the updated paths of all the training vehicles randomly or according to a preset rule. For example, assume that there are 4 training vehicles: c1, C2, C3 and C4. The distance of the update path corresponding to C1 is L1, the distance of the update path corresponding to C2 is L2, the distance of the update path corresponding to C3 is L3, and the distance of the update path corresponding to C4 is L4. The distance sequence formed by the distances of the update paths of all the training vehicles in the present embodiment is (L1, L2, L3, L4).
Alternatively, step 4014 may be implemented as follows. The total distance of the updated paths of all training vehicles corresponding to the new state information, the standard deviation of the distance sequence formed by the distances of those updated paths, and the total waiting time of all training vehicles corresponding to the new state information are combined by a preset processing into the objective function corresponding to the new state information. Likewise, the total distance of the reference paths of all training vehicles corresponding to the reference state information, the standard deviation of the distance sequence formed by the distances of those reference paths, and the total waiting time of all training vehicles corresponding to the reference state information are combined by the preset processing into the objective function corresponding to the reference state information. If the objective function corresponding to the new state information is smaller than the objective function corresponding to the reference state information, the reward value is determined as the difference between the objective function corresponding to the reference state information and the objective function corresponding to the new state information; if the objective function corresponding to the new state information is greater than or equal to the objective function corresponding to the reference state information, the reward value is determined as a preset value smaller than or equal to zero.
The above-mentioned preset processing may be a summation. The objective function corresponding to the new state information is then objective_t = dist_t + diststd_t + waittime_t, where dist_t denotes the total distance of the updated paths of all training vehicles corresponding to the new state information, diststd_t denotes the standard deviation of the distance sequence formed by the distances of those updated paths, and waittime_t denotes the total waiting time of all training vehicles corresponding to the new state information. The objective function corresponding to the reference state information is objective_0 = dist_0 + diststd_0 + waittime_0, where dist_0 denotes the total distance of the reference paths of all training vehicles corresponding to the reference state information, diststd_0 denotes the standard deviation of the distance sequence formed by the distances of those reference paths, and waittime_0 denotes the total waiting time of all training vehicles corresponding to the reference state information. If the objective function corresponding to the new state information is smaller than the objective function corresponding to the reference state information, the reward value Reward = objective_0 - objective_t; if it is greater than or equal to the objective function corresponding to the reference state information, the reward value is a preset value smaller than or equal to zero. The preset value may be, for example, zero.
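The objective and reward computation described above may be sketched as follows; the function names are illustrative, the population standard deviation and a zero preset value are assumptions:

```python
import statistics

def objective(route_dists, wait_times):
    """objective = dist + diststd + waittime, per the preset summation."""
    dist = sum(route_dists)
    # population std of the per-vehicle distance sequence (assumed variant)
    diststd = statistics.pstdev(route_dists)
    return dist + diststd + sum(wait_times)

def reward(new_dists, new_waits, ref_dists, ref_waits, floor=0.0):
    """Positive reward only when the objective improves on the reference."""
    obj_new = objective(new_dists, new_waits)
    obj_ref = objective(ref_dists, ref_waits)
    return obj_ref - obj_new if obj_new < obj_ref else floor
```

For instance, reference distances (4, 4) with waits (1, 1) give objective 10; updated distances (3, 3) with waits (1, 0) give objective 7, so the reward is 3.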
Step 4015: and determining an updated initial vehicle path determination model according to the rewards value.
After determining the reward value, the initial vehicle path determination model may be updated using a stochastic gradient descent algorithm to obtain an updated initial vehicle path determination model.
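As an illustrative sketch of one stochastic-gradient update step, here with a simple linear Q-function standing in for the DQN network (the approximator, learning rate, and discount factor are all assumptions of this sketch):

```python
def q_value(w, state, action):
    """Linear Q(s, a): one weight vector per action (illustrative)."""
    return sum(wi * si for wi, si in zip(w[action], state))

def dqn_update(w, state, action, reward_v, next_state, lr=0.01, gamma=0.9):
    """One stochastic-gradient step on the squared TD error."""
    target = reward_v + gamma * max(
        q_value(w, next_state, a) for a in range(len(w)))
    td = q_value(w, state, action) - target
    # gradient of 0.5 * td^2 with respect to the chosen action's weights
    w[action] = [wi - lr * td * si for wi, si in zip(w[action], state)]
    return w
```

A full DQN additionally uses a replay buffer and a target network; those details are omitted here.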
Step 4016: if the preset training end condition is not satisfied, determining the new state information as updated initial state information, and returning to the step of executing step 4013.
Step 4017: and if the preset training ending condition is met, determining the updated initial vehicle path determining model as a vehicle path determining model.
In this embodiment, two types of operators are included: a first type of operator and a second type of operator. The adjustment degree of the second type operator on the initial state information is larger than that of the first type operator on the initial state information. Multiple operators may be included in each type of operator.
The initial heuristic operator and the updated initial heuristic operator in this embodiment may be a first type operator or a second type operator.
Two types of operators are described in detail below.
The first type of operator may also be referred to as a lifting operator. A lifting operator is used to keep improving the current solution: if it is decided to continue improving the current solution, a lifting operator is selected, based on the reinforcement learning method, to attempt the improvement. Illustratively, there may be the following six lifting operators, where m and n are integers greater than 0.
The second type of operator may also be referred to as a destroy operator. A destroy operator disrupts the current routes and regenerates them, after which lifting restarts from the new route result. Illustratively, there may be the following three destroy operators.
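As illustrative stand-ins (the six lifting and three destroy operators themselves are not listed in this excerpt), a relocate move and a shuffle perturbation could look like this:

```python
import random

def route_len(route, dist):
    """Total travel distance of one route under the distance table."""
    return sum(dist[a, b] for a, b in zip(route, route[1:]))

def relocate(route, dist):
    """Lifting-style move: re-insert one customer at its best position,
    keeping a move only if the route gets shorter (illustrative)."""
    best = route
    for i in range(1, len(route) - 1):
        rest = route[:i] + route[i + 1:]
        for j in range(1, len(rest)):
            cand = rest[:j] + [route[i]] + rest[j:]
            if route_len(cand, dist) < route_len(best, dist):
                best = cand
    return best

def destroy(route, rng):
    """Destroy-style move: shuffle the customer order to escape a local
    optimum (illustrative)."""
    mid = route[1:-1]
    rng.shuffle(mid)
    return [route[0]] + mid + [route[-1]]
```

A feasibility check against capacity and time windows would be applied to each candidate in a real implementation.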
In this embodiment, to avoid being trapped in a locally optimal solution, the updated initial heuristic operator may be determined according to the reward value: if all reward values corresponding to the new state information within a preset number of iterations are smaller than or equal to zero, a second type operator is determined as the updated initial heuristic operator; if any reward value corresponding to the new state information within the preset number of iterations is greater than zero, a first type operator is determined as the updated initial heuristic operator. The updated initial heuristic operator is used to adjust the new state information to obtain updated new state information.
It can be appreciated that the initial heuristic operator output by the initial vehicle path determination model in this embodiment may be either a first type operator or a second type operator. To avoid being trapped in a locally optimal solution, if the objective is no longer reduced after L lifting steps, i.e., the reward value is smaller than or equal to zero, a destroy (perturb) operation is performed and the lifting iteration is then restarted; that is, one destroy operator is selected as the updated initial heuristic operator.
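The perturbation rule described above (destroy after L consecutive non-improving lifting steps, then resume lifting) may be sketched as follows; the operator-selection policy here is uniformly random, whereas the embodiment selects operators via the learned DQN:

```python
import random

def hyper_heuristic(initial, lift_ops, destroy_ops, obj, steps=100,
                    patience=5, seed=0):
    """Lift until `patience` consecutive non-improving steps, then destroy
    and restart lifting (sketch of the perturbation rule in the text)."""
    rng = random.Random(seed)
    cur = best = initial
    stale = 0
    for _ in range(steps):
        if stale >= patience:
            cur = rng.choice(destroy_ops)(cur)   # escape the local optimum
            stale = 0
        else:
            cur = rng.choice(lift_ops)(cur)
        if obj(cur) < obj(best):
            best, stale = cur, 0
        else:
            stale += 1
    return best
```

Here `patience` plays the role of L, and `obj` stands for the objective function over a solution.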
Heuristic and meta-heuristic algorithms in the related art tend to yield solutions of low quality and easily fall into local optima. In contrast, the training process of the vehicle path determination model can use the two types of operators, applying the second type of operator, which adjusts the state more strongly, whenever the objective has not decreased after L lifting steps. The trained vehicle path determination model is therefore unlikely to get trapped in a local optimum, and the accuracy of the output vehicle path information is higher. Specifically, in the present embodiment, the total distance of the paths of all vehicles, the total waiting time of all vehicles, and the standard deviation of the distance sequence formed by the distances of the paths of all vehicles in the output path information are all smaller.
Step 402: information of each customer, capacity of each vehicle, and location information of a warehouse are determined.
Wherein the information of each client includes: location information of each customer and service time window of each customer.
Optionally, the information of the client further includes: the amount of demand of the customer and the service duration of the customer.
Step 403: and inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determination model to obtain the path information of each vehicle output by the vehicle path determination model.
Wherein the path information of each vehicle includes: the transportation sequence of the target customers corresponding to each vehicle, on the premise that neither the capacity of the vehicle nor the service time window of each target customer is exceeded. The vehicle path determining model is obtained by training with a reinforcement learning algorithm, taking the total distance of the training paths of the training vehicles as the optimization indicator and taking the relevant statistics of the training path of each training vehicle as the state information during training; during training, the reinforcement learning algorithm adjusts the state information according to heuristic operators.
Alternatively, the path information of the vehicle output by the vehicle path determination model in the present embodiment has the following features: the total distance of the paths of all vehicles, the total waiting time of all vehicles and the standard deviation of the distance sequence formed by the distances of the paths of all vehicles are all minimum.
A Hyper-Heuristic (HH) is, roughly, a "heuristic to search for heuristics". It mainly consists of two parts: high-level heuristics (HLH) and low-level heuristics (LLH). The LLH independently search the actual problem, while the HLH scientifically and reasonably selects among the LLH and judges whether to accept the generated solutions. HLH design mainly concerns two aspects: the selection strategy and the acceptance criterion for solutions. Because of advantages such as generality and high efficiency, hyper-heuristic algorithms have been widely applied in many fields. Given the excellent search performance of hyper-heuristics, and since reinforcement learning can learn to evaluate actions, the DQN algorithm is introduced into the high-level strategy design of the hyper-heuristic: a learning mechanism weighs the low-level heuristic operators against immediate and future rewards and punishments, a pool of better sequences is maintained, and a simulated-annealing acceptance criterion is used to guide the low-level operators more effectively toward high-quality solutions.
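The simulated-annealing acceptance criterion mentioned above may be sketched as the standard Metropolis rule (an assumption, since the exact criterion is not given in this excerpt):

```python
import math
import random

def sa_accept(obj_new, obj_cur, temperature, rng):
    """Metropolis acceptance: always take improvements, and accept a worse
    solution with probability exp(-delta / T) (sketch)."""
    if obj_new <= obj_cur:
        return True
    return rng.random() < math.exp(-(obj_new - obj_cur) / temperature)
```

At high temperature worse solutions are accepted often, preserving diversity; as the temperature cools, the search concentrates on improvements.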
The advantage of a reinforcement learning algorithm is that the agent can obtain an optimal policy through interaction with the environment, which makes it possible to solve large-scale, complex CVRPTW instances; its limitations are a long execution time and a model that does not converge easily. By fusing it with the hyper-heuristic algorithm, the reinforcement learning algorithm learns the heuristic search operators (i.e., the operators in this embodiment), which speeds up solving and improves solution quality. Compared with a manual search strategy, it can effectively improve search efficiency and enlarge the search space, and it has very strong optimization and generalization capability.
An advantage of the invention is that the reinforcement-learning-based hyper-heuristic algorithm proposed for the CVRPTW problem may comprise six lifting operators and three destroy operators. The DQN algorithm from reinforcement learning is introduced; combining the idea of the Q-learning algorithm with the structure of a neural network, it evaluates the performance of the low-level operators in each state of the solution, conditionally accepts non-improving solutions, guides the algorithm out of local optima, and keeps searching the space of high-quality solutions.
According to the vehicle path determining method provided above, on the one hand, the trained vehicle path determining model does not get trapped in a local optimum and the accuracy of the output vehicle path information is higher; on the other hand, in the path information output by the model, the total distance of the paths of all vehicles, the total waiting time of all vehicles, and the standard deviation of the distance sequence formed by the distances of the paths of all vehicles are all smaller. Beyond minimizing the total distance and total waiting time, a smaller standard deviation of the distance sequence means a smaller dispersion of the path distances of the vehicles, which amounts to load balancing across the vehicles.
Fig. 6 is a schematic structural view of a vehicle path determining apparatus provided according to an embodiment of the present invention. The device can be arranged in electronic equipment such as computer equipment and the like. As shown in fig. 6, the vehicle path determining apparatus provided by the present embodiment includes the following modules: the first determination module 61 and the second determination module 62.
A first determining module 61 for determining information of each customer, capacity of each vehicle, and location information of a warehouse.
Wherein the information of each client includes: location information of each customer and service time window of each customer.
And a second determining module 62, configured to input the location information of each customer, the service time window of each customer, the capacity of each vehicle, and the location information of the warehouse into a pre-trained vehicle path determining model, and obtain path information of each vehicle output by the vehicle path determining model.
Wherein the path information of each vehicle includes: and a transportation order for each target customer corresponding to each vehicle, without exceeding the capacity of the vehicle and without exceeding a service time window for each target customer. The vehicle path determining model is a model which is obtained by adopting a reinforcement learning algorithm, taking the total distance of the training paths of the training vehicles as an index, and taking the relevant statistics of the training paths of each training vehicle as state information in the training process. In the training process, the reinforcement learning algorithm adjusts the state information according to heuristic operators.
Optionally, the vehicle path determination model is also a model trained with an indicator of reducing the total waiting time of each training vehicle.
Optionally, the information of the client further includes: the customer's demand and the customer's service duration.
The vehicle path determining device provided by the embodiment of the invention can execute the vehicle path determining method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Fig. 7 is a schematic structural view of another vehicle path determining apparatus provided according to an embodiment of the present invention. Based on the embodiment shown in fig. 6, this embodiment describes a further module included in the vehicle path determining apparatus. As shown in fig. 7, the vehicle path determining apparatus provided in the present embodiment further includes a training module 71.
The training module 71 is configured to train to obtain a vehicle path determination model using a reinforcement learning algorithm.
Training module 71 includes the following sub-modules: the system comprises a first determining sub-module, a first acquiring sub-module, a second determining sub-module, a third determining sub-module, a fourth determining sub-module, a return executing sub-module and a fifth determining sub-module.
And the first determination submodule is used for determining training data.
Wherein the training data comprises: information of each training client, capacity of each training vehicle, and location information of the training warehouse. The information of each training client comprises: the position information of each training client, the service time window of each training client, the demand of each training client and the service time length of each training client.
And the first acquisition sub-module is used for acquiring initial state information according to the training data.
Wherein the initial state information includes: relevant statistics of the initial path of each training vehicle.
And the second determining submodule is used for inputting the initial state information into the initial vehicle path determining model to obtain the new state information and the initial heuristic operator which are output in the initial vehicle path determining model.
Wherein the new state information includes: relevant statistics of the updated path for each training vehicle. The initial heuristic operator is used for adjusting the initial state information to obtain the new state information.
And the third determining submodule is used for determining the rewarding value according to the new state information and the reference state information.
Wherein the reward value is proportional to a degree of decrease in a total distance of update paths of all training vehicles corresponding to the new state information relative to a total distance of reference paths of all training vehicles corresponding to the reference state information. The reference state information is state information or initial state information of which the total distance is reduced for the first time in the training process.
And the fourth determining submodule is used for determining an updated initial vehicle path determining model according to the rewarding value.
And the return execution sub-module is used for determining the new state information as updated initial state information if the preset training ending condition is not met, and returning to the step of executing the second determination sub-module.
And a fifth determining sub-module, configured to determine the updated initial vehicle path determination model as the vehicle path determination model if the preset training end condition is satisfied.
In an embodiment, the reward value is further proportional to the degree of decrease of the total waiting time of all training vehicles corresponding to the new state information relative to the total waiting time of all training vehicles corresponding to the reference state information, and to the degree of decrease of the standard deviation of the distance sequence formed by the distances of the updated paths of all training vehicles corresponding to the new state information relative to the standard deviation of the distance sequence formed by the distances of the reference paths of all training vehicles corresponding to the reference state information.
In an embodiment, the third determining submodule is specifically configured to: combine, by a preset processing, the total distance of the updated paths of all training vehicles corresponding to the new state information, the standard deviation of the distance sequence formed by the distances of those updated paths, and the total waiting time of all training vehicles corresponding to the new state information into the objective function corresponding to the new state information; combine, by the preset processing, the total distance of the reference paths of all training vehicles corresponding to the reference state information, the standard deviation of the distance sequence formed by the distances of those reference paths, and the total waiting time of all training vehicles corresponding to the reference state information into the objective function corresponding to the reference state information; if the objective function corresponding to the new state information is smaller than the objective function corresponding to the reference state information, determine the reward value as the difference between the objective function corresponding to the reference state information and the objective function corresponding to the new state information; and if the objective function corresponding to the new state information is greater than or equal to the objective function corresponding to the reference state information, determine the reward value as a preset value smaller than or equal to zero.
Optionally, the initial heuristic operator includes a first type operator or a second type operator. The adjustment degree of the second type operator on the initial state information is larger than that of the first type operator on the initial state information.
Optionally, if it is determined that the reward values corresponding to the new state information are all smaller than or equal to zero within the preset iteration times, determining the second type operator as an updated initial heuristic operator; the updated initial heuristic operator is used for adjusting the new state information to obtain updated new state information. If any one of the reward values corresponding to the new state information is larger than zero in the preset iteration times, determining the first type operator as an updated initial heuristic operator.
Optionally, the new state information includes: the total distance of the updated path of training vehicle k, the load of training vehicle k, the average demand of the customers of training vehicle k, the average service duration of the customers of training vehicle k, the average time window length of training vehicle k, the average time window overlap between the customers of training vehicle k, and the proportion of the customers of training vehicle k that are time-limited. Here k is an integer greater than 0 and not greater than K, where K is the total number of training vehicles.
Optionally, the average time window length of the training vehicle k is determined by the following formula:

avg_len_k = (1/n_k) · Σ_{i=1}^{n_k} (d_{i,k} − r_{i,k})

where n_k represents the total number of customers of the training vehicle k, d_{i,k} represents the end service time of the i-th customer of the training vehicle k, r_{i,k} represents the start service time of the i-th customer of the training vehicle k, i is an integer greater than or equal to 1, and n_k is an integer greater than 0 and less than N, N being the total number of training customers.
Optionally, the average time window overlap between the customers of the training vehicle k is determined by the following formula:

avg_overlap_k = (2/(n_k(n_k − 1))) · Σ_{i<j} max(0, min(d_{i,k}, d_{j,k}) − max(r_{i,k}, r_{j,k}))

where d_{j,k} represents the end service time of the j-th customer of the training vehicle k, r_{j,k} represents the start service time of the j-th customer of the training vehicle k, and j is an integer greater than or equal to 1.
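Both statistics can be computed directly from each customer's start and end service times; in the sketch below, the mean window length follows the definition above, while the pairwise max(0, min(d_i, d_j) − max(r_i, r_j)) overlap form is an assumption, since the original formula image is not reproduced in the text.

```python
def avg_time_window_length(starts, ends):
    """Mean time-window length over a vehicle's customers:
    (1/n_k) * sum_i (d_{i,k} - r_{i,k})."""
    return sum(d - r for r, d in zip(starts, ends)) / len(starts)

def avg_time_window_overlap(starts, ends):
    """Mean pairwise overlap of the customers' time windows.

    The per-pair overlap max(0, min(ends) - max(starts)) is an assumed
    form; the total is averaged over the n_k*(n_k - 1)/2 customer pairs."""
    n = len(starts)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, min(ends[i], ends[j]) - max(starts[i], starts[j]))
    return total / (n * (n - 1) / 2)
```

For example, windows [0, 4] and [2, 6] have length 4 each and overlap on [2, 4], so the averages are 4.0 and 2.0 respectively.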
The vehicle path determining device provided by the embodiment of the present invention can execute the vehicle path determining method provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the executed method.
Fig. 8 is a schematic structural diagram of an electronic device implementing a vehicle path determination method of an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to one another via a bus 14, and an input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller. The processor 11 performs the various methods and processes described above.
In some embodiments, the vehicle path determination method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the vehicle path determination method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the vehicle path determination method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability of traditional physical hosts and VPS (virtual private server) services.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (13)

1. A vehicle path determination method, characterized by comprising:
determining information of each customer, capacity of each vehicle and position information of a warehouse; wherein the information of each client includes: position information of each client and service time window of each client;
inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determining model to obtain the path information of each vehicle output by the vehicle path determining model; wherein the path information of each vehicle includes the target customers served by the vehicle; the vehicle path determining model is a model obtained through training by a reinforcement learning algorithm, on the premise that the capacity of the vehicle is not exceeded and the service time window of each target customer is not exceeded, with reducing the total distance of the training paths of the training vehicles as the index; in the training process, the relevant statistics of the training path of each training vehicle serve as the state information, and the reinforcement learning algorithm adjusts the state information according to a heuristic operator.
2. The method of claim 1, wherein the vehicle path determination model is further a model trained on the basis of reducing a total waiting time of each training vehicle.
3. The method of claim 1, wherein the customer's information further comprises: the customer's demand and the customer's service duration.
4. A method according to claim 3, characterized in that the method further comprises:
determining training data; wherein the training data comprises: information of each training client, capacity of each training vehicle and position information of a training warehouse, wherein the information of each training client comprises: the position information of each training client, the service time window of each training client, the demand of each training client and the service time of each training client;
acquiring initial state information according to the training data; wherein the initial state information includes: correlation statistics for the initial path of each training vehicle;
inputting the initial state information into an initial vehicle path determining model to obtain new state information and an initial heuristic operator output by the initial vehicle path determining model; wherein the new state information includes: the relevant statistics of the updated path of each training vehicle; the initial heuristic operator is used for adjusting the initial state information to obtain the new state information;
determining a reward value according to the new state information and reference state information; wherein the reward value is in direct proportion to the degree of reduction of the total distance of the updated paths of all training vehicles corresponding to the new state information relative to the total distance of the reference paths of all training vehicles corresponding to the reference state information, and the reference state information is the state information for which the total distance was reduced for the first time in the training process, or the initial state information;
determining an updated initial vehicle path determination model according to the reward value;
if a preset training ending condition is not met, determining the new state information as updated initial state information, and returning to the step of inputting the initial state information into the initial vehicle path determining model to obtain the new state information and the initial heuristic operator output by the initial vehicle path determining model;
and if the preset training ending condition is met, determining the updated initial vehicle path determining model as the vehicle path determining model.
5. The method of claim 4, wherein the reward value is further proportional to a degree of reduction in total waiting time of all training vehicles corresponding to the new state information relative to total waiting time of all training vehicles corresponding to the reference state information, and a degree of reduction in standard deviation of a distance sequence formed by distances of update paths of all training vehicles corresponding to the new state information relative to a standard deviation of a distance sequence formed by distances of reference paths of all training vehicles corresponding to the reference state information.
6. The method of claim 5, wherein determining a prize value based on the new state information and the reference state information comprises:
subjecting the total distance of the updated paths of all the training vehicles corresponding to the new state information, the standard deviation of the distance sequence formed by the distances of the updated paths of all the training vehicles corresponding to the new state information, and the total waiting time of all the training vehicles corresponding to the new state information to preset processing, and using the result as the objective function corresponding to the new state information;

subjecting the total distance of the reference paths of all the training vehicles corresponding to the reference state information, the standard deviation of the distance sequence formed by the distances of the reference paths of all the training vehicles corresponding to the reference state information, and the total waiting time of all the training vehicles corresponding to the reference state information to the same preset processing, and using the result as the objective function corresponding to the reference state information;

if the objective function corresponding to the new state information is smaller than the objective function corresponding to the reference state information, determining the reward value as the difference between the objective function corresponding to the reference state information and the objective function corresponding to the new state information;

and if the objective function corresponding to the new state information is greater than or equal to the objective function corresponding to the reference state information, determining the reward value as a preset value less than or equal to zero.
7. The method of claim 5, wherein the initial heuristic operator comprises a first type of operator or a second type of operator, wherein the second type of operator adjusts the initial state information to a greater degree than the first type of operator adjusts the initial state information.
8. The method according to claim 7, wherein if the reward values corresponding to the new state information are all less than or equal to zero within a preset number of iterations, the second type operator is determined as the updated initial heuristic operator, the updated initial heuristic operator being used for adjusting the new state information to obtain updated new state information; and

if any reward value corresponding to the new state information is greater than zero within the preset number of iterations, the first type operator is determined as the updated initial heuristic operator.
9. The method according to any one of claims 4 to 8, wherein the new state information includes: the total distance of the updated path of the training vehicle k, the load of the training vehicle k, the average demand of the customers of the training vehicle k, the average service duration of the customers of the training vehicle k, the average time window length of the training vehicle k, the average time window overlap between the customers of the training vehicle k, and the proportion of the customers of the training vehicle k that are time-limited; where k is an integer greater than 0 and less than K, and K is the total number of training vehicles.
10. The method according to claim 9, wherein the average time window length of the training vehicle k is determined by the following formula:

avg_len_k = (1/n_k) · Σ_{i=1}^{n_k} (d_{i,k} − r_{i,k})

where n_k represents the total number of customers of the training vehicle k, d_{i,k} represents the end service time of the i-th customer of the training vehicle k, r_{i,k} represents the start service time of the i-th customer of the training vehicle k, i is an integer greater than or equal to 1, and n_k is an integer greater than 0 and less than N, N being the total number of training customers;
the average time window overlap between the customers of the training vehicle k is determined by the following formula:

avg_overlap_k = (2/(n_k(n_k − 1))) · Σ_{i<j} max(0, min(d_{i,k}, d_{j,k}) − max(r_{i,k}, r_{j,k}))

where d_{j,k} represents the end service time of the j-th customer of the training vehicle k, r_{j,k} represents the start service time of the j-th customer of the training vehicle k, and j is an integer greater than or equal to 1.
11. A vehicle path determining apparatus, characterized by comprising:
the first determining module is used for determining information of each customer, capacity of each vehicle and position information of the warehouse; wherein the information of each client includes: position information of each client and service time window of each client;
the second determining module is used for inputting the position information of each customer, the service time window of each customer, the capacity of each vehicle and the position information of the warehouse into a pre-trained vehicle path determining model to obtain the path information of each vehicle output by the vehicle path determining model; wherein the path information of each vehicle includes the target customers served by the vehicle; the vehicle path determining model is a model obtained through training by a reinforcement learning algorithm, on the premise that the capacity of the vehicle is not exceeded and the service time window of each target customer is not exceeded, with reducing the total distance of the training paths of the training vehicles as the index; in the training process, the relevant statistics of the training path of each training vehicle serve as the state information, and the reinforcement learning algorithm adjusts the state information according to a heuristic operator.
12. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the vehicle path determination method of any one of claims 1-10.
13. A computer readable storage medium storing computer instructions for causing a processor to perform the vehicle path determination method of any one of claims 1-10.
CN202310955048.7A 2023-07-31 2023-07-31 Vehicle path determining method, device, equipment and storage medium Pending CN116957458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310955048.7A CN116957458A (en) 2023-07-31 2023-07-31 Vehicle path determining method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116957458A true CN116957458A (en) 2023-10-27

Family

ID=88456407


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination