CN117474295B

CN117474295B - Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method

Info

Publication number: CN117474295B
Application number: CN202311805708.XA
Authority: CN
Inventors: 张秀梅; 李文松; 李慧; 刘芳; 刘方达
Original assignee: Changchun University of Technology
Current assignee: Changchun University of Technology
Priority date: 2023-12-26
Filing date: 2023-12-26
Publication date: 2024-04-26
Anticipated expiration: 2043-12-26
Also published as: CN117474295A

Abstract

The invention provides a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm, and relates to the field of automatic distribution of warehouse logistics in intelligent workshops. Production job data is collected at a plant and a Markov decision model is built based on the data. Setting training data samples, optimizing a neural network structure of data by using Dueling DQN algorithm, and modeling the cost function and the dominance function separately by using the output layer action cost function Q as the linear sum of the cost function and the dominance function, so that the intelligent agent can better process the state with smaller action association. And constructing a relationship between the reward and punishment function and the road network load, and integrating the path length and the road network load into the reward and punishment function. The task scheduling matching execution mechanism for constructing the difference between the attention state value and the action dominant value can be applied to the AGV task scheduling field of the warehouse workshop. Compared with the prior art, the method can efficiently optimize road network load, accurately match different states and action scheduling strategies, and greatly improve production efficiency.

Description

Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method

Technical Field

The invention belongs to the field of automatic distribution of warehouse logistics in intelligent workshops, and particularly relates to a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm.

Background

Under the large background of intelligent manufacturing, intelligent warehouse workshop job scheduling has become one of key links for improving resource utilization rate and thus improving enterprise operation benefits. How to avoid the conflict of a plurality of AGVs in the workshop road network, improve the scheduling efficiency of the intelligent workshop, rationalize the use of road network resources and be the key place for researching the problems of load balancing and task scheduling.

Currently, from the application scene, the research of the AGV scheduling problem can be divided into a path planning problem and a task allocation problem, the single AGV path planning problem only needs to consider how one AGV bypasses an obstacle to find an optimal path, the research of the problem is mature, a graph theory algorithm is usually adopted, and a plurality of heuristic algorithms are applied to the problem. However, in practical situations, a plurality of AGVs perform tasks simultaneously, mutual interference and collision can be generated among the AGVs, and the large-scale AGVs of the logistics factory can generate the congestion problem of the whole road network load. For the problem of load balancing, most heuristic algorithms do not have considerable stability, and the performance of the heuristic algorithms is directly dependent on the simplicity of the problem and the experience of an expert.

When the AGV is in scheduling, a plurality of scheduling rules exist, namely tasks are selected based on the arrival sequence, tasks are selected based on the shortest travel and tasks are selected based on the longest waiting time, and a designer selects different scheduling rules according to different requirements. The common algorithms include an A-algorithm, a genetic algorithm and a simulated annealing algorithm, and each AGV can be respectively scheduled in a single small-scale manner due to instability of the common algorithms. In order to realize the dispatching of large-scale AGVs, some people use deep reinforcement learning to conduct intelligent workshop task dispatching, a plurality of AGVs are placed in a road network to interact with the environment in real time, an intelligent body randomly selects an action according to the current task state, then the action is scored, different reward rules can be designed according to different user requirements, and iterative updating is conducted in sequence until the task is completed.

The deep learning DQN algorithm can obtain a better solution on the large-scale AGV scheduling problem, but the mere use of the algorithm can cause overestimation, so that the scheduling result has large error, and the trained model has poor effect. In order to improve the training effect of the model, the Nature DQN uses a new target Q network with the same structure to calculate the target Q value by using two identical neural networks to solve the correlation between the data sample and the network before training, but the accuracy problem of the target Q value is not guaranteed. The Double DQN algorithm can eliminate the problem of overestimation by decoupling the selection of the target Q action and the calculation of the target Q, but this algorithm requires a lot of experience to train, which is difficult to get at the beginning and thus the initial performance of the algorithm is poor.

Disclosure of Invention

In order to solve the problems, the embodiment of the invention provides a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm, which improves reward and punishment functions, combines path length and load coefficients with the reward and punishment functions, optimizes a road network high load area and improves overall operation smoothness. Optimizing the neural network structure of the output end, constructing task scheduling matching execution mechanisms of differences of the attention state value and the action dominant value under different conditions, linearly adding the attention state value function and the attention state-action dominant function to output an action value function Q, and finally selecting actions based on the designed scheduling spending time T, and constructing the task scheduling together with data in an experience pool.

The technical scheme adopted for solving the technical problems is as follows: a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm comprises the following steps:

step S1: collecting intelligent warehouse workshop operation data, preprocessing, constructing a Markov decision model, randomly initializing all states and values Q corresponding to actions, initializing all network parameters, emptying a experience playback set D, and completing data state modeling; randomly extracting data after state modeling, initializing a state S to be the first state of a current state sequence, and obtaining a characteristic vector of the data ；

Step S2: use in Q networksAs input, adding two sub-network structures in front of the output layer of the neural network, and obtaining an output action cost function Q by linearly adding a state-based cost function V and a state-based dominance function AF;

step S3: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method to obtain time T spent on scheduling based on the action A, storing the time T in a set, and respectively evaluating the optimal actions when the time T is the same and different;

Step S4: designing a reward and punishment function, combining the path length and the road network load with the reward and punishment function to balance the road network load, and executing the current action A in the state S to obtain a feature vector corresponding to the new state S' The prize R and whether the state end is terminated, will {，A，R，End five-tuple is stored in experience playback set D;

Step S5: sampling m samples from the experience playback set D to calculate a current target Q value y _j, and updating all parameters w of the Q network by gradient back propagation of the neural network by using a mean square error loss function;

Step S6: if S' is the termination state, repeating the steps S2-S5 until the task is completed.

The beneficial effects of the invention are as follows:

1. According to the invention, by adopting Dueling DQN algorithm in the deep reinforcement learning method, the state cost function and the advantage function are separately modeled, and the AGV can better process the state with smaller relation with the action when dispatching under the actual condition. When no other vehicles exist around the AGV, only the state is concerned, when other vehicles exist, the difference of the dominant values of different actions is concerned, and finally, the action cost function Q is output by linear addition of the cost function of the concerned state and the dominant function of the concerned state-action.

2. The method aims at the problem of road network load congestion in AGV scheduling, designs a dynamic reward and punishment function, considers load factors in the reward and punishment function in the deep reinforcement learning iterative process, puts forward the combination of path length and road network load as the reward and punishment function, adjusts the reward value according to the real-time change of the road network congestion, finally balances the road network load, and avoids the problems of too slow speed and path conflict caused by a large number of AGVs due to the road network congestion.

Drawings

FIG. 1 is a flow chart of road network load balancing and task scheduling;

FIG. 2 is a Dueling DQN algorithm model diagram;

FIG. 3 is a flow chart of an AGV scheduling strategy;

FIG. 4 is a flow chart of road network load balancing;

fig. 5 is a task scheduling block diagram based on Dueling DQN algorithm.

Detailed Description

This embodiment is described in order to better explain the present embodiment. Some of the figures may be omitted, enlarged or reduced in size and do not represent actual dimensions;

It will be appreciated by those skilled in the art that the descriptions of certain well-known matters in the drawings may be omitted to some extent;

The invention provides a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm, and relates to intelligent warehouse workshop job scheduling. Vehicle track and operation data are collected in an intelligent warehouse workshop as historical big data, and a Markov decision model is built based on the collected data. Setting training data samples, optimizing a neural network structure of data by using Dueling DQN algorithm, and modeling the cost function and the dominance function separately by taking the action cost function Q of the output layer as the linear sum of the cost function and the dominance function, so that the intelligent agent can better process the state with smaller action association. The link between the reward and punishment function and the road network load is constructed, and the path length and the road network load are integrated into the reward and punishment function, so that the road network congestion problem can be better solved. And constructing task scheduling matching execution mechanisms focusing on the difference of state values and action dominance values under different conditions, performing scheduling matching according to different states and actions, and timely generating scheduling rules to guide next-step operation so as to rapidly realize task scheduling. The algorithm can efficiently optimize road network load, accurately match scheduling strategies of different states and actions, greatly save time cost and improve production efficiency.

The invention is described in further detail below with reference to the drawings and examples.

As shown in fig. 1: the embodiment of the invention provides a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm, which comprises the following steps:

In one embodiment, the step S1 includes: collecting intelligent warehouse workshop operation data, preprocessing, constructing a Markov decision model, randomly initializing all states and values Q corresponding to actions, initializing all network parameters, emptying a experience playback set D, and completing data state modeling; randomly extracting data after state modeling, initializing a state S to be the first state of a current state sequence, and obtaining a characteristic vector of the data；

Step S11: preprocessing and classifying the acquired AGV data, defining a strategy pi based on an established Markov decision model and formulating a cost function;

And collecting AGV production operation data in the intelligent warehouse workshop, and classifying the data according to different task requirements. Preprocessing the acquired data, and processing the data of different categories through data cleaning, data integration, data protocol and data transformation respectively.

A markov decision model is built, the definition of the markov decision process is < S, a, P, R, γ >, where state S represents a set of all states S, action a represents all sets related to action a in the decision process, P represents a conditional probability of selecting action a at state S, R represents cumulative rewards, and the final goal is to achieve rewards maximization, γ represents a discount factor. The transmission probability matrix and the reward function are used for defining, and the specific formula is as follows:

After defining the transmission probability matrix and the reward function, in the Markov decision model, the intelligent agent can choose to act according to the state, and the final choice must be that the environment is better and better. Policies are actions expressed by an agent for an environment, and the definition of a policy pi is shown in the following formula:

The policy pi is related to the current state only, history, and time. After defining the strategy of the Markov model, a cost function is formulated based on the strategy, and the cost function is respectively a state cost function and an action cost function, which are shown in the following formula:

v _π(s) represents the expectation of benefit in the state s, representing the value brought by the state; q _π (s, a) represents the expectation of revenue after taking action a in state s, representing the value brought by the action; g _t represents the sum of the cumulative rewards generated by the agent when interacting with the environment.

Step S12: completing state modeling of data and solving a Belman equation by using value iteration;

The collected production operation data are classified, link integration is carried out according to the logic of workshop operation production, and node states are manufactured for different divided objects respectively. The objects are divided into five classes, respectively: the system comprises an AGV normal running state, an AGV congestion stagnation state, a single cycle end efficiency value state, a same material reworking state and a road network congestion level state. For the above five classes of objects, the status classification is based on the AGVs, respectively. Target object state < target object 1 state, target object 2 state, target object 3 state, Based on the definition, the state of the target object i is converted into a value function based on a strategy by utilizing the Belman equation, the Belman equation is solved through value iteration, and the optimal solution is solved.

All different states of the target object can be linked to form the intelligent manufacturing scheduling system of the whole workshop. And mapping the multidimensional data of each state into an experience storage unit, extracting all data states of the AGV running in a certain time sequence in the production process according to the relation of the time sequence, mapping all state nodes one by one and numbering to obtain the running state of the AGV of the whole dispatching system. And linking the state data information of each dimension in the road network into a whole by using the established Markov decision model, and finally obtaining the AGV data in the complete Markov state. In a practical environment, an average speed and a rest speed are set for a plurality of AGVs, the rest speed is set to v=0 m/s, and the average speed is set to v=1 m/s. Still install infrared sensor for every AGV to every AGV can perception obstacle and other AGVs, realizes better obstacle avoidance, reduces load scheduling time.

In one embodiment, the step S2 includes: and initializing all states and values Q corresponding to the actions at random, initializing all parameters of the network, and emptying a set D of experience playback. Randomly extracting data after state modeling, initializing a state S to be the first state of a current state sequence, and obtaining a characteristic vector of the data；

Step S13: randomly initializing all parameters w of a Q network, initializing values Q corresponding to all states and actions based on the w, and emptying a set D of experience playback;

A pre-training dataset is prepared, the dataset including information of status, actions, rewards, and next status. A neural network of the same structure as the Q network is defined as a pre-training network, which does not need to output the Q value, but only the feature vector (or state representation), where a smaller output layer is used. In the pre-training network, the input state vector is mapped to a low-dimensional feature vector. The specific method uses convolutional neural networks for feature extraction. Taking the obtained feature vector as input, training a logistic regression classifier to predict the next state. And taking the parameters of the pre-training network as the initialization parameters w of the Q network, and performing reinforcement learning training. The Q network parameter w is initialized through a pre-training method, so that the Q network starts training from a better starting point, the network convergence speed is increased, and finally, more accurate Q value estimation is obtained.

The weights w of the neural network are initialized, and for each state S and action A, the combination of the states S and the action A is taken as input, and the corresponding Q value is obtained through the neural network. Clearing the experience playback set D refers to initializing it as an empty set. The experience playback set D is used to store data collected by the agent in the environment in previous interactions with the agent.

Step S14: the initialization state S is the first state of the current state sequence, and the feature vector is obtained；

And extracting the characteristic vector of the state by using a neural network according to the state generated by interaction of the AGV and the environment as a continuous state space. Since it is difficult to design manual features due to the large number of states in the continuous state space, the feature representation of the states is automatically learned using a neural network. Specifically, the state can be processed by using a convolutional neural network to obtain a high-dimensional characteristic vectorAs a representation of the state. Meanwhile, in order to cope with the situation that the training data distribution is exceeded, a data enhancement strategy is adopted, and an effective data set is enlarged by performing some transformation on the original state.

As shown in fig. 2: dueling DQN algorithm model diagram, in one embodiment, the step S2 includes: use in Q networksAs input, adding two sub-network structures in front of the output layer of the neural network, and obtaining an output action cost function Q by linearly adding a state-based cost function V and a state-based dominance function AF;

Step S21: building a convolutional neural network model, wherein the convolutional neural network model is formed by an input layer The data is input to carry out deep learning training;

The acquired road network data is imported and then the definition model is started. The model sets two convolution layers and a pooling layer, three Dropout layers, one planarization layer, two full connection layers. The ReLU activation function is used in the convolution layer, the ratio of the first and second Dropout layers is 0.25, the ratio of the third Dropout layer is 0.5, the ReLU function is used in the first full connection layer activation function, and the softmax function is used in the second full connection layer activation function. Kernel size represents the convolution Kernel size, padding represents zero padding to the same size, strides represents the padding stride. In the compiled model, the loss function is loss, the evaluation standard is accuracy, in the training model, the verification set is 20% of the training set, the training period is 30 times, and the batch size is 128. The formula of the convolutional neural network is shown as follows:

wherein Y represents an output value, f represents an activation function, w represents a weight matrix, x represents an input value, and b represents an offset.

Step S22: respectively establishing a sub-network structure model: a cost function V and a dominance function AF;

The Dueling DQN algorithm considers dividing the Q network into two parts, the first part being related to the state S only, independent of the action a to be adopted in particular, this part being called the cost function part, denoted V (S; w, α), and the second part being related to both the state S and the action a, this part being called the dominance function part, denoted AF (S, a; w, β). The final cost function is re-represented as follows:

where w is the network parameter of the public part, α is the network parameter of the unique part of the cost function, and β is the network parameter of the unique part of the dominance function; the Dueling DQN algorithm splits the abstract features extracted by the convolutional layer into two branches. One path represents a state value function V and represents the value of a static state environment; the other path represents an action dominance function AF of the dependent state, and represents the value additionally brought by selecting a certain action; and finally, the two paths are aggregated together to obtain the Q value of each action, so that the AGV is better adapted to different environments.

Step S23: after training data is output through the full connection layer, two sub-network structures are added in front of the output layer: the cost function V and the dominant function AF are subjected to linear addition and output;

After convolution and pooling operation, the obtained feature images are sequentially unfolded according to rows, connected into vectors, input into a fully connected network, respectively solve loss functions of a training set and a testing set to evaluate a model, and train by using gradient descent and a back propagation method.

In the output process, the method can be realized by the following formulaTo obtain the cost function, but this formula cannot identify the final output V (S; w, α) and AF (S, a; w, β), in order to embody such identifiability, the dominant function part is subjected to centering processing, and the actually used combination formula is shown as follows:

As shown in fig. 3: in one embodiment, the step S3 includes: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method to obtain time T spent on scheduling based on the action A, storing the time T in a set, and respectively evaluating the optimal actions when the time T is the same and different;

step S31: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method;

in order to better enable the intelligent agent to explore in the road network, an epsilon-greedy method is used for selecting the action A. In practical implementations, the probability of 1 ε is determined by the Q function, and ε is typically set to a small value and decreases over time, i.e., explores less and less.

Setting the exploration ε to 0.1, i.e., 90% probability, will determine the action according to the Q function, but 10% probability is random. Epsilon decreases with time in implementation, and at the beginning, because it is uncertain which action is better, it takes a great deal of effort to search. Then, as the training times are increased, the optimal Q value is obtained. At this time, the exploration is reduced, the epsilon value is reduced, and the action is determined according to the Q function, so that the intelligent agent can be clearly indicated in the road network.

Step S32: based on step S31, time T spent for scheduling by action a is obtained and stored in a collection;

When a new task arrives or the AGV completes the job, the schedule will be triggered. And then, after the system state module processes the complex real-time information, extracting key state information, and mainly dividing the key state information into task information and AGV state information. The status information is sent to a work module that constitutes the computational core of the system. The Q network module outputs the results to the combined action module by continuously training and learning the input states using Dueling DQN algorithm. And finally, feeding back the selected rule and the AGV to a dispatching system as a command for guiding the AGV to dispatch in real time.

And setting various scheduling rules to enable the AGV to finish the task in one period to be the shortest. The method comprises the steps of selecting tasks according to an arrival sequence, selecting tasks according to the shortest travel distance, selecting tasks with earliest deadlines, selecting tasks with longest waiting time and selecting tasks with closest AGVs to load points. And giving the rules different weight coefficients, finally obtaining the time T spent by scheduling according to the rules, and storing all the effective data in a set.

Step S33: based on the step S32, judging the time T spent by AGV scheduling in the set, and when the time T is different, selecting the corresponding scheduling strategy action A as the optimal action when the time T is the least;

Based on step S32, the time T in the set is determined, and when the times T in the set are not equal, the scheduling policy corresponding to the time T with the smallest time is determined as the optimal scheduling policy action, and this is taken as the scheduling action executed by the AGV in the next step. And only a minimum time T indicates that the action A corresponding to the time T is minimum in action, is most suitable for being used as the action to be executed next, can be directly executed without using a Q value, and finally refreshes the aggregation unit.

Step S34: based on step S32, determining the time T spent by the AGV scheduling in the set, and when the time T is the same, selecting the time T with the maximum action evaluation value Q as the optimal action;

When a plurality of equal minimum time T exists, selecting an action corresponding to the scheduling strategy A with the maximum action evaluation value Q as an optimal scheduling strategy action.

As shown in fig. 4: in one embodiment, the load balancing flowchart of the road network includes: designing a reward and punishment function, combining the path length and the road network load with the reward and punishment function to balance the road network load, and executing the current action A in the state S to obtain a feature vector corresponding to the new state S'The prize R and whether the state end is terminated, will {，A，R，End five-tuple is stored in experience playback set D;

step S41: designing a reward and punishment mechanism function, combining path length and road network load with the reward and punishment function, and considering load factors in the reward and punishment function in the deep reinforcement learning iterative process, wherein the reward and punishment function is set as shown in the following formula:

In the above equation r _t, δ∈ (0, 1) is the path length coefficient, d _t (x) is the sum of the paths taken by the AGV, d _t(x)= d₁+d₂+···+d_t, t∈ (1, n). η e (0, 1) is the Load factor, and Load (x) is the number of vehicles passing by the current node in the road network. When δ is equal to 0, the function considers only the load factor as a penalty value. When η is equal to 0, the function considers only the path length as a penalty value. When both are 0, the function does not take into account the load balancing situation. The path length and the load are set in a punishment function, so that when the AGV runs in the road network, an optimal path is selected according to the load capacity of the subareas or the length of the running length, a local high-load area is avoided, and finally the effect of optimizing the whole road network is achieved.

The reward R is generated after the AGV interacts with the environment to generate action A and the environment state changes from state S to new state S'. The accumulated rewards obtained after the AGV executes a sequence of actions are generally used for judging the advantages and disadvantages of the strategy, and the larger the accumulated rewards are, the better the strategy is considered, and the sum formula of the state accumulated rewards is shown as the following formula:

In the above formula, r _t +1 is the reward of the AGV for selecting and making environmental feedback after the action at the time t+1. Gamma e (0, 1) is a discount factor, and when the value of gamma is equal to 0, the AGV only considers the return of the next step. As the gamma value approaches 1, future rewards will be more taken into account. Sometimes the current prize is of greater concern, sometimes the future prize is of greater concern, and the gamma value is adjusted to 0.9.

After the punishment function is improved, the running times of the AGV are set to be 100u (u epsilon N ⁺) times of task quantity. And recording the path length of each AGV running in the road network, and extracting the road network load of each point in the road network map. The essence of AGV load balancing is that the load factor is considered in the actual path cost based on Dueling DQN algorithm. I.e. the combination of the running distance and the road network load, the setting of the punishment function can directly influence the running efficiency of the AGV in the road network.

Step S42: after the reward and punishment functions are designed, a Dueling DQN deep reinforcement learning algorithm is used for enabling a plurality of AGVs to interact with the road network environment, a high-load area in the road network is optimized, and the overall passing efficiency of the intelligent AGVs is improved;

after the design of the reward and punishment function is completed, the running times of the AGV are set to be 100t (t epsilon N ⁺) times of task quantity. And recording the path length of each AGV running in the road network, and extracting the road network load of each point in the map in the dispatching system. The essence of AGV load balancing is to take the load factor into account the actual path cost. I.e. the combination of the running distance and the road network load, the setting of the punishment function can directly influence the running efficiency of the AGV in the road network.

When an AGV load balancing experiment is carried out in a road network, the starting point and the end point of 100t AGV tasks are kept unchanged. And comparing the road network load conditions before and after load balancing, and respectively measuring and calculating the shortest running time of vehicle dispatching before and after balancing. The delta and eta values are tested for multiple times in the experiment to obtain an optimal value, and when the delta and eta values are both 0, the optimal value is the condition before load balancing. Firstly, one AGV operates in a road network, and then the load data of the current operation is updated. The same method is used by the following AGVs until all AGVs have updated the road network load.

Step S43: executing the current action A in the state S to obtain the feature vector corresponding to the new state SRewards R and whether to terminate the state end, and will {, A, R, End five-tuple is stored in experience playback set D;

And randomly selecting an initial state S, selecting an action A from all actions in the current state S to obtain a next state S', wherein each step of action is executed by the intelligent AGV, the system can score the action, the AGV can obtain a reward value, and finally, after a target task is achieved, the rewards obtained in each step are added, so that the finally obtained target rewards are obtained. Based on the state S ', a training module is built in the Q network by the model and used as production data to be input into the next step, the model can carry out real-time self-adaptive scheduling, a plurality of scheduling strategy actions A ' of production jobs are selected, and the scheduling strategy actions A ' are optimal actions selected based on the real-time state, so that an AGV can walk out of an optimal route in the scheduling process.

The state cost function and the dominance function are respectively modeled by combining the designed punishment and punishment functions, so that the AGV only pays attention to the value of the state in certain situations and does not pay attention to the difference caused by different actions, and the state and the dominance function are modeled separately at the moment, so that the state which is less relevant to the actions can be better processed by an agent. When there is no vehicle in front of the running AGV, the vehicle itself is not much different, at which time the AGV is more concerned with status value, while when there is a car in front of the AGV (the agent needs to overtake), the agent begins to pay attention to the difference of different action dominance values. At this time, action A is executed in the current state S, and feature vectors corresponding to the new state S' are sequentially obtainedThe prize R and whether the end state is terminated, and finally these values are stored in the set D.

In the step, each AGV generates a series of actions and environments to interact through an intelligent agent, obtains the state and rewards of the next moment after taking actions in the current state, then stores the state and rewards in a memory bank D of the step, randomly extracts a certain number of memories from the memory bank D as samples after accumulating a certain number of steps for memory, and learns; when the AGV reaches the final end point through continuous iteration, giving the AGV an end point reward according to a set reward and punishment function, wherein before the AGV walks to the end point, each step has single-step punishment, and a plurality of AGV position punishments and trap punishments, and setting the punishment values is also used for enabling the AGV to make mistakes as little as possible, walk to the end point as soon as possible, and obtain the maximum reward; the discount factor gamma is related to the time domain and is set for the AGV to obtain the maximum reward as soon as possible; action selection using the epsilon-greedy method, the initial value epsilon cannot generally be 0, as epsilon is not calculated to be inside the final average value, and epsilon generally decreases gradually over time.

As shown in fig. 5: task scheduling structure based on Dueling DQN algorithm, in one embodiment, the step S5 includes: sampling m samples from the experience playback set D to calculate a current target Q value y _j, and updating all parameters w of the Q network by gradient back propagation of the neural network by using a mean square error loss function;

Step S51: sampling m samples in the experience playback set D to calculate a current target Q value y _j;

And calculating Q (S, A, w, alpha, beta), judging whether the obtained Q matrix is converged or not, if not, returning to re-calculation, and if so, carrying out the next step to finish the learning of the target Q value. Wherein s1, s2, s3, s4 respectively represent the state of each time of the intelligent AGV, and a1, a2, a3, a4 respectively represent the corresponding action generated by each state. And finally judging whether the obtained Q value is the choice under the optimal action, and completing the calculation of the target Q value.

Generally, when the Q value is estimated by using a convolutional neural network, the Q value is overestimated, according to the formulaThe state-cost function is deformed as shown in the following formula:

Quantitative analysis Q _w-(s, a)-V^* for overestimation of the Q value obeys the uniform independent same distribution between [ -1, 1] and the action space size is set as h, then for any state s:

The estimation error is noted as σ=q _w-(s, a)-max_a'Q^* (s, a'), since the estimation error is independent for different actions, there are:

P (σ _a.ltoreq.x) is a cumulative distribution function of σ _a, which can be written specifically as:

Thus, the cumulative distribution function for max _aσ_a is found as follows:

the final deformation can be obtained:

In an actual road network environment, when AGV scheduling based on Dueling DQN algorithm calculates a current target Q value, the Q value can be increased along with the increase of the action space h, and in an environment with more action choices, the Q value can have the problem of overestimation, and based on the problem, the selection of the action space is calculated with emphasis, so that the model can find the Q value which makes the AGV scheduling time shortest when estimating the Q value.

Step S52: updating all parameters w of the Q network through gradient back propagation of the neural network by using a mean square error loss function;

Training and updating according to the designed state function and the designed advantage function and the following formula:

Wherein the method comprises the steps of

In the gradient calculation process, in order to make the dominance function of the neural network zero, the estimated dominance function and the controlled dominance function are used to obtain a zero equation:

at this time, for the case of policy evaluation, the expression of the Q value is as in the formula of step S33 As shown, for the optimal control case, the improvement is as follows:

After the improvement is finished, the optimal action can be accurately learned by the V (S; w, beta), so that the advantage function AF (S, A'; w, alpha) is ensured to be more accurate.

In one embodiment, in the step S6, if S' is in the end state, the steps S2 to S5 are repeated until the task is completed, including:

When the step S6 is completed, if the state S' is a termination state or the loss function is too large to enable the model to converge, the steps S2-S5 are continuously repeated; if the operation is not in the termination state, the operation is completed, and finally, the shortest scheduling time under different requirements is respectively compared through different requirements to obtain the optimal strategy.

The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm is characterized by comprising the following steps:

Step S1: collecting intelligent warehouse workshop operation data, preprocessing, constructing a Markov decision model, randomly initializing all states and values Q corresponding to actions, initializing all network parameters, emptying a experience playback set D, and completing data state modeling; randomly extracting data after state modeling, initializing a state S as a first state of a current state sequence, and acquiring a characteristic vector phi (S) of the data;

step S2: using phi(s) as input in a Q network, adding two sub-network structures in front of an output layer of a neural network, and linearly adding an output action cost function Q by a state-based cost function V and a state-based dominance function AF;

Step S3: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method to obtain time T spent on scheduling based on the action A, storing the time T in a set, and respectively evaluating the optimal actions when the time T is the same and different; the step S3 includes:

When a new task arrives or the AGV completes work, triggering scheduling; then, after processing the complex real-time information, the system state module extracts key state information which is divided into task information and AGV state information; the status information is sent to a work module, which constitutes the computational core of the system; the Q network module carries out continuous training and learning on the input state by using Dueling DQN algorithm and outputs the result to the combined action module; finally, feeding back the selected rule and the AGV to a dispatching system to be used as a command for guiding the AGV to dispatch in real time;

Step S4: designing a reward and punishment function, combining the path length and the road network load with the reward and punishment function to balance the road network load, executing the current action A in the state S to obtain a characteristic vector phi (S ') corresponding to the new state S ', a reward R and whether to terminate the state end, and storing the { phi (S), A, R, phi (S ') end five-tuple into an experience playback set D; the step S4 includes:

Step S41: designing a reward and punishment mechanism function, combining path length and road network load with the reward and punishment function, considering load factors in the reward and punishment function in the deep reinforcement learning iterative process, and setting the function as follows:

In the above formula r _t, δ∈ (0, 1) is a path length coefficient, d _t (x) is the sum of paths travelled by the AGV, d _t(x)＝d₁+d₂+···+d_t, t∈ (1, n), η∈ (0, 1) is a Load coefficient, and Load (x) is the number of vehicles passing by the current node in the road network;

step S43: executing the current action A in the state S to obtain a feature vector phi (S '), a reward R and whether to terminate the state end corresponding to the new state S ', and storing { phi (S), A, R, phi (S ') end } five-tuple into an experience playback set D;

Randomly selecting an initial state S, selecting an action A from all possible actions in the current state S to obtain a next state S', wherein each step of action is executed by the intelligent AGV, the system can score the action, the AGV can obtain a reward value, and after a target task is finally achieved, the rewards obtained in each step are added to obtain the target reward finally; based on the state S ', a training module is built in a Q network by the model and is used as production data to be input into the next step, the model can carry out real-time self-adaptive scheduling, a plurality of scheduling strategy actions A ' of production jobs are selected, and the scheduling strategy actions A ' are optimal actions selected based on the real-time state, so that an AGV can walk out of an optimal route in the scheduling process;

Step S5: sampling m samples from the experience playback set D to calculate a current target Q value y _j, and updating all parameters w of the Q network by gradient back propagation of the neural network by using a mean square error loss function; the step S5 includes:

the state-cost function is deformed as shown in the following formula:

Quantitative analysis Q _w-(s,a)-V^* for overestimation of the Q value obeys the uniform independent same distribution between [ -1,1] and the action space size is set as h, then for any state s:

the estimation error is noted as σ=q _w-(s,a)-max_a'Q^* (s, a'), since the estimation error is independent for different actions, there are:

p (σ _a.ltoreq.x) is the cumulative distribution function of σ _a, which is written specifically as:

Thus, the cumulative distribution function for max _aσ_a is found as follows:

And finally, deforming to obtain:

In an actual road network environment, when AGV scheduling based on Dueling DQN algorithm calculates a current target Q value, the Q value can be increased along with the increase of the action space h, and in an environment with more action selection numbers, the Q value can have the problem of overestimation; based on this, the selection of the action space is calculated to ensure that the model finds the Q value that minimizes the AGV schedule time when estimating the Q value;

step S52: updating all network parameters w of the Q network by gradient back propagation of the neural network by using a mean square error loss function;

Wherein the method comprises the steps of

y_i＝r+γmax_a′Q(s′,a′;w^-)

At this time, for the case of policy evaluation, the expression of the Q value is shown as the formula of step S33, and for the case of optimal control, the following expression is improved:

Q(S,A；w,α,β)＝V(S；w,α)+(AF(S,A；w,β)-max_a′∈ΛAF(S,a′；w,β))

after the improvement is finished, the optimal action of the reverse force V (S; w, alpha) is accurately learned, so that the advantage function AF (S, a'; w, beta) is ensured to be more accurate;

Step S6: the state S' is a termination state or the loss function is too large to enable the model to be converged, and the steps S2 to S5 are continuously repeated; if the operation is not in the termination state, the operation is completed, and finally, the shortest scheduling time under different requirements is respectively compared through different requirements to obtain the optimal strategy.

2. The method for balancing loads and scheduling tasks of multiple AGVs based on Dueling DQN algorithm as claimed in claim 1, wherein said step S1 includes:

Establishing a Markov decision model, wherein the definition of the Markov decision process is < S, A, P, R, gamma >, the state S represents the set of all states S, the action A represents all sets related to the action a in the decision process, the P represents the conditional probability of selecting the action A when the state S is present, and the R represents the accumulated return; the final goal is to achieve prize maximization; the transmission probability matrix and the reward function are used for defining, and the specific formula is as follows:

After defining the transmission probability matrix and the reward function, the definition of the policy pi is as follows:

π(a|s)＝P(A_t＝a|S_t＝s)

a state cost function and an action cost function, as shown in the following equation:

v _π(s) represents the expectation of benefit in the state s, representing the value brought by the state; q _π (s, a) represents the expectation of revenue after taking action a in state s, representing the value brought by the action; g _t represents the sum of accumulated rewards generated by the agent when interacting with the environment;

Classifying the collected production operation data, wherein the target object state is the target object 1 state, the target object 2 state, the target object 3 state, … … and the target object i state, converting a value function based on a strategy by utilizing a Bellman equation, solving the equation through value iteration, and solving an optimal solution;

Step S13: randomly initializing all network parameters w of a Q network, initializing values Q corresponding to all states and actions based on w, and emptying a set D of experience playback;

step S14: the initialization state S is the first state of the current state sequence, and the feature vector phi (S) is acquired.

3. The method for balancing loads and scheduling tasks of multiple AGVs based on the Dueling DQN algorithm according to claim 1, wherein the step S2 comprises:

Step S21: building a convolutional neural network model, and inputting phi(s) into the data by an input layer to perform deep learning training;

Importing the acquired road network data, and then starting defining a model; the model sets two convolution layers and a pooling layer, three Dropout layers, a flattening layer and two full connection layers; the method comprises the steps that a ReLU activation function is used as an activation function in a convolution layer, the ratio of a first Dropout layer to a second Dropout layer is 0.25, the ratio of a third Dropout layer is 0.5, the ReLU function is used as a first full-connection layer activation function, and the softmax function is used as a second full-connection layer activation function; kernel_size represents the convolution Kernel size, padding represents zero padding to the same size, strides represents the padding stride; in the compiling model, the loss function is loss, the evaluation standard is accuracy, in the training model, the verification set is 20% of the training set, the training period is 30 times, and the batch size is 128; the formula of the convolutional neural network is shown as follows: wherein Y represents an output value, f represents an activation function, w represents a network parameter, x represents an input value, and b represents an offset;

Y＝f(w*x+b)

the Dueling DQN algorithm considers the Q network to be divided into two parts, the first part being related to the state S only, independent of the action a to be adopted specifically, this part being called the cost function part, denoted V (S; w, α), the second part being related to both the state S and the action a, this part being called the dominance function part, denoted AF (S, a; w, β), the final cost function being expressed again as follows:

Q(S,A；w,α,β)＝V(S；w,α)+AF(S,A；w,β)

where w is the network parameter of the Q network, α is the network parameter of the unique part of the cost function, and β is the network parameter of the unique part of the dominance function; the Dueling DQN algorithm divides the abstract features extracted by the convolution layer into two branches; one path represents a state value function V and represents the value of a static state environment; the other path represents an action dominance function AF of the dependent state, and represents the value additionally brought by selecting a certain action; finally, the two paths are aggregated together to obtain the Q value of each action, so that the AGV is better adapted to different environments;

After convolution and pooling operation, sequentially expanding the obtained feature graphs according to rows, connecting the feature graphs into vectors, inputting the vectors into a fully-connected network, respectively solving loss functions of a training set and a testing set to evaluate a model, and training by using a gradient descent and back propagation method;

in the output process, a cost function is obtained, but the equation cannot identify the V (S; w, alpha) and AF (S, A; w, beta) in the final output, and in order to embody the identifiability, the dominant function part is subjected to centering treatment, and the actually used combination formula is shown as follows: