CN117474295B - Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method - Google Patents

Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method

Info

Publication number
CN117474295B
CN117474295B (application CN202311805708.XA)
Authority
CN
China
Prior art keywords
state
function
action
value
agv
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311805708.XA
Other languages
Chinese (zh)
Other versions
CN117474295A (en)
Inventor
张秀梅
李文松
李慧
刘芳
刘方达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Technology filed Critical Changchun University of Technology
Priority to CN202311805708.XA priority Critical patent/CN117474295B/en
Publication of CN117474295A publication Critical patent/CN117474295A/en
Application granted granted Critical
Publication of CN117474295B publication Critical patent/CN117474295B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/087Inventory or stock management, e.g. order filling, procurement or balancing against orders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04Manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Manufacturing & Machinery (AREA)
  • Educational Administration (AREA)
  • Primary Health Care (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)

Abstract

The invention provides a multi-AGV load balancing and task scheduling method based on the Dueling DQN algorithm, relating to the field of automated distribution for warehouse logistics in intelligent workshops. Production job data are collected at the plant and a Markov decision model is built from these data. Training data samples are set, the neural network structure is optimized with the Dueling DQN algorithm, and the output-layer action cost function Q is taken as the linear sum of the cost function and the dominance function, so the two are modeled separately and the agent can better handle states that are only weakly associated with the action. A relationship between the reward and punishment function and the road network load is constructed, with both path length and road network load integrated into the reward and punishment function. A task scheduling matching and execution mechanism that attends to the difference between state values and action dominance values is constructed and can be applied to AGV task scheduling in warehouse workshops. Compared with the prior art, the method efficiently optimizes road network load, accurately matches scheduling strategies to different states and actions, and greatly improves production efficiency.

Description

Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method
Technical Field
The invention belongs to the field of automated distribution for warehouse logistics in intelligent workshops, and particularly relates to a multi-AGV load balancing and task scheduling method based on the Dueling DQN algorithm.
Background
Against the broader background of intelligent manufacturing, job scheduling in intelligent warehouse workshops has become one of the key links for improving resource utilization and, in turn, enterprise operating benefits. Avoiding conflicts among multiple AGVs in the workshop road network, improving the scheduling efficiency of the intelligent workshop, and making rational use of road network resources are the central issues in the study of load balancing and task scheduling.
At present, viewed from the application scenario, research on the AGV scheduling problem can be divided into path planning and task allocation. Single-AGV path planning only needs to consider how one AGV bypasses obstacles to find an optimal path; research on this problem is mature, graph-theoretic algorithms are usually adopted, and many heuristic algorithms have been applied to it. In practice, however, many AGVs execute tasks simultaneously, mutual interference and collisions arise among them, and the large AGV fleets of logistics factories cause congestion across the load of the entire road network. For the load balancing problem, most heuristic algorithms lack adequate stability, and their performance depends directly on the simplicity of the problem and the experience of the expert.
When AGVs are scheduled, several scheduling rules exist, such as selecting tasks by arrival order, by shortest travel distance, or by longest waiting time, and a designer chooses different rules according to different requirements. Common algorithms include the A* algorithm, the genetic algorithm and the simulated annealing algorithm; because of their instability, each AGV can only be scheduled individually on a small scale. To schedule AGVs at large scale, some researchers use deep reinforcement learning for intelligent workshop task scheduling: multiple AGVs are placed in a road network and interact with the environment in real time, the agent randomly selects an action according to the current task state, the action is then scored, different reward rules can be designed for different user requirements, and iterative updates proceed in turn until the task is completed.
The deep reinforcement learning DQN algorithm can obtain good solutions to the large-scale AGV scheduling problem, but using it alone causes overestimation, so the scheduling results carry large errors and the trained model performs poorly. To improve training, Nature DQN uses two networks of identical structure, computing the target Q value with a separate target Q network so as to break the correlation between the data samples and the network being trained, but the accuracy of the target Q value is still not guaranteed. The Double DQN algorithm removes the overestimation problem by decoupling the selection of the target action from the calculation of the target Q value, but it requires a large amount of experience to train, which is hard to obtain at the start, so its initial performance is poor.
Disclosure of Invention
To solve the above problems, embodiments of the invention provide a multi-AGV load balancing and task scheduling method based on the Dueling DQN algorithm. The reward and punishment function is improved by combining the path length and a load coefficient with it, optimizing high-load areas of the road network and improving overall operating smoothness. The neural network structure at the output end is optimized, and a task scheduling matching and execution mechanism is constructed that attends to the difference between the state value and the action dominance value under different conditions: the attended state cost function and the attended state-action dominance function are added linearly to output the action cost function Q. Finally, actions are selected on the basis of the designed scheduling time cost T, and task scheduling is built together with the data in the experience pool.
The technical scheme adopted to solve these technical problems is as follows: a multi-AGV load balancing and task scheduling method based on the Dueling DQN algorithm, comprising the following steps:
Step S1: collecting intelligent warehouse workshop operation data and preprocessing it, constructing a Markov decision model, randomly initializing the values Q corresponding to all states and actions, initializing all network parameters, and emptying the experience playback set D to complete data state modeling; randomly extracting data after state modeling, initializing a state S as the first state of the current state sequence, and obtaining its feature vector φ(S);
Step S2: using φ(S) as input to the Q network, adding two sub-network structures before the output layer of the neural network, and obtaining the output action cost function Q by linearly adding a state-based cost function V and a state-based dominance function AF;
Step S3: selecting the corresponding action A from the current Q-value output using the ε-greedy method, obtaining the time T spent on scheduling under action A and storing it in a set, and evaluating the optimal action separately for the cases where the times T are equal and where they differ;
Step S4: designing a reward and punishment function, combining the path length and the road network load with it to balance the road network load; executing the current action A in state S to obtain the feature vector φ(S') corresponding to the new state S', the reward R and the termination flag end, and storing the five-tuple {φ(S), A, R, φ(S'), end} in the experience playback set D;
Step S5: sampling m samples from the experience playback set D to calculate the current target Q value y_j, and updating all parameters w of the Q network by gradient back propagation of the neural network using a mean square error loss function;
Step S6: if S' is the termination state, repeating steps S2-S5 until the task is completed.
The beneficial effects of the invention are as follows:
1. By adopting the Dueling DQN algorithm from deep reinforcement learning, the invention models the state cost function and the dominance function separately, so that when scheduling under real conditions the AGV can better handle states that have little relation to the action. When no other vehicles are around the AGV, only the state is attended to; when other vehicles are present, attention turns to the difference between the dominance values of different actions; finally the action cost function Q is output as the linear sum of the attended state cost function and the attended state-action dominance function.
2. Aiming at the road network load congestion problem in AGV scheduling, a dynamic reward and punishment function is designed. Load factors are taken into account in the reward and punishment function during the deep reinforcement learning iterations, the combination of path length and road network load is proposed as the reward and punishment function, and the reward value is adjusted according to real-time changes in road network congestion. The road network load is finally balanced, avoiding the slowdowns and path conflicts that road network congestion would otherwise cause among a large number of AGVs.
Drawings
FIG. 1 is a flow chart of road network load balancing and task scheduling;
FIG. 2 is a Dueling DQN algorithm model diagram;
FIG. 3 is a flow chart of an AGV scheduling strategy;
FIG. 4 is a flow chart of road network load balancing;
FIG. 5 is a task scheduling block diagram based on the Dueling DQN algorithm.
Detailed Description
The following description is provided to better explain the present embodiment. Some parts of the figures may be omitted, enlarged or reduced, and do not represent actual dimensions;
It will be appreciated by those skilled in the art that descriptions of certain well-known elements in the drawings may be omitted to some extent;
The invention provides a multi-AGV load balancing and task scheduling method based on the Dueling DQN algorithm, relating to intelligent warehouse workshop job scheduling. Vehicle trajectory and operation data are collected in the intelligent warehouse workshop as historical big data, and a Markov decision model is built from the collected data. Training data samples are set, the neural network structure is optimized with the Dueling DQN algorithm, and the action cost function Q of the output layer is taken as the linear sum of the cost function and the dominance function, so the two are modeled separately and the agent can better handle states that are only weakly associated with the action. The link between the reward and punishment function and the road network load is constructed, and path length and road network load are integrated into the reward and punishment function, so the road network congestion problem can be better resolved. A task scheduling matching and execution mechanism that attends to the difference between state values and action dominance values under different conditions is constructed; scheduling is matched according to the different states and actions, and scheduling rules are generated in time to guide the next operation, so task scheduling is realized rapidly. The algorithm efficiently optimizes road network load, accurately matches scheduling strategies to different states and actions, greatly saves time cost, and improves production efficiency.
The invention is described in further detail below with reference to the drawings and examples.
As shown in fig. 1: the embodiment of the invention provides a multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm, which comprises the following steps:
Step S1: collecting intelligent warehouse workshop operation data and preprocessing it, constructing a Markov decision model, randomly initializing the values Q corresponding to all states and actions, initializing all network parameters, and emptying the experience playback set D to complete data state modeling; randomly extracting data after state modeling, initializing a state S as the first state of the current state sequence, and obtaining its feature vector φ(S);
Step S2: using φ(S) as input to the Q network, adding two sub-network structures before the output layer of the neural network, and obtaining the output action cost function Q by linearly adding a state-based cost function V and a state-based dominance function AF;
Step S3: selecting the corresponding action A from the current Q-value output using the ε-greedy method, obtaining the time T spent on scheduling under action A and storing it in a set, and evaluating the optimal action separately for the cases where the times T are equal and where they differ;
Step S4: designing a reward and punishment function, combining the path length and the road network load with it to balance the road network load; executing the current action A in state S to obtain the feature vector φ(S') corresponding to the new state S', the reward R and the termination flag end, and storing the five-tuple {φ(S), A, R, φ(S'), end} in the experience playback set D;
Step S5: sampling m samples from the experience playback set D to calculate the current target Q value y_j, and updating all parameters w of the Q network by gradient back propagation of the neural network using a mean square error loss function;
Step S6: if S' is the termination state, repeating steps S2-S5 until the task is completed.
In one embodiment, the step S1 includes: collecting intelligent warehouse workshop operation data and preprocessing it, constructing a Markov decision model, randomly initializing the values Q corresponding to all states and actions, initializing all network parameters, and emptying the experience playback set D to complete data state modeling; randomly extracting data after state modeling, initializing a state S as the first state of the current state sequence, and obtaining its feature vector φ(S).
Step S11: preprocessing and classifying the acquired AGV data, defining a policy π based on the established Markov decision model, and formulating cost functions;
AGV production operation data are collected in the intelligent warehouse workshop and classified according to the different task requirements. The acquired data are preprocessed, and the different categories of data are handled through data cleaning, data integration, data reduction and data transformation respectively.
A Markov decision model is built. The Markov decision process is defined as <S, A, P, R, γ>, where the state set S contains all states s, the action set A contains all actions a involved in the decision process, P is the state-transition probability conditioned on taking action a in state s, R is the cumulative reward (the final goal being to maximize the reward), and γ is the discount factor. The transition probability matrix and the reward function are defined by the following formulas:
P_ss'^a = P(S_t+1 = s' | S_t = s, A_t = a)
R_s^a = E[R_t+1 | S_t = s, A_t = a]
After the transition probability matrix and the reward function are defined, the agent in the Markov decision model chooses its actions according to the state, and the final choices should make the environment progressively better. A policy is the way the agent acts on the environment, and the policy π is defined as shown in the following formula:
π(a|s) = P(A_t = a | S_t = s)
The policy π is related only to the current state and is independent of history and time. After the policy of the Markov model is defined, cost functions are formulated on the basis of the policy, namely a state cost function and an action cost function, as shown in the following formulas:
v_π(s) = E_π[G_t | S_t = s]
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
v_π(s) represents the expectation of benefit in the state s, i.e., the value brought by the state; q_π(s, a) represents the expectation of benefit after taking action a in state s, i.e., the value brought by the action; G_t represents the sum of the cumulative rewards generated by the agent when interacting with the environment.
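For reference only, the definitions above lead to the Bellman expectation equations that step S12 solves by value iteration; the following restatement in standard notation is an added sketch and is not taken from the original text.

```latex
% Bellman expectation equations implied by the definitions of v_pi and q_pi above
v_\pi(s) = \sum_{a} \pi(a \mid s)\Big( R_s^a + \gamma \sum_{s'} P_{ss'}^{a}\, v_\pi(s') \Big),
\qquad
q_\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^{a} \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a').
```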
Step S12: completing the state modeling of the data and solving the Bellman equation by value iteration;
The collected production operation data are classified, links are integrated according to the logic of workshop production, and node states are created for each of the divided objects. The objects are divided into five classes: the AGV normal running state, the AGV congestion/stagnation state, the single-cycle end efficiency value state, the same-material reworking state, and the road network congestion level state. For these five classes of objects, the state classification is based on the AGVs. The target object state is <target object 1 state, target object 2 state, target object 3 state, ..., target object i state>. On this definition, the state of target object i is converted into a policy-based value function using the Bellman equation, which is solved through value iteration to obtain the optimal solution.
All the different states of the target objects can be linked to form the intelligent manufacturing scheduling system of the whole workshop. The multidimensional data of each state are mapped into an experience storage unit; all data states of the AGVs running over a given time sequence during production are extracted according to the time-sequence relation, and all state nodes are mapped one by one and numbered to obtain the running state of the AGVs of the whole dispatching system. The established Markov decision model links the state data of every dimension in the road network into a whole, finally yielding the complete Markov-state AGV data. In a practical environment, an average speed and a rest speed are set for the AGVs: the rest speed is v = 0 m/s and the average speed is v = 1 m/s. An infrared sensor is also installed on every AGV so that it can perceive obstacles and other AGVs, achieving better obstacle avoidance and reducing load scheduling time.
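For illustration only, the value iteration mentioned in step S12 can be sketched as below; the transition table, rewards, discount factor and convergence threshold are placeholders chosen for the example and are not values given in the patent.

```python
# Minimal value-iteration sketch for the Bellman equation of step S12 (illustrative only).
# The five AGV state classes are taken from the text; transitions and rewards are placeholders.
GAMMA = 0.9    # discount factor
THETA = 1e-6   # convergence threshold

states = ["normal_run", "congestion_stall", "cycle_end_efficiency",
          "material_rework", "network_congestion_level"]
actions = ["keep", "reroute"]

# P[s][a] -> list of (probability, next_state, reward); assumed for illustration.
P = {s: {a: [(1.0, "normal_run", 1.0 if s != "congestion_stall" else -1.0)]
         for a in actions} for s in states}

def value_iteration(P, gamma=GAMMA, theta=THETA):
    """Solve V(s) = max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

print(value_iteration(P))
```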
In one embodiment, the step S1 further includes: randomly initializing the values Q corresponding to all states and actions, initializing all parameters of the network, and emptying the experience playback set D; randomly extracting data after state modeling, initializing the state S as the first state of the current state sequence, and obtaining its feature vector φ(S).
Step S13: randomly initializing all parameters w of the Q network, initializing the values Q corresponding to all states and actions on the basis of w, and emptying the experience playback set D;
A pre-training dataset is prepared containing states, actions, rewards and next states. A neural network with the same structure as the Q network is defined as the pre-training network; it does not need to output Q values, only a feature vector (state representation), so a smaller output layer is used. In the pre-training network the input state vector is mapped to a low-dimensional feature vector, using a convolutional neural network for feature extraction. Taking the obtained feature vector as input, a logistic regression classifier is trained to predict the next state. The parameters of the pre-training network are then used as the initialization parameters w of the Q network, and reinforcement learning training is carried out. Initializing the Q network parameters w by pre-training lets the Q network start from a better starting point, speeds up network convergence, and finally yields more accurate Q value estimates.
The weights w of the neural network are initialized; for each state S and action A, the combination of S and A is taken as input and the corresponding Q value is obtained through the neural network. Clearing the experience playback set D means initializing it as an empty set; the set D is used to store the data collected by the agent in its previous interactions with the environment.
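For illustration only, the experience playback set D could be organised as in the following sketch; the capacity and field names are assumptions made for the example rather than details given in the patent.

```python
# Minimal sketch of the experience playback set D used in steps S13, S4 and S5.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # emptying D = starting from an empty set

    def push(self, phi_s, action, reward, phi_s_next, end):
        # store the five-tuple {phi(S), A, R, phi(S'), end}
        self.buffer.append((phi_s, action, reward, phi_s_next, end))

    def sample(self, m):
        # draw m samples for computing the target Q values y_j in step S5
        return random.sample(self.buffer, m)

    def __len__(self):
        return len(self.buffer)
```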
Step S14: the initialization state S is the first state of the current state sequence, and the feature vector φ(S) is obtained;
The states generated by the interaction of the AGVs with the environment form a continuous state space, and a neural network is used to extract feature vectors from them. Because the large number of states in a continuous state space makes hand-designed features difficult, the feature representation of the states is learned automatically by a neural network. Specifically, the state is processed with a convolutional neural network to obtain a high-dimensional feature vector φ(S) as the representation of the state. Meanwhile, to cope with situations that fall outside the training data distribution, a data augmentation strategy is adopted, enlarging the effective data set by applying transformations to the original states.
As shown in fig. 2, the Dueling DQN algorithm model diagram: in one embodiment, the step S2 includes: using φ(S) as input to the Q network, adding two sub-network structures before the output layer of the neural network, and obtaining the output action cost function Q by linearly adding a state-based cost function V and a state-based dominance function AF;
Step S21: building a convolutional neural network model in which the data φ(S) are fed through the input layer for deep learning training;
The acquired road network data are imported and the model is then defined. The model has two convolution layers and a pooling layer, three Dropout layers, one flattening layer and two fully connected layers. The ReLU activation function is used in the convolution layers; the ratio of the first and second Dropout layers is 0.25 and that of the third Dropout layer is 0.5; the first fully connected layer uses the ReLU activation function and the second uses the softmax function. Kernel_size denotes the convolution kernel size, padding denotes zero padding to the same size, and strides denotes the stride. In the compiled model the loss function is loss and the evaluation metric is accuracy; in the training model the validation set is 20% of the training set, the number of training epochs is 30, and the batch size is 128. The formula of the convolutional neural network is shown below:
Y = f(w*x + b)
wherein Y represents an output value, f represents an activation function, w represents a weight matrix, x represents an input value, and b represents an offset.
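For illustration only, such a layer stack might be realised as in the following PyTorch sketch; the channel counts, kernel sizes and the 84x84 input resolution are assumptions, while the layer types and Dropout ratios follow the description above.

```python
# Illustrative sketch of the feature-extraction network of step S21 (not the patented model itself).
import torch
import torch.nn as nn

class StateFeatureCNN(nn.Module):
    def __init__(self, in_channels=1, num_outputs=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Dropout(0.25),                       # first Dropout layer, ratio 0.25
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # pooling layer
            nn.Dropout(0.25),                       # second Dropout layer, ratio 0.25
            nn.Flatten(),                           # flattening layer
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 42 * 42, 128), nn.ReLU(),         # first fully connected layer
            nn.Dropout(0.5),                                  # third Dropout layer, ratio 0.5
            nn.Linear(128, num_outputs), nn.Softmax(dim=1),   # second fully connected layer
        )

    def forward(self, x):                # Y = f(w*x + b) applied layer by layer
        return self.classifier(self.features(x))

phi_s = StateFeatureCNN()(torch.zeros(1, 1, 84, 84))   # dummy 84x84 state image
```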
Step S22: respectively establishing a sub-network structure model: a cost function V and a dominance function AF;
The Dueling DQN algorithm divides the Q network into two parts. The first part is related only to the state S and is independent of the particular action A to be taken; it is called the cost function part and is denoted V(S; w, α). The second part is related to both the state S and the action A; it is called the dominance function part and is denoted AF(S, A; w, β). The final action cost function is then re-expressed as follows:
Q(S, A; w, α, β) = V(S; w, α) + AF(S, A; w, β)
where w is the network parameter of the common part, α is the network parameter of the part unique to the cost function, and β is the network parameter of the part unique to the dominance function. The Dueling DQN algorithm splits the abstract features extracted by the convolutional layers into two branches: one branch represents the state cost function V, i.e., the value of the static state environment; the other branch represents the state-dependent action dominance function AF, i.e., the additional value brought by choosing a particular action. Finally the two branches are aggregated to obtain the Q value of each action, so that the AGV adapts better to different environments.
Step S23: after the training data have passed through the fully connected layers, the two sub-network structures added before the output layer, the cost function V and the dominance function AF, are linearly added and output;
After the convolution and pooling operations, the resulting feature maps are unrolled row by row, concatenated into vectors and fed into the fully connected network; the loss functions of the training set and the test set are computed separately to evaluate the model, and training uses gradient descent with the back-propagation method.
In the output stage, the action cost function could be obtained from the formula Q(S, A; w, α, β) = V(S; w, α) + AF(S, A; w, β), but this formula cannot identify the final outputs V(S; w, α) and AF(S, A; w, β) uniquely. To provide such identifiability, the dominance function part is centred, and the combination formula actually used is as follows:
Q(S, A; w, α, β) = V(S; w, α) + (AF(S, A; w, β) - (1/|Λ|)·Σ_a'∈Λ AF(S, a'; w, β))
where Λ is the action set.
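For illustration only, the two sub-network structures and the centred combination of step S23 could be sketched as follows; the hidden width and feature dimension are assumptions.

```python
# Illustrative sketch of the dueling output head: Q = V + (AF - mean AF).
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim=64, num_actions=5):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))                # V(S; w, alpha)
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                       nn.Linear(128, num_actions))  # AF(S, A; w, beta)

    def forward(self, phi_s):
        v = self.value(phi_s)                    # state cost function branch
        af = self.advantage(phi_s)               # dominance function branch
        # centring the dominance branch makes V and AF identifiable
        return v + af - af.mean(dim=1, keepdim=True)

q_values = DuelingHead()(torch.zeros(1, 64))     # one Q value per action
```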
As shown in fig. 3: in one embodiment, the step S3 includes: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method to obtain time T spent on scheduling based on the action A, storing the time T in a set, and respectively evaluating the optimal actions when the time T is the same and different;
step S31: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method;
To let the agent explore the road network more effectively, the ε-greedy method is used to select action A. In practical implementation, with probability 1-ε the action is determined by the Q function; ε is typically set to a small value and decreases over time, i.e., exploration gradually declines.
The exploration rate ε is set to 0.1, i.e., with 90% probability the action is determined by the Q function and with 10% probability it is chosen at random. In implementation ε decreases over time: at the beginning, because it is unclear which action is better, considerable effort is spent on exploration; then, as the number of training iterations grows and the optimal Q value is obtained, exploration is reduced, the value of ε is lowered, and actions are determined by the Q function, so the agent can be guided unambiguously through the road network.
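For illustration only, the ε-greedy selection with a decaying ε can be sketched as follows; the decay schedule and its floor are assumptions, while the starting value of 0.1 follows the text.

```python
# Illustrative sketch of the epsilon-greedy selection of step S31.
import random

def epsilon_greedy(q_values, epsilon):
    """With probability 1-epsilon pick argmax Q, otherwise a random action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

epsilon = 0.1
for step in range(1000):
    action = epsilon_greedy([0.2, 0.5, 0.1, 0.4, 0.3], epsilon)  # dummy Q values
    epsilon = max(0.01, epsilon * 0.995)        # exploration decreases over time
```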
Step S32: based on step S31, obtaining the time T spent on scheduling under action A and storing it in a set;
Scheduling is triggered when a new task arrives or an AGV completes its job. The system state module then processes the complex real-time information and extracts the key state information, which is divided mainly into task information and AGV state information. The state information is sent to a work module that forms the computational core of the system. The Q network module continually trains on and learns from the input states using the Dueling DQN algorithm and outputs the result to the combined action module. Finally, the selected rule and AGV are fed back to the dispatching system as the command that guides AGV dispatch in real time.
Various scheduling rules are set so that the AGV completes its task in the shortest possible cycle: selecting tasks by arrival order, by shortest travel distance, by earliest deadline, by longest waiting time, and by which AGV is closest to the loading point. The rules are given different weight coefficients, the time T spent on scheduling under each rule is finally obtained, and all the valid data are stored in a set.
Step S33: based on step S32, judging the times T spent on AGV scheduling in the set, and when the times T differ, selecting the scheduling strategy action A corresponding to the smallest time T as the optimal action;
Based on step S32, the times T in the set are examined. When the times T in the set are not all equal, the scheduling strategy corresponding to the smallest time T is taken as the optimal scheduling strategy action, i.e., the scheduling action the AGV executes next. A unique minimum time T indicates that its corresponding action A is the most suitable next action and can be executed directly without consulting the Q value; the set is then refreshed.
Step S34: based on step S32, judging the times T spent on AGV scheduling in the set, and when the times T are the same, selecting the action with the maximum action evaluation value Q as the optimal action;
When several equal minimum times T exist, the action of the scheduling strategy A with the maximum action evaluation value Q is selected as the optimal scheduling strategy action.
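For illustration only, the decision logic of steps S32-S34 can be sketched as below; the rule names, times and Q values are placeholder data.

```python
# Illustrative sketch: pick the smallest scheduling time T, break ties by the largest Q.
def choose_schedule(candidates):
    """candidates: list of (action, time_T, q_value) tuples for the current state."""
    t_min = min(t for _, t, _ in candidates)
    best = [c for c in candidates if c[1] == t_min]
    if len(best) == 1:                            # step S33: a unique minimum time T
        return best[0][0]
    return max(best, key=lambda c: c[2])[0]       # step S34: tie-break by the maximum Q

candidates = [("arrival_order", 12.0, 0.41),
              ("shortest_travel", 9.5, 0.58),
              ("longest_wait", 9.5, 0.62)]
print(choose_schedule(candidates))                # -> "longest_wait"
```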
As shown in fig. 4, the road network load balancing flow chart: in one embodiment, the step S4 includes: designing a reward and punishment function, combining the path length and the road network load with it to balance the road network load; executing the current action A in state S to obtain the feature vector φ(S') corresponding to the new state S', the reward R and the termination flag end, and storing the five-tuple {φ(S), A, R, φ(S'), end} in the experience playback set D;
Step S41: designing a reward and punishment mechanism function, combining path length and road network load with the reward and punishment function, and taking load factors into account in the reward and punishment function during the deep reinforcement learning iterations; the reward and punishment function is set as shown in the following formula:
r_t = -(δ·d_t(x) + η·Load(x))
In the above equation for r_t, δ ∈ (0, 1) is the path length coefficient, d_t(x) is the sum of the paths travelled by the AGV, d_t(x) = d_1 + d_2 + ··· + d_t, t ∈ (1, n); η ∈ (0, 1) is the load coefficient, and Load(x) is the number of vehicles passing the current node in the road network. When δ equals 0, the function considers only the load factor as the penalty value; when η equals 0, it considers only the path length as the penalty value; when both are 0, load balancing is not considered at all. Setting both path length and load in the reward and punishment function means that, while running in the road network, an AGV selects its optimal path according to the load of each subarea and the length of the route, avoiding local high-load areas and finally optimizing the whole road network.
The reward R is generated after the AGV interacts with the environment, produces action A, and the environment changes from state S to the new state S'. The accumulated reward obtained after the AGV executes a sequence of actions is generally used to judge how good the strategy is: the larger the accumulated reward, the better the strategy. The sum of the accumulated state rewards is given by the following formula:
G_t = r_t+1 + γ·r_t+2 + γ²·r_t+3 + ··· = Σ_k γ^k·r_t+k+1
In the above formula, r_t+1 is the reward fed back by the environment after the AGV selects and performs its action at time t+1. γ ∈ (0, 1) is the discount factor: when γ equals 0 the AGV considers only the return of the next step, and as γ approaches 1 future rewards are taken more into account. Sometimes the immediate reward matters more and sometimes the future reward does; here the value of γ is set to 0.9.
After the reward and punishment function is improved, the number of AGV runs is set to a task quantity of 100u (u ∈ N+). The path length of each AGV run in the road network is recorded, and the road network load at every point of the road network map is extracted. The essence of AGV load balancing is that, on the basis of the Dueling DQN algorithm, the load factor is counted into the actual path cost, i.e., the running distance is combined with the road network load, so the setting of the reward and punishment function directly affects how efficiently the AGVs run in the road network.
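For illustration only, the combined penalty could be computed as in the sketch below; the coefficient values δ = 0.6 and η = 0.4 are assumptions for the example, and the function simply combines the accumulated path length with the road network load as described above.

```python
# Illustrative sketch of the reward and punishment function of step S41.
def reward(path_lengths, node_load, delta=0.6, eta=0.4):
    """Penalty grows with the accumulated path length d_t(x) and the load Load(x)
    of the node the AGV is about to enter."""
    d_t = sum(path_lengths)            # d_t(x) = d_1 + d_2 + ... + d_t
    return -(delta * d_t + eta * node_load)

# An AGV that has driven segments of 3, 2 and 4 units and heads into a node
# currently used by 5 vehicles receives:
print(reward([3, 2, 4], node_load=5))  # -> -7.4
```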
Step S42: after the reward and punishment functions are designed, a Dueling DQN deep reinforcement learning algorithm is used for enabling a plurality of AGVs to interact with the road network environment, a high-load area in the road network is optimized, and the overall passing efficiency of the intelligent AGVs is improved;
After the design of the reward and punishment function is completed, the number of AGV runs is set to a task quantity of 100t (t ∈ N+). The path length of each AGV run in the road network is recorded, and the road network load at every point of the map is extracted in the dispatching system. The essence of AGV load balancing is to count the load factor into the actual path cost, i.e., to combine the running distance with the road network load, so the setting of the reward and punishment function directly affects how efficiently the AGVs run in the road network.
In the AGV load balancing experiment in the road network, the start and end points of the 100t AGV tasks are kept unchanged. The road network load before and after load balancing is compared, and the shortest vehicle dispatching time is measured and calculated before and after balancing. The values of δ and η are tested repeatedly in the experiment to obtain the optimal values; the case in which δ and η are both 0 corresponds to the situation before load balancing. First, one AGV runs in the road network and the load data of the current run are updated; the following AGVs use the same method until all AGVs have updated the road network load.
Step S43: executing the current action A in state S to obtain the feature vector φ(S') corresponding to the new state S', the reward R and the termination flag end, and storing the five-tuple {φ(S), A, R, φ(S'), end} in the experience playback set D;
An initial state S is selected at random, and an action A is chosen from all actions available in the current state S to obtain the next state S'. Every action is executed by the intelligent AGV, the system scores the action and the AGV obtains a reward value; when the target task is finally achieved, the rewards from every step are added together to give the final target reward. Based on the state S', the model builds a training module in the Q network and feeds it as production data into the next step; the model performs real-time adaptive scheduling and selects a number of scheduling strategy actions A' for the production jobs. These scheduling strategy actions A' are the optimal actions selected from the real-time state, so that the AGV follows an optimal route during scheduling.
Combined with the designed reward and punishment function, the state cost function and the dominance function are modelled separately: in some situations the AGV attends only to the value of the state and not to the differences caused by different actions, and modelling the two separately lets the agent handle states that are only weakly related to the action. When there is no vehicle ahead of a running AGV, the available actions differ little and the AGV is more concerned with the state value; when there is a vehicle ahead (the agent needs to overtake), the agent begins to attend to the differences between the dominance values of different actions. Action A is then executed in the current state S, the feature vector φ(S') corresponding to the new state S', the reward R and the termination flag end are obtained in turn, and these values are finally stored in the set D.
In this step each AGV interacts with the environment through the agent by generating a series of actions, obtains the state and reward of the next moment after acting in the current state, and stores them in the memory bank D; after a certain number of steps have accumulated, a number of memories are drawn at random from D as samples for learning. When the AGV finally reaches the end point through repeated iteration, it is given an end-point reward according to the configured reward and punishment function; before it reaches the end point, every step carries a single-step penalty, and there are also multi-AGV position penalties and trap penalties. These penalty values are set so that the AGV makes as few mistakes as possible, reaches the end point as quickly as possible, and obtains the maximum reward. The discount factor γ relates to the time horizon and is set so that the AGV obtains the maximum reward as soon as possible. Actions are selected with the ε-greedy method; the initial value of ε generally cannot be 0, since ε would then never be counted into the final average, and ε usually decreases gradually over time.
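For illustration only, the penalty structure just described (single-step penalty, multi-AGV position penalty, trap penalty and end-point reward) might be sketched as follows; the concrete numerical values are assumptions, as the patent does not state them.

```python
# Illustrative sketch of the per-step reward shaping described above.
def step_reward(reached_goal, hit_other_agv, hit_trap):
    r = -0.1                      # single-step penalty on every move
    if hit_other_agv:
        r += -1.0                 # multi-AGV position penalty
    if hit_trap:
        r += -1.0                 # trap penalty
    if reached_goal:
        r += 10.0                 # end-point reward
    return r

print(step_reward(False, False, False), step_reward(True, False, False))  # -0.1 9.9
```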
As shown in fig. 5, the task scheduling structure based on the Dueling DQN algorithm: in one embodiment, the step S5 includes: sampling m samples from the experience playback set D to calculate the current target Q value y_j, and updating all parameters w of the Q network by gradient back propagation of the neural network using a mean square error loss function;
Step S51: sampling m samples from the experience playback set D to calculate the current target Q value y_j;
Q(S, A; w, α, β) is calculated and the resulting Q matrix is checked for convergence; if it has not converged, the calculation is repeated, and if it has, the next step proceeds and the learning of the target Q value is finished. Here s1, s2, s3, s4 denote the states of the intelligent AGV at successive times, and a1, a2, a3, a4 denote the corresponding actions generated in each state. Finally it is judged whether the obtained Q value is the choice under the optimal action, which completes the calculation of the target Q value.
Generally, when the Q value is estimated with a convolutional neural network it is overestimated. Writing the optimal state cost function as
V*(s) = max_a' Q*(s, a')
the overestimation of the Q value can be analysed quantitatively: assume Q_w-(s, a) - V* obeys a uniform, independent and identical distribution on [-1, 1] and let the action space size be h; then for any state s:
E[max_a Q_w-(s, a) - V*(s)] = (h-1)/(h+1)
The estimation error is denoted σ_a = Q_w-(s, a) - max_a' Q*(s, a'). Since the estimation errors for different actions are independent, there is:
P(max_a σ_a ≤ x) = Π_a P(σ_a ≤ x)
P(σ_a ≤ x) is the cumulative distribution function of σ_a, which can be written explicitly as:
P(σ_a ≤ x) = 0 for x ≤ -1, (1+x)/2 for x ∈ (-1, 1), and 1 for x ≥ 1
Thus the cumulative distribution function of max_a σ_a is found to be:
P(max_a σ_a ≤ x) = Π_a P(σ_a ≤ x) = ((1+x)/2)^h for x ∈ (-1, 1)
and the final deformation gives:
E[max_a σ_a] = (h-1)/(h+1)
In an actual road network environment, when AGV scheduling based on the Dueling DQN algorithm calculates the current target Q value, the overestimation grows with the size of the action space h, so in environments with many action choices the Q value suffers from overestimation. For this reason the choice of the action space is given particular attention, so that when estimating Q the model can still find the Q value that makes the AGV scheduling time shortest.
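For reference only, the expectation quoted above can be checked by the short calculation below; this derivation is an added sketch and not part of the original text.

```latex
% E[max_a sigma_a] for sigma_a i.i.d. uniform on [-1, 1], with the substitution u = (1+x)/2
\mathbb{E}\!\left[\max_a \sigma_a\right]
  = \int_{-1}^{1} x \,\frac{d}{dx}\!\left(\frac{1+x}{2}\right)^{\!h} dx
  = \int_{0}^{1} (2u-1)\, h\, u^{h-1}\, du
  = \frac{2h}{h+1} - 1
  = \frac{h-1}{h+1}.
```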
Step S52: updating all parameters w of the Q network through gradient back propagation of the neural network by using a mean square error loss function;
Training and updating proceed from the designed state function and advantage function according to the following mean square error loss, computed over the m sampled transitions:
L(w) = (1/m) Σ_j (y_j - Q(φ(S_j), A_j; w, α, β))²
wherein
y_j = r + γ·max_a' Q(s', a'; w⁻)
In the gradient calculation, in order to drive the dominance function of the neural network to zero, the estimated dominance function and the controlled dominance function are combined to obtain the zero equation:
AF(S, a*; w, β) = 0, where a* = argmax_a'∈Λ Q(S, a'; w, α, β)
At this point, for the policy evaluation case the expression for the Q value is as given by the formula of step S23, while for the optimal control case it is improved as follows:
Q(S, A; w, α, β) = V(S; w, α) + (AF(S, A; w, β) - max_a'∈Λ AF(S, a'; w, β))
After this improvement, V(S; w, α) accurately learns the optimal action, which in turn ensures that the dominance function AF(S, a'; w, β) is more accurate.
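For illustration only, the update of steps S51-S52 can be sketched in PyTorch as below; the optimizer handling, batch layout and γ = 0.9 are assumptions, while the target y_j = r + γ·max_a' Q(s', a'; w⁻) and the mean square error loss follow the description.

```python
# Illustrative sketch of one Q-network update with a target network (parameters w-).
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, batch, gamma=0.9):
    phi_s, actions, rewards, phi_s_next, end = batch   # m sampled five-tuples; end is 1.0 if terminal
    q = q_net(phi_s).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(phi(S), A; w)
    with torch.no_grad():
        max_next = target_net(phi_s_next).max(dim=1).values       # max_a' Q(S', a'; w-)
        y = rewards + gamma * max_next * (1.0 - end)               # target Q value y_j
    loss = F.mse_loss(q, y)                  # mean square error loss function
    optimizer.zero_grad()
    loss.backward()                          # gradient back propagation
    optimizer.step()                         # update all parameters w of the Q network
    return loss.item()
```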
In one embodiment, the step S6, in which, if S' is in the end state, steps S2 to S5 are repeated until the task is completed, includes:
when step S6 is reached, if the state S' is a termination state, or if the loss function is still too large for the model to converge, steps S2-S5 continue to be repeated; otherwise the run is complete, and finally the shortest scheduling times under the different requirements are compared to obtain the optimal strategy.
The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (3)

1. A multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm is characterized by comprising the following steps:
Step S1: collecting intelligent warehouse workshop operation data, preprocessing, constructing a Markov decision model, randomly initializing all states and values Q corresponding to actions, initializing all network parameters, emptying an experience playback set D, and completing data state modeling; randomly extracting data after state modeling, initializing a state S as a first state of a current state sequence, and acquiring a characteristic vector phi(S) of the data;
Step S2: using phi(S) as input in a Q network, adding two sub-network structures in front of an output layer of a neural network, and obtaining an output action cost function Q by linearly adding a state-based cost function V and a state-based dominance function AF;
Step S3: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method to obtain time T spent on scheduling based on the action A, storing the time T in a set, and respectively evaluating the optimal actions when the time T is the same and different; the step S3 includes:
step S31: selecting a corresponding action A in the current Q value output by using an epsilon-greedy method;
Step S32: based on step S31, time T spent for scheduling by action a is obtained and stored in a collection;
When a new task arrives or the AGV completes work, triggering scheduling; then, after processing the complex real-time information, the system state module extracts key state information which is divided into task information and AGV state information; the status information is sent to a work module, which constitutes the computational core of the system; the Q network module carries out continuous training and learning on the input state by using Dueling DQN algorithm and outputs the result to the combined action module; finally, feeding back the selected rule and the AGV to a dispatching system to be used as a command for guiding the AGV to dispatch in real time;
Step S33: based on the step S32, judging the time T spent by AGV scheduling in the set, and when the time T is different, selecting the corresponding scheduling strategy action A as the optimal action when the time T is the least;
step S34: based on step S32, determining the time T spent by the AGV scheduling in the set, and when the times T are the same, selecting the action with the maximum action evaluation value Q as the optimal action;
Step S4: designing a reward and punishment function, combining the path length and the road network load with the reward and punishment function to balance the road network load, executing the current action A in the state S to obtain a characteristic vector phi(S') corresponding to the new state S', a reward R and whether to terminate the state end, and storing the {phi(S), A, R, phi(S'), end} five-tuple into an experience playback set D; the step S4 includes:
Step S41: designing a reward and punishment mechanism function, combining path length and road network load with the reward and punishment function, considering load factors in the reward and punishment function in the deep reinforcement learning iterative process, and setting the function as follows:
r_t = -(δ·d_t(x) + η·Load(x))
in the above formula r_t, δ ∈ (0, 1) is a path length coefficient, d_t(x) is the sum of paths travelled by the AGV, d_t(x) = d_1 + d_2 + ··· + d_t, t ∈ (1, n), η ∈ (0, 1) is a load coefficient, and Load(x) is the number of vehicles passing by the current node in the road network;
Step S42: after the reward and punishment functions are designed, a Dueling DQN deep reinforcement learning algorithm is used for enabling a plurality of AGVs to interact with the road network environment, a high-load area in the road network is optimized, and the overall passing efficiency of the intelligent AGVs is improved;
step S43: executing the current action A in the state S to obtain a feature vector phi(S'), a reward R and whether to terminate the state end corresponding to the new state S', and storing the {phi(S), A, R, phi(S'), end} five-tuple into an experience playback set D;
Randomly selecting an initial state S, selecting an action A from all possible actions in the current state S to obtain a next state S', wherein each step of action is executed by the intelligent AGV, the system can score the action, the AGV can obtain a reward value, and after a target task is finally achieved, the rewards obtained in each step are added to obtain the target reward finally; based on the state S ', a training module is built in a Q network by the model and is used as production data to be input into the next step, the model can carry out real-time self-adaptive scheduling, a plurality of scheduling strategy actions A ' of production jobs are selected, and the scheduling strategy actions A ' are optimal actions selected based on the real-time state, so that an AGV can walk out of an optimal route in the scheduling process;
Step S5: sampling m samples from the experience playback set D to calculate a current target Q value y_j, and updating all parameters w of the Q network by gradient back propagation of the neural network by using a mean square error loss function; the step S5 includes:
Step S51: sampling m samples in the experience playback set D to calculate a current target Q value y_j;
the state-cost function is deformed as shown in the following formula:
V*(s) = max_a' Q*(s, a')
quantitative analysis of the overestimation of the Q value: Q_w-(s, a) - V* obeys a uniform independent identical distribution on [-1, 1] and the action space size is set as h; then for any state s:
E[max_a Q_w-(s, a) - V*(s)] = (h-1)/(h+1)
the estimation error is noted as σ_a = Q_w-(s, a) - max_a' Q*(s, a'); since the estimation error is independent for different actions, there is:
P(max_a σ_a ≤ x) = Π_a P(σ_a ≤ x)
P(σ_a ≤ x) is the cumulative distribution function of σ_a, which is written specifically as:
P(σ_a ≤ x) = 0 for x ≤ -1, (1+x)/2 for x ∈ (-1, 1), and 1 for x ≥ 1
thus, the cumulative distribution function of max_a σ_a is found as follows:
P(max_a σ_a ≤ x) = Π_a P(σ_a ≤ x) = ((1+x)/2)^h for x ∈ (-1, 1)
and finally, deforming to obtain:
E[max_a σ_a] = (h-1)/(h+1);
In an actual road network environment, when AGV scheduling based on Dueling DQN algorithm calculates a current target Q value, the Q value can be increased along with the increase of the action space h, and in an environment with more action selection numbers, the Q value can have the problem of overestimation; based on this, the selection of the action space is calculated to ensure that the model finds the Q value that minimizes the AGV schedule time when estimating the Q value;
step S52: updating all network parameters w of the Q network by gradient back propagation of the neural network by using a mean square error loss function;
training and updating according to the designed state function and the designed advantage function and the following mean square error loss over the m sampled transitions:
L(w) = (1/m) Σ_j (y_j - Q(phi(S_j), A_j; w, α, β))²
wherein
y_j = r + γ·max_a' Q(s', a'; w⁻)
in the gradient calculation process, in order to make the dominance function of the neural network zero, the estimated dominance function and the controlled dominance function are used to obtain a zero equation:
AF(S, a*; w, β) = 0, where a* = argmax_a'∈Λ Q(S, a'; w, α, β)
at this time, for the case of policy evaluation, the expression of the Q value is as shown in the formula of step S23, and for the case of optimal control, it is improved as follows:
Q(S, A; w, α, β) = V(S; w, α) + (AF(S, A; w, β) - max_a'∈Λ AF(S, a'; w, β))
after the improvement is finished, V(S; w, α) accurately learns the optimal action, so that the advantage function AF(S, a'; w, β) is ensured to be more accurate;
Step S6: if the state S' is a termination state or the loss function is too large for the model to converge, the steps S2 to S5 are continuously repeated; otherwise the operation is completed, and finally the shortest scheduling times under different requirements are respectively compared to obtain the optimal strategy.
2. The method for balancing loads and scheduling tasks of multiple AGVs based on Dueling DQN algorithm as claimed in claim 1, wherein said step S1 includes:
step S11: preprocessing and classifying the acquired AGV data, defining a strategy pi based on an established Markov decision model and formulating a cost function;
Establishing a Markov decision model, wherein the definition of the Markov decision process is <S, A, P, R, γ>, the state S represents the set of all states s, the action A represents the set of all actions a involved in the decision process, P represents the state-transition probability conditioned on selecting action a in state s, and R represents the accumulated return; the final goal is to achieve reward maximization; the transition probability matrix and the reward function are defined by the following formulas:
P_ss'^a = P(S_t+1 = s' | S_t = s, A_t = a)
R_s^a = E[R_t+1 | S_t = s, A_t = a]
After defining the transition probability matrix and the reward function, the policy π is defined as follows:
π(a|s)=P(At=a|St=s)
The state value function and the action value function are defined as shown in the following equations:
vπ(s) = Eπ[Gt|St=s]
qπ(s,a) = Eπ[Gt|St=s, At=a]
Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
vπ(s) is the expected return in state s, representing the value brought by the state; qπ(s,a) is the expected return after taking action a in state s, representing the value brought by the action; Gt is the sum of accumulated discounted rewards generated by the agent while interacting with the environment;
Step S12: completing the state modeling of the data and solving the Bellman equation by value iteration;
The collected production operation data are classified into the state of target object 1, the state of target object 2, the state of target object 3, ……, and the state of target object i; the policy-based value function is then transformed using the Bellman equation, and the equation is solved by value iteration to obtain the optimal solution;
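As a minimal sketch of solving the Bellman optimality equation by value iteration (a generic tabular example with made-up transition and reward arrays, not the patent's road-network model):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[a, s, s'] = transition probability, R[s, a] = expected reward; returns V*."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# toy 2-state, 2-action MDP (numbers are illustrative only)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(P, R))
```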
Step S13: randomly initializing all network parameters w of the Q network, initializing the values Q corresponding to all states and actions based on w, and emptying the experience replay set D;
Step S14: initializing the state S as the first state of the current state sequence and obtaining its feature vector φ(S).
3. The method for balancing loads and scheduling tasks of multiple AGVs based on the Dueling DQN algorithm according to claim 1, wherein the step S2 comprises:
Step S21: building a convolutional neural network model, where the input layer takes φ(s) as input for deep-learning training;
The acquired road network data are imported, and the model is then defined. The model has two convolution layers with a pooling layer, three Dropout layers, a flattening layer, and two fully connected layers; the ReLU activation function is used in the convolution layers, the dropout rate of the first and second Dropout layers is 0.25 and that of the third Dropout layer is 0.5, the first fully connected layer uses the ReLU activation function, and the second fully connected layer uses the softmax activation function; Kernel_size denotes the convolution kernel size, padding denotes zero padding to the same size, and strides denotes the convolution stride; when compiling the model, the loss function is specified as loss and the evaluation metric is accuracy; when training the model, the validation set is 20% of the training set, the number of training epochs is 30, and the batch size is 128. The formula of the convolutional neural network is shown as follows:
Y=f(w*x+b)
wherein Y represents the output value, f the activation function, w the network parameters, x the input value, and b the bias;
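A minimal Keras sketch of a model with this layout (the number of filters, layer sizes, input shape, exact layer ordering, optimizer, and the categorical cross-entropy loss are assumptions for illustration; only the layer types, dropout rates, activations, validation split, epochs, and batch size follow the description above):

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(32, 32, 1), n_classes=10):
    model = models.Sequential([
        layers.Conv2D(32, kernel_size=3, padding="same", activation="relu",
                      input_shape=input_shape),           # first convolution layer
        layers.Dropout(0.25),                              # first Dropout layer, rate 0.25
        layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),                  # pooling layer
        layers.Dropout(0.25),                              # second Dropout layer, rate 0.25
        layers.Flatten(),                                  # flattening layer
        layers.Dense(128, activation="relu"),              # first fully connected layer
        layers.Dropout(0.5),                               # third Dropout layer, rate 0.5
        layers.Dense(n_classes, activation="softmax"),     # second fully connected layer
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",         # assumed loss function
                  metrics=["accuracy"])
    return model

# training call matching the described settings:
# model.fit(x_train, y_train, validation_split=0.2, epochs=30, batch_size=128)
```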
Step S22: establishing two sub-network structures: the value function V and the advantage function AF;
The Dueling DQN algorithm divides the Q network into two parts: the first part depends only on the state S and is independent of the specific action A to be taken, and is called the value function part, denoted V(S; w, α); the second part depends on both the state S and the action A, and is called the advantage function part, denoted AF(S, A; w, β); the final action value function is then expressed as follows:
Q(S,A;w,α,β)=V(S;w,α)+AF(S,A;w,β)
where w denotes the shared network parameters of the Q network, α the parameters unique to the value function branch, and β the parameters unique to the advantage function branch; the Dueling DQN algorithm splits the abstract features extracted by the convolution layers into two branches: one branch represents the state value function V, i.e., the value of the static state environment, while the other branch represents the state-dependent action advantage function AF, i.e., the additional value brought by selecting a particular action; finally the two branches are aggregated to obtain the Q value of each action, so that the AGV adapts better to different environments;
Step S23: after the training data pass through the fully connected layer, the two sub-network structures added before the output layer, the value function V and the advantage function AF, are combined by linear addition and output;
After the convolution and pooling operations, the resulting feature maps are flattened row by row and concatenated into vectors, which are fed into the fully connected network; the losses on the training set and on the test set are computed separately to evaluate the model, and training is performed by gradient descent with back propagation;
In the output process an action value function is obtained, but from the final output the equation cannot uniquely identify V(S; w, α) and AF(S, A; w, β); to restore identifiability, the advantage function part is centred, and the combination formula actually used is shown as follows:
Q(S,A;w,α,β)=V(S;w,α)+(AF(S,A;w,β)−(1/|Λ|)Σa′∈ΛAF(S,a′;w,β))
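As an illustrative sketch of this two-stream aggregation (the shared features and the weights of the two streams are random placeholders; both the max-subtraction form given in step S52 and the mean-centred form actually used here are shown):

```python
import numpy as np

def dueling_q(features, W_v, b_v, W_a, b_a, mode="mean"):
    """Combine a value stream V and an advantage stream AF into Q values.

    features: shared features from the convolution layers, shape (batch, d)
    mode: "mean" subtracts the mean advantage (centred form actually used),
          "max" subtracts the maximum advantage (optimal-control form).
    """
    V = features @ W_v + b_v                 # shape (batch, 1)
    AF = features @ W_a + b_a                # shape (batch, n_actions)
    if mode == "max":
        return V + (AF - AF.max(axis=1, keepdims=True))
    return V + (AF - AF.mean(axis=1, keepdims=True))

# toy example: batch of 2 feature vectors, 3 actions (weights are random placeholders)
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4))
W_v, b_v = rng.normal(size=(4, 1)), np.zeros(1)
W_a, b_a = rng.normal(size=(4, 3)), np.zeros(3)
print(dueling_q(features, W_v, b_v, W_a, b_a, mode="mean"))
```

Subtracting a per-state constant (mean or maximum of the advantages) leaves the greedy action unchanged, and the mean-centred form tends to make optimization more stable, which is why it is the combination actually used.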
CN202311805708.XA 2023-12-26 2023-12-26 Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method Active CN117474295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311805708.XA CN117474295B (en) 2023-12-26 2023-12-26 Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method


Publications (2)

Publication Number Publication Date
CN117474295A CN117474295A (en) 2024-01-30
CN117474295B true CN117474295B (en) 2024-04-26

Family

ID=89625958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311805708.XA Active CN117474295B (en) 2023-12-26 2023-12-26 Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method

Country Status (1)

Country Link
CN (1) CN117474295B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952397B (en) * 2024-03-21 2024-05-31 深圳市快金数据技术服务有限公司 Logistics order analysis method, device, equipment and storage medium
CN117973635B (en) * 2024-03-28 2024-06-07 中科先进(深圳)集成技术有限公司 Decision prediction method, electronic device, and computer-readable storage medium

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109725988A (en) * 2017-10-30 2019-05-07 北京京东尚科信息技术有限公司 Task scheduling method and device
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 Humanized autonomous-driving car-following model based on deep reinforcement learning
KR20200095590A (en) * 2019-01-21 2020-08-11 한양대학교 산학협력단 Method and Apparatus for Controlling of Autonomous Vehicle using Deep Reinforcement Learning and Driver Assistance System
CN112149347A (en) * 2020-09-16 2020-12-29 北京交通大学 Power distribution network load transfer method based on deep reinforcement learning
CN112423400A (en) * 2020-11-20 2021-02-26 长春工业大学 Ethernet communication link scheduling method based on improved firework algorithm
CN113792924A (en) * 2021-09-16 2021-12-14 郑州轻工业大学 Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
CN114025330A (en) * 2022-01-07 2022-02-08 北京航空航天大学 Air-ground cooperative self-organizing network data transmission method
CN114489059A (en) * 2022-01-13 2022-05-13 沈阳建筑大学 Mobile robot path planning method based on D3QN-PER
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN
CN115047878A (en) * 2022-06-13 2022-09-13 常州大学 DM-DQN-based mobile robot path planning method
CN115239072A (en) * 2022-06-23 2022-10-25 国网河北省电力有限公司保定供电分公司 Load transfer method and device based on graph convolution neural network and reinforcement learning
CN115333143A (en) * 2022-07-08 2022-11-11 国网黑龙江省电力有限公司大庆供电公司 Deep learning multi-agent micro-grid cooperative control method based on double neural networks
CN116406004A (en) * 2023-04-06 2023-07-07 中国科学院计算技术研究所 Construction method and resource management method of wireless network resource allocation system
CN116448117A (en) * 2023-04-18 2023-07-18 安徽大学 Path planning method integrating deep neural network and reinforcement learning method
CN116527567A (en) * 2023-06-30 2023-08-01 南京信息工程大学 Intelligent network path optimization method and system based on deep reinforcement learning
CN116562332A (en) * 2023-07-10 2023-08-08 长春工业大学 Robot social movement planning method in man-machine co-fusion environment
CN116644902A (en) * 2023-04-21 2023-08-25 浙江工业大学 Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning
CN116755409A (en) * 2023-07-04 2023-09-15 中国矿业大学 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm
CN116797116A (en) * 2023-06-15 2023-09-22 长春工业大学 Reinforced learning road network load balancing scheduling method based on improved reward and punishment mechanism
CN116859731A (en) * 2023-07-03 2023-10-10 中车长春轨道客车股份有限公司 Method for enhancing punctuality of high-speed rail automatic driving control system based on reinforcement learning
CN117009876A (en) * 2023-10-07 2023-11-07 长春光华学院 Motion state quantity evaluation method based on artificial intelligence
CN117010482A (en) * 2023-07-06 2023-11-07 三峡大学 Strategy method based on double experience pool priority sampling and DuelingDQN implementation
CN117116064A (en) * 2023-07-04 2023-11-24 华北水利水电大学 Passenger delay minimization signal control method based on deep reinforcement learning
CN117194057A (en) * 2023-11-08 2023-12-08 贵州大学 Resource scheduling method for optimizing edge energy consumption and load based on reinforcement learning
CN117240636A (en) * 2022-11-15 2023-12-15 北京邮电大学 Data center network energy saving method and system based on reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210141355A1 (en) * 2019-11-07 2021-05-13 Global Energy Interconnection Research Institute Co. Ltd Systems and methods of autonomous line flow control in electric power systems


Non-Patent Citations (15)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning Based Dynamic Route Planning for Minimizing Travel Time; Yuanzhe Geng; 2021 IEEE International Conference on Communications Workshops (ICC Workshops); 20210709; full text *
Deep reinforcement learning for dynamic scheduling of energy-efficient automated guided vehicles; Lixiang Zhang; Journal of Intelligent Manufacturing; 20231002; full text *
Dueling Network Architectures for Deep Reinforcement Learning; Ziyu Wang; https://arxiv.org/pdf/1511.06581v3.pdf; 20160415; full text *
Scheduling of decentralized robot services in cloud manufacturing with deep reinforcement learning; Yongkui Liu; Robotics and Computer-Integrated Manufacturing; 20220925; full text *
Driving intention recognition with a parameter-optimized support vector machine; 李慧; 李晓东; 宿晓曦; Research and Exploration in Laboratory; 20180215 (No. 02); full text *
吴夏铭. Research on path planning algorithms based on deep reinforcement learning; China Master's Theses, Engineering Science and Technology; full text *
Multi-UAV cooperative trajectory planning method based on Q-learning; 尹依伊; Acta Armamentarii; 20220613; full text *
A survey of deep reinforcement learning based on value functions and policy gradients; 刘建伟; 高峰; 罗雄麟; Chinese Journal of Computers; 20181022 (No. 06); full text *
Research on AGV simultaneous localization and mapping based on multi-sensor information fusion; 朱绪康; China Master's Theses, Information Science and Technology; 20210115; full text *
Research and implementation of multi-UAV path planning algorithms based on reinforcement learning; 于盛; China Master's Theses, Engineering Science and Technology II; 20210115; full text *
Research on DASH adaptive bitrate decision algorithms based on reinforcement learning; 冯苏柳; 姜秀华; Journal of Communication University of China (Science and Technology); 20200425 (No. 02); full text *
Path planning algorithm for unmanned surface vehicles based on deep Q-network; 随博文; 黄志坚; 姜宝祥; 郑欢; 温家一; Journal of Shanghai Maritime University; 20200930 (No. 03); full text *
AGV path planning based on deep reinforcement learning; 郭心德; Wanfang Data; 20211015; pages 12-13, 20-21, 27-31, 36-39, 44-45, 53-54 *
Deep reinforcement learning-based ***; Computer and Digital Engineering; 20200620 (No. 06); full text *
Research on workflow task scheduling based on discrete particle swarm optimization; 刘环宇; 侯秀萍; Computer Technology and Development; 20100510 (No. 05); full text *

Also Published As

Publication number Publication date
CN117474295A (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN117474295B (en) Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method
CN109492814B (en) Urban traffic flow prediction method, system and electronic equipment
Papakostas et al. Towards Hebbian learning of fuzzy cognitive maps in pattern classification problems
CN107909179B (en) Method for constructing prediction model of running condition of plug-in hybrid vehicle and vehicle energy management method
CN112596515B (en) Multi-logistics robot movement control method and device
CN111612243A (en) Traffic speed prediction method, system and storage medium
CN110014428B (en) Sequential logic task planning method based on reinforcement learning
CN112949828A (en) Graph convolution neural network traffic prediction method and system based on graph learning
CN103544496A (en) Method for recognizing robot scenes on basis of space and time information fusion
CN113362637B (en) Regional multi-field-point vacant parking space prediction method and system
CN113537365B (en) Information entropy dynamic weighting-based multi-task learning self-adaptive balancing method
CN114004383A (en) Training method of time series prediction model, time series prediction method and device
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
Zhang et al. A PSO-Fuzzy group decision-making support system in vehicle performance evaluation
Yang et al. Learning customer preferences and dynamic pricing for perishable products
Ming et al. Cooperative modular reinforcement learning for large discrete action space problem
CN112686693A (en) Method, system, equipment and storage medium for predicting marginal electricity price of electric power spot market
CN116797116A (en) Reinforced learning road network load balancing scheduling method based on improved reward and punishment mechanism
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
Elsayed et al. Deep reinforcement learning based actor-critic framework for decision-making actions in production scheduling
CN115599296A (en) Automatic node expansion method and system for distributed storage system
CN115238789A (en) Financial industry special data prediction method and system based on improved GRU
Paudel Learning for robot decision making under distribution shift: A survey
CN113807005A (en) Bearing residual life prediction method based on improved FPA-DBN
Pavón et al. A model for parameter setting based on Bayesian networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant