CN116644902A - Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning - Google Patents


Info

Publication number
CN116644902A
CN116644902A
Authority
CN
China
Prior art keywords
value, network, experience, reinforcement learning, deep reinforcement
Prior art date
Legal status
Pending
Application number
CN202310433751.1A
Other languages
Chinese (zh)
Inventor
郦仕云
周忠警
裴植
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority claimed from CN202310433751.1A
Publication of CN116644902A
Legal status: Pending

Links

Classifications

    • G06Q10/06316: Administration; Management; Operations research; Resource planning, allocation or scheduling for enterprises or organisations; Sequencing of tasks or work
    • G06N3/0464: Computing arrangements based on biological models; Neural networks; Convolutional networks [CNN, ConvNet]
    • G06N3/048: Neural networks; Activation functions
    • G06N3/084: Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06Q50/04: ICT specially adapted for implementation of business processes of specific business sectors; Manufacturing


Abstract

The invention discloses a multi-objective dynamic flexible job shop scheduling method concerning energy consumption based on deep reinforcement learning, which mainly comprises the following steps: (1) construct a high-level and a low-level deep reinforcement learning network, where the high-level network controls the decisions of the low-level network; (2) obtain the state features at the current scheduling moment, input them into the high-level deep reinforcement learning network for training and learning, take the obtained temporary optimization objective value as the reward value, and move to the next state features; (3) form an experience from the current state features, the optimization objective value and the next state features, sample a batch of experiences and feed them to the high-level network to optimize its parameters; (4) input the high-level optimization objective and the state features into the low-level network and optimize its parameters; (5) select a suitable workpiece and processing machine according to the obtained optimization objective value, so that the final scheduling optimization objectives meet the requirements.

Description

Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning
Technical Field
The invention belongs to the application of machine learning technology to shop scheduling, and particularly relates to a multi-objective dynamic flexible job shop scheduling method concerning energy consumption based on deep reinforcement learning.
Background
In factory machining, shop scheduling is an important part of production. The job shop scheduling problem has been studied extensively and has been proven to be NP-hard: given the processing information of every operation of every workpiece on the machines, the order in which the different operations are processed on the different machines must be determined. Flexible job shop scheduling is more complicated: besides the processing order of the operations on the machines, the machine on which each operation of each workpiece is processed must also be chosen. In actual factory production, unexpected events frequently occur and disturb the normal production rhythm, for example the urgent insertion of new workpieces, machine breakdowns, or changes of due dates. Shop scheduling with such events is called dynamic shop scheduling, and traditional static shop scheduling, which assumes no such events, may no longer suit the complex and changeable environment of a modern factory.
In single-objective shop scheduling, only the chosen objective has to reach its optimum, but this optimum is often achieved at the expense of other objectives; for example, the energy consumed by the machines may become excessive and pollute the environment. With green development being promoted today, the energy consumption of the machines during processing should be minimized as far as possible while the processing-time requirements are still met.
Shop scheduling problems are often solved with simple dispatching rules, such as preferring first-in-first-out workpieces or preferring the workpiece with the longest remaining processing time. Such simple rules only yield short-term benefits, cannot realize long-term benefits, and do not work well on multi-objective problems.
Meta-heuristic algorithms, such as genetic algorithms, simulated annealing and tabu search, are also frequently used for shop scheduling. They consider all unfinished workpieces together and search for an optimal solution, but their computational cost is huge and they are not applicable under dynamic scheduling.
Machine learning, which has matured considerably, is also applied to the shop scheduling problem. Reinforcement learning, as one field of machine learning, is used for this problem; the most common method is Q-learning, in which the agent builds a huge Q table from the states and the selected actions while interacting with the external environment. Each row of the table represents a state and each column an action, and the Q value obtained by taking a certain action in a certain state is stored in the table; at each scheduling moment, the action corresponding to the largest Q value in the current state is selected.
Disclosure of Invention
Aiming at the problems in the prior art, the invention addresses the dynamic multi-objective flexible job shop problem with a deep reinforcement learning method, which combines the perception ability of deep learning with the decision ability of reinforcement learning. At each scheduling moment, a suitable workpiece and machine are selected as the workpiece and machine to be processed according to the state features generated at that moment, so that the proposed optimization objectives are optimized.
To achieve the above object, the present invention provides a dynamic flexible job shop scheduling environment and a deep reinforcement learning method for solving it. At a scheduling time t, the agent of the high-level deep reinforcement learning network obtains the current state features s_t and inputs them into a trained DDQN (the high-level deep reinforcement learning network) to obtain the scheduling objective at time t; the scheduling objective and the state features are then input into a trained low-level D3QN (the low-level deep reinforcement learning network). Experiences in both the high-level and the low-level network are sampled by prioritized experience replay, and the parameters of each network are updated accordingly. The low-level agent then selects, according to the state features and the scheduling objective, a suitable workpiece and a suitable machine at that scheduling moment, and arranges the next operation of the selected workpiece on the selected machine.
In applying deep reinforcement learning, the invention uses Hierarchical Deep Reinforcement Learning (HDRL), a branch of deep reinforcement learning that can handle higher-dimensional environments and more complex tasks. HDRL consists of two networks, a high-level network and a low-level network. The agent decomposes a complex task into a series of simpler subtasks and then solves each subtask; this layered structure reduces the complexity of the task and improves learning efficiency. In high-level network learning, the agent selects the best learning strategy so that it can learn quickly when facing new problems. Low-level network learning mainly selects actions according to the state features so as to maximize the accumulated reward obtained from the external environment. For example, when an agent plays a new game, the high-level agent quickly learns the rules of the game and formulates corresponding strategies so as to obtain a good learning effect in a short time, while the low-level agent selects the best action according to the current state features in the game environment so as to reach a higher score. Combining the learning of the two levels for a complex task yields a stronger reinforcement learning agent.
Ordinary reinforcement learning performs well on simple problems, but for the more complex shop scheduling problem it records state features and actions in a Q table to obtain the corresponding reward values, and the huge amount of computation reduces the efficiency of the agent and worsens the learning effect. A deep learning method is therefore introduced: the computation is carried out inside a neural network, which avoids large amounts of data occupying memory and greatly improves computational efficiency. Effectively combining reinforcement learning with deep learning yields Deep Reinforcement Learning (DRL), which improves both the computational efficiency and the learning effect.
In ordinary reinforcement learning, when the agent obtains the state features s of a scheduling moment, it selects a suitable action a, computes a reward value Q(s, a) and records it in a Q table full of Q values; as the numbers of state features and actions keep growing, the amount of data and computation grows geometrically and the computational efficiency keeps decreasing. To solve this problem, a Deep Q Network (DQN) is introduced: the state features are fed to the network as neurons and the output Q value is computed inside the neural network, which requires no additional memory; the output Q(s, a) approximates the actual value f(s), i.e. f(s) ≈ Q(s, a). The neuron structure of the deep network is shown in fig. 3.
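The idea can be illustrated with a minimal sketch (assuming a PyTorch implementation; the class name and layer sizes here are illustrative, not the exact architecture claimed later):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected network mapping state features to one Q value per action."""
    def __init__(self, n_state_features: int, n_actions: int, hidden: int = 30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

# Example: 11 state features in, 4 candidate objectives out -> 4 Q values
q_net = QNetwork(n_state_features=11, n_actions=4)
q_values = q_net(torch.rand(1, 11))
```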
Reinforcement learning learns from interaction with the environment: according to the states perceived in the environment, different action strategies yield different rewards and punishments, and continuous learning maximizes the accumulated reward, which guides the agent to select the most suitable strategy. Reinforcement learning can be represented as a Markov decision process with states S, actions A, reward values R and a discount rate γ, as in fig. 2.
At time t, the agent in state S_t selects action A_t, obtains the reward value R_t, reaches the next state S_{t+1} and obtains the next reward R_{t+1}. At time t, the state S_t consists of state features s_t, the action A_t of actions a_t, the reward R_t of reward values r_t, and the next state S_{t+1} of features s_{t+1}. The generated experiences (s_t, a_t, r_t, s_{t+1}) are stored, and the agent derives a suitable strategy from this experience.
In the ordinary DQN algorithm, the agent selects an action a_t according to the state features s_t in the environment and obtains the reward value r_t, the next state s_{t+1} and a flag do indicating whether the episode has ended (do means ended, not-do means not ended); s_t, a_t, r_t, s_{t+1} and do constitute an experience (s_t, a_t, r_t, s_{t+1}, do), which is used in the computation. If each group of data were discarded after one use, the data would be wasted, so experience replay is adopted. First, an experience pool D with a fixed capacity N is built in the network to store the obtained (s_t, a_t, r_t, s_{t+1}, do) samples. Once the experience pool is full, the agent randomly selects a small batch of experiences from it, computes their reward values and uses them to update the weights θ of the DQN. In the computation the DQN uses two networks, an evaluation network Q and a target network Q̂. Their parameters and models are initially the same; as learning proceeds, only the weights θ of the evaluation network are updated at each step within a learning interval C, and the target network is left unchanged. Every C learning steps, the current weights of the evaluation network are copied to the target network; in all other steps the target network is not changed, so its Q values stay fixed during that period, which avoids parameter divergence and keeps the target-network values convergent and stable. The target Q value is computed from the target-network weights θ⁻ at each step, as shown in formula (1):

y_t = r_t + γ · max_a Q̂(s_{t+1}, a; θ⁻)   (1)
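A compact sketch of this mechanism, assuming PyTorch and illustrative class and function names, is given below; it shows a fixed-capacity experience pool and the periodic hard copy of the evaluation-network weights into the target network:

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Fixed-capacity experience pool D storing (s, a, r, s_next, done) tuples."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # uniform random mini-batch, as in the plain DQN described above
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def hard_update(target_net: torch.nn.Module, eval_net: torch.nn.Module):
    """Every C learning steps, copy the evaluation-network weights to the target network."""
    target_net.load_state_dict(eval_net.state_dict())
```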
Although the conventional DQN has shown good results, it still performs poorly in more complex situations and needs further improvement. When sampling experiences, purely random sampling may fail to pick up valuable experiences; when the target Q value is computed with the max operator, the target may be overestimated, which hurts learning efficiency; and in the output state-action Q values it may happen that in some states the Q value is almost independent of the action, so a single action value is inappropriate. The DQN therefore needs to be optimized with the corresponding improvements.
When the Q value is calculated with the max function, the target value is overestimated, and the error between the target value and the actual value grows with continued learning. To keep the estimated value and the actual value relatively close, the invention uses the Double DQN (DDQN) method for the high-level deep reinforcement learning network. The method constructs two action value functions: one is used to estimate the action, the other to estimate the value of that action; the evaluation network is used to determine the action, and the target network is used to evaluate its value. The Q value is calculated as in formula (2):

y_t = r_t + γ · Q̂(s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ⁻)   (2)
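A hedged sketch of this target computation (assuming PyTorch; the tensor shapes and the `done` convention are illustrative) could look as follows:

```python
import torch

@torch.no_grad()
def ddqn_target(r, s_next, done, gamma, eval_net, target_net):
    """Double-DQN target: the evaluation network chooses the action,
    the target network evaluates its value.
    r, done: float tensors of shape (batch,); s_next: (batch, n_features)."""
    best_action = eval_net(s_next).argmax(dim=1, keepdim=True)     # argmax_a Q(s', a; theta)
    next_q = target_net(s_next).gather(1, best_action).squeeze(1)  # Q_hat(s', a*; theta^-)
    return r + gamma * (1.0 - done) * next_q
```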
since the DDQN uses an empirical replay method, data is selected from the replay memory during sampling, so that different data have different errors, and the error is the difference δ between the target Q value of the target network and the estimated Q value of the evaluation network, which is called TD error. The larger the TD error, the greater the difference between the two Q values representing the samples, and the greater the effect on the back propagation. This error is defined using a mean square error loss function. The calculation formula of the mean square error loss function is shown in formula (3):
the network structure and loss function calculation are shown in fig. 4.
If experience samples are selected purely at random at every selection, the importance of each sample is ignored and the learning effect suffers. When an experience is selected and its estimated value is computed, there is an error δ between the estimate and the actual value: the larger |δ|, the larger the difference between the Q estimate and the actual Q value, i.e. the less accurate the estimate; the smaller |δ|, the closer the Q estimate is to the actual Q value. In actual learning, |δ| should be made as small as possible.
To keep this error small, the selected experiences should be as valuable as possible, so a Prioritized Experience Replay (PER) method is adopted for selection. First, the probability of each experience being selected is calculated as

P(i) = p_i^α / Σ_k p_k^α   (4)

where p is the priority of each experience, k runs over all stored experiences, and α is a hyper-parameter.
For the priority a proportion-based calculation method is used: p_i = |δ_i| + ρ, where ρ is a constant slightly greater than 0, ensuring that even when |δ_i| is 0 we still have p_i > 0 and P(i) ≠ 0, i.e. every experience always has a probability of being chosen.
To select experience values according to priority, the data are organized with a sum tree (SumTree). The sum tree is a tree structure of nodes and branches: each leaf stores the priority p of one experience, each internal node has exactly two children and stores the sum of the priorities p of its two children, and the branches extend downwards in the same way, so the value of the root node at the top of the whole tree is the sum of the priorities p of all leaves.
To draw experiences, the total priority is first divided by the mini-batch size (the number of experiences to be drawn) to obtain a number of sub-intervals. For example, as shown in fig. 5, if the p values of all leaves sum to a total priority of 54 and 6 experiences are to be drawn from the experience pool, the interval is divided equally into [0,9), [9,18), [18,27), [27,36), [36,45), [45,54], and one value is drawn at random from each sub-interval. Suppose the value 24 is drawn from [18,27); the search then starts from the topmost node of the tree and proceeds downwards with 24. The two children of the root 54 are 32 and 22; since 24 is smaller than the left child 32, the search continues along that branch, otherwise it would follow the other branch. At node 32, whose children are 14 and 18, the value 24 is larger than the left child 14, so the right branch 18 is chosen and the value 24 is updated to the difference with the left child 14, i.e. 10. The updated value 10 is compared with the children of node 18; the left child 14 is larger than 10, so that node is selected and the data stored at it are retrieved.
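A possible array-based implementation of such a sum tree, written as an illustrative Python sketch rather than the patent's own code, is shown below; `sample_batch` performs the equal-segment drawing described above:

```python
import random
import numpy as np

class SumTree:
    """Array-based sum tree: leaves hold priorities p_i, internal nodes hold the
    sum of their two children, so the root holds the total priority."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves
        self.data = [None] * capacity            # experiences stored alongside the leaves
        self.write = 0
        self.size = 0

    def total(self) -> float:
        return self.tree[0]

    def add(self, priority: float, experience) -> None:
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, leaf: int, priority: float) -> None:
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                          # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value: float):
        """Walk down from the root: go left if the value fits in the left sum,
        otherwise subtract the left sum and go right."""
        idx = 0
        while True:
            left, right = 2 * idx + 1, 2 * idx + 2
            if left >= len(self.tree):            # reached a leaf
                break
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

def sample_batch(tree: SumTree, batch_size: int):
    """Divide the total priority into equal segments and draw one value per segment."""
    segment = tree.total() / batch_size
    return [tree.get(random.uniform(i * segment, (i + 1) * segment))
            for i in range(batch_size)]
```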
Using prioritized experience replay may, however, change the distribution of the state features. The advantage of the experience pool in the DQN is that it breaks the correlation between data so that they are distributed as independently as possible; prioritized replay changes this distribution and introduces a bias. To eliminate this bias, the loss function of the neural network is optimized: on the basis of the original formula (3), the experience priority is taken into account and an importance weight ω_i is applied to eliminate the effect of the introduced bias:

L(θ) = (1/m) · Σ_{i=1}^{m} ω_i · ( y_i - Q(s_i, a_i; θ) )²   (5)

The main difference between the two loss functions is the introduction of the importance-sampling weight ω_i under experience priority:

ω_i = ( 1 / (N · P(i)) )^β   (6)

where N is the total number of experiences in the experience pool and β is a hyper-parameter indicating the degree to which the effect of PER on the convergence result is eliminated. When β = 0, importance sampling is not used; when β = 1, since the experiences are not selected at random but by priority, all ω_i and P(i) cancel each other out, the effect of prioritized experience replay is lost and it degenerates to ordinary experience replay.
For the stability of the algorithm, the importance weights ω_i are normalized, see formula (7):

ω_i ← ω_i / max_k ω_k   (7)

where max_k(ω_k) is the largest value among all importance weights.
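As an illustrative sketch (NumPy assumed), the importance-sampling correction and its normalization can be computed as follows; normalizing within the sampled batch is used here as a stand-in for the maximum over all weights in formula (7):

```python
import numpy as np

def importance_weights(probs: np.ndarray, pool_size: int, beta: float) -> np.ndarray:
    """omega_i = (1 / (N * P(i)))**beta for the sampled experiences,
    normalized by the largest weight in the batch (cf. formula (7))."""
    w = (pool_size * probs) ** (-beta)
    return w / w.max()
```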
In the proposed Dueling DQN, the Q network is split into two parts. One part depends only on the state features and not on the actions; it is called the value function part and is denoted V(s; θ, β). The other part depends on both the state features and the actions; it is called the action advantage function part and is denoted A(s, a; θ, α). Here θ, α and β are the parameters of the respective parts.
As shown in fig. 6, in the ordinary DQN the input layer is followed by three convolution layers, then two fully connected layers, and finally the Q value of each action is output. In the Dueling DQN, after the input layer and the three convolution layers, the obtained information is passed separately to the value function and the action advantage function, which are then combined to output the action Q values. Simply adding the two function values at the end of the Q-value calculation would be unreasonable, so a method is used to aggregate the two, see formula (8):

Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) - (1/|A|) · Σ_{a'} A(s, a'; θ, α) )   (8)

where |A| is the total number of actions a.
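A minimal sketch of this aggregation, assuming PyTorch and fully connected layers in place of the convolutional front end, is:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))."""
    def __init__(self, in_features: int, n_actions: int, hidden: int = 30):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                       # (batch, 1)
        a = self.advantage(features)                   # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # (batch, n_actions)
```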
In optimizing the DQN, three partial optimizations are adopted: the DDQN optimizes the calculation of the target Q value, PER optimizes the sampling of experience values by priority, and the Dueling DQN optimizes the neural network structure.
In the deep reinforcement learning process, the weights of the evaluation network are normally copied to the target network every C steps. To ensure the stability of the DQN, a soft target-network update strategy is adopted instead of directly copying the weights: at every step of the learning process the target-network weights slowly track the evaluation network,

θ⁻ ← λ·θ + (1 - λ)·θ⁻

where λ is a hyper-parameter with 0 < λ < 1.
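An illustrative soft-update sketch (PyTorch assumed) is:

```python
import torch

@torch.no_grad()
def soft_update(target_net: torch.nn.Module, eval_net: torch.nn.Module, lam: float):
    """theta_target <- lam * theta_eval + (1 - lam) * theta_target, with 0 < lam < 1."""
    for t_param, e_param in zip(target_net.parameters(), eval_net.parameters()):
        t_param.mul_(1.0 - lam).add_(lam * e_param)
```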
A greedy selection strategy is designed. The agent must keep interacting with the external environment, which requires that every action be selected with the same probability as far as possible; this is exploration. Selecting the action with the maximum Q value so as to obtain more reward is called exploitation. During learning, the relationship between exploration and exploitation must be balanced, and an ε-greedy strategy is generally selected: a variable τ ∈ (0,1) is generated at random during learning and compared with a predefined value ε; if τ < ε, the action corresponding to the maximum Q value is selected, otherwise an action a is randomly executed from the action set of state s. ε is defined as in formula (33):
where e is the number of iterations, OP_sum is the total number of steps, and μ is a positive constant.
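A sketch of the selection step, following the convention stated above (τ < ε selects the greedy action) and leaving the ε schedule of formula (33) as an input parameter, is:

```python
import random
import torch

def epsilon_greedy(q_values: torch.Tensor, epsilon: float, n_actions: int) -> int:
    """If tau < epsilon, exploit (argmax Q); otherwise explore with a random action."""
    tau = random.random()
    if tau < epsilon:
        return int(q_values.argmax().item())
    return random.randrange(n_actions)
```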
The present invention applies this to a high-level network and a low-level network. In the high-level network the number of input state features is 11, and one of four optimization objectives is selected through the DDQN algorithm and the ε-greedy strategy; the four optimization objectives are the estimated workpiece advance and delay rate, the actual workpiece advance and delay penalty, the average machine processing energy consumption, and the average machine utilization.
First, the high-level agent obtains the current state s_ht of the dynamic flexible job shop production environment, then determines a temporary optimization objective g_t through the DDQN algorithm and the ε-greedy strategy and passes this objective to the low-level network. After receiving the objective g_t and the current state s_lt, the low-level agent determines a scheduling rule a_t according to the D3QN and the ε-greedy strategy, selects an operation O_{i,j} and assigns it to a machine M_k that can process it. The high-level agent, in the current state features s_ht and according to the optimization objective g_t, obtains the reward r_ht and the next state s_h(t+1); each experience tuple (s_ht, g_t, r_ht, s_h(t+1)) is stored in the high-level experience replay memory D_h. The low-level agent, according to the current state features s_lt, selects the action a_t, obtains the reward value r_lt and the next state s_l(t+1); each experience tuple (s_lt, a_t, r_lt, s_l(t+1)) is stored in the low-level experience replay memory D_l. Mini-batches of experiences are then sampled from the high-level and low-level replay memories with the prioritized experience replay method, and the network parameters θ_ht and θ_lt of the high level and the low level are updated from the sampled data.
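The interaction of the two levels in one scheduling step can be summarized with the following schematic Python sketch; the environment interface, agent objects and replay memories are illustrative placeholders rather than the patent's implementation:

```python
def hierarchical_step(env, high_agent, low_agent, D_h, D_l, eps_h, eps_l):
    """One scheduling decision: the high level picks a temporary objective g_t,
    the low level picks a composite dispatching rule a_t under that objective."""
    s_h = env.high_level_state()                       # 11 state features
    g = high_agent.select_objective(s_h, eps_h)        # DDQN + epsilon-greedy, g in {1,2,3,4}

    s_l = env.low_level_state(g)                       # state features plus objective (12 inputs)
    a = low_agent.select_rule(s_l, eps_l)              # D3QN + epsilon-greedy, 6 composite rules

    env.apply_rule(a)                                  # choose operation O_ij and machine M_k

    r_h, s_h_next = env.high_level_reward(g), env.high_level_state()
    r_l, s_l_next = env.low_level_reward(g), env.low_level_state(g)
    D_h.add(priority=1.0, experience=(s_h, g, r_h, s_h_next))
    D_l.add(priority=1.0, experience=(s_l, a, r_l, s_l_next))

    high_agent.learn(D_h)                              # prioritized mini-batch updates
    low_agent.learn(D_l)
```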
Drawings
FIG. 1 is a schematic diagram of an operation structure of a method for multi-objective dynamic flexible job shop scheduling with respect to energy consumption based on deep reinforcement learning according to the present invention;
FIG. 2 is a reinforcement learning basic framework;
FIG. 3 is a diagram of a neuron structure;
FIG. 4 is a diagram of a DDQN network structure;
FIG. 5 is a summing tree framework;
FIG. 6 is a DQN and DuelingDQN framework;
FIG. 7 is a hierarchical deep reinforcement learning algorithm framework;
fig. 8 is a comparison of the merits of the multi-objective solution.
Detailed Description
According to the method (see fig. 1) for multi-objective dynamic flexible job shop scheduling concerning energy consumption based on deep reinforcement learning, a multi-objective dynamic flexible job shop scheduling algorithm based on energy consumption is provided. The processing shop initially contains n_c workpieces and n_n workpieces are newly inserted, giving n workpieces in total (n = n_c + n_n) to be processed on m machines. Each workpiece has several operations, each operation of each workpiece can be processed on different machines, the order of the operations is known in advance, the processing times of different operations of different workpieces on different machines differ, and the processing powers of the machines also differ. The processing order of the different operations of all workpieces on the different machines and the machines used for processing are to be determined so that, over the whole working environment, the total time penalty of total advance and total delay of the workpieces is minimized and the energy consumption generated by the machines while processing all workpieces is minimized.
In the multi-objective dynamic flexible job shop scheduling problem, some constraint conditions need to be given:
1. each operation, once started to process, cannot be interrupted and no work piece is reworked;
2. the transition time of the workpiece between the two machines is negligible;
3. the operations of each workpiece are subject to precedence constraints, and different workpieces are mutually independent;
4. each working procedure of each workpiece can only select one machine to process;
5. multiple procedures can be processed on the same machine but cannot be processed simultaneously;
6. for one job, the next process can be processed only after the processing of the previous process is completed;
7. the work piece inserted in emergency does not affect the working procedure being processed;
8. the machine calculates the energy consumption only in the processing course, and the standby or starting course is not calculated;
9. when all the workpieces are processed, the scheduling ends.
In order to describe the scheduling problem, mathematical modeling of the entire scheduling process is required, and variable symbols and their meanings that may be used are now defined.
TABLE 1 workshop scheduling math symbols and definition thereof
Establishing a mathematical model for workshop scheduling:
C_{i,j} ≥ 0   (13)
0 ≤ b_{i,j} + t_{i,j} ≤ b_{i,(j+1)}   (15)
b_{i,j} + t_{i,j,k} ≤ b_{i′,j′} + Z⁺ · (1 - P_{k,i,j,i′,j′})   (16)
equation (10) represents minimizing the maximum finishing time.
Equation (11) represents a penalty for minimizing the total advance and total retard finish times.
Equation (12) represents minimizing total energy consumption.
Equation (13) shows that the operation time of each workpiece is a non-negative number.
Equation (14) shows that only one machine can be selected for each process of the workpiece.
Equation (15) shows that the finishing time of the j-th process of the workpiece i is not less than the start processing time thereof, and ensures that the start time of processing is not negative.
Equation (16) indicates that the workpiece can start the next process only after the previous process is completed.
Equation (17) shows that the completion time of the j-th process of the workpiece i is equal to the start processing time plus the processing time.
Equation (18) gives the method for calculating the due date of workpiece i; the smaller the coefficient v, the shorter the slack between the arrival time of the workpiece and its due date, i.e. the smaller the coefficient, the more urgent the workpiece is compared with the other workpieces.
The production state features determined by the invention include the number of processing machines m, the number of initial workpieces n_c, the number of newly inserted workpieces n_n, the workpiece due-date urgency coefficient v, the rate parameter EXP of the exponential distribution governing new workpiece insertion, the average utilization of all machines, the average completion rate of all operations, the average completion rate of all workpieces, the average processing energy consumption of the machines, the estimated workpiece advance and delay rate at scheduling time t, and the actual advance and delay penalty at scheduling time t.
The variables needed to define the state features are:
(1) CT_k(t): the completion time of the last operation on machine k at scheduling time t;
(2) OP_i(t): the number of operations of workpiece J_i completed so far;
(3) U_k(t): the machine utilization of machine k at scheduling time t;
(4) CJ_i(t): the completion rate of workpiece J_i at scheduling time t;
(5) W_k(t): the processing energy consumption of machine k at scheduling time t, see formula (21);
(6) T_cur: the average completion time of the last operation on all machines at the current scheduling time t;
(7) the average processing time of operation O_{i,j} on the machines that can process it.
defining 11 state characteristics from the above variables
1. Number of initial workpieces n c
2. Number of newly inserted workpieces n n
3. The total number of machines m;
4. a workpiece delivery period pressing coefficient v;
5. a new workpiece insertion index profile rate parameter EXP;
6. average utilization U of machine ave (t) see formula (23);
7. average completion rate of process CO ave (t) see formula (24);
8. average work rate CJ of work piece ave (t) see formula (25);
9. average processing energy consumption W of machine ave (t) see equation (26);
10. estimated work advance and retard rate ET r (t)
Definition T left For work J i Estimating the remaining processing time if T left +T cur >D i Work J i Estimating to be delayed; conversely, work J i Will finish in advance. The estimated work advance and retard amounts are the estimated work advance amount N early And estimated number of workpiece delays N tardy Is a sum of (a) and (b). Estimated work advance and retard rate ET r (t) equals N early And N tardy The sum divided by the number of all workpieces. Pseudocode is shown in table 2.
Table 2 calculation of estimated work piece advance and retard rates ET r (t)
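Since the table itself is not reproduced here, the following Python sketch illustrates the same calculation under the assumption that the remaining processing times and due dates of the unfinished workpieces are available as dictionaries:

```python
def estimated_advance_delay_rate(remaining_time, due_date, t_cur, n_total):
    """ET_r(t) = (N_early + N_tardy) / number of all workpieces.
    A workpiece is counted as tardy if T_left + T_cur exceeds its due date,
    otherwise as early; remaining_time and due_date are keyed by workpiece id."""
    n_tardy = sum(1 for j, t_left in remaining_time.items() if t_left + t_cur > due_date[j])
    n_early = len(remaining_time) - n_tardy
    return (n_early + n_tardy) / n_total
```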
11. the actual advance and delay time penalty ET_i(t).
ET(i) is defined as the actual advance and delay time penalty of workpiece J_i, which equals the per-unit-time penalty coefficient of workpiece J_i multiplied by its actual advance and delay time. The actual advance and delay penalty is normalized to ET_i(t) so that its range is [0, 1); the calculation formula is shown in (27). To prevent the value of ET_i(t) from reaching 1, a positive constant related to the number of workpieces n is added to the denominator. If no workpiece is early or delayed at scheduling time t, the numerator is 0 and ET_i(t) is also 0; if there are early and delayed workpieces, the advance and delay times become large, the numerator approaches the denominator and ET_i(t) is close to 1. The pseudocode is shown in table 3.
Table 3: calculation of the actual workpiece advance and delay penalty ET_i(t)
Six composite scheduling rules are designed. First define Tardy_i(t) as the set of workpieces whose due date is earlier than T_cur and which are not yet finished, and UC_i(t) as the set of all unfinished workpieces at scheduling time t.
Composite scheduling rule 1: if Tardy_i(t) is empty, i.e. there is no delayed workpiece, the unfinished workpieces are sorted by their remaining slack D_i - T_cur and the workpiece with the largest slack is selected for the next operation; if Tardy_i(t) is not empty, i.e. there are delayed workpieces, the estimated delay time of each workpiece is computed and the workpiece with the largest estimated delay time is selected for the next operation. After the operation to be processed is determined, it is assigned to a machine M_k for processing.
Composite scheduling rule 2: if there is no delayed workpiece, the ratio of the slack time to the remaining processing time of each workpiece is calculated and the results are sorted in ascending order; this ratio is also called the critical ratio, and the workpiece with the smallest critical ratio is selected for the next operation. If there is a delayed workpiece, the workpiece with the largest estimated delay time is selected. The resulting operation is assigned to a machine M_k.
Composite scheduling rule 3: if there is a delayed workpiece, the workpiece with the smallest estimated delay time is selected and its next operation is chosen as the next operation, which is assigned to the machine with the lowest machine utilization.
Composite scheduling rule 4: if there is no delayed workpiece, the defined quantity is computed for each workpiece and sorted in ascending order, and the workpiece with the smallest value is selected for the next operation; otherwise the workpieces are sorted in ascending order by the ratio of the estimated delay time to the workpiece due date, and the workpiece with the largest value is selected for the next operation. Finally, the operation is placed on the machine with the lowest machine utilization.
Composite scheduling rule 5: select the workpiece with the largest penalty coefficient P_s among the unfinished jobs, see formula (28), choose its next operation as the next operation and place it on the machine with the lowest machine utilization.
P_s = 0.4 × P_e + 0.6 × P_t   (28)
Composite scheduling rule 6: the workpiece is selected as above depending on whether or not there is a delayed workpiece, and the operation is placed on the machine with the lowest machine processing energy consumption.
The invention defines a high-level optimization objective g_t ∈ {1, 2, 3, 4}, whose values correspond to four different reward indices during training: the estimated advance and delay rate ET_r(t), the actual advance and delay penalty ET_i(t), the average machine processing energy consumption W_ave(t), and the average machine utilization U_ave(t). The low-level state-action reward is defined according to the high-level optimization objective, using the state feature values of the current state s(t) and the next state s(t+1). At scheduling time t: when g_h = 1, ET_r(t) and ET_r(t+1) are used as the feature target; when g_h = 2, ET_i(t) and ET_i(t+1); when g_h = 3, W_ave(t) and W_ave(t+1); when g_h = 4, U_ave(t) and U_ave(t+1). The low-level state-action reward value is obtained by comparing the two feature-target values; the calculation formulas are given in (29) to (32).
Two networks are set up in the deep reinforcement learning, a high-level network and a low-level network, where the high-level network controls the low-level network. The high-level network consists of an input layer, three hidden layers and an output layer; the input layer contains 11 neuron nodes, each hidden layer contains 30 neuron nodes, and the output layer contains 4 neuron nodes. The activation function is ReLU, the loss function is the mean square error (MSE) loss, and the optimizer is Adam. The low-level network consists of an input layer, four hidden layers and an output layer; the input layer contains 12 neuron nodes, each hidden layer contains 30 neuron nodes, and the output layer contains 6 neuron nodes. The activation function is ReLU, the loss function is the mean square error loss, and the optimizer is Adam.
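Under the assumption of a PyTorch implementation, the two networks described above could be instantiated as follows (the dueling split of the low-level D3QN, shown earlier, is omitted here for brevity):

```python
import torch.nn as nn
import torch.optim as optim

def mlp(sizes):
    """Fully connected network with ReLU between layers, per the sizes given above."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# High-level network: 11 inputs, three hidden layers of 30, 4 outputs (objectives)
high_net = mlp([11, 30, 30, 30, 4])
# Low-level network: 12 inputs, four hidden layers of 30, 6 outputs (composite rules)
low_net = mlp([12, 30, 30, 30, 30, 6])

loss_fn = nn.MSELoss()
high_opt = optim.Adam(high_net.parameters())
low_opt = optim.Adam(low_net.parameters())
```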
In the proposed dynamic flexible job shop scheduling problem, the instances are generated randomly, i.e. the number of operations of each workpiece and the processing times of the operations on the different machines are generated at random. The shop has an initial number of workpieces n_c and a number n_n of workpieces subsequently inserted for urgent processing; the newly inserted workpieces arrive at the shop according to a Poisson process, i.e. the inter-arrival times of two consecutively inserted workpieces follow an exponential distribution. The instance parameters are generated at random according to the values in table 4, where the values in brackets represent the possible values, and "random" and "uniform" represent a random integer value and a random real value drawn from the bracketed range, respectively.
Table 4 workshop scheduling parameters
The parameter values used in the deep reinforcement learning network are selected according to table 5.
TABLE 5 neural network parameter values
The pseudo code of the hierarchical deep reinforcement learning algorithm based on D3QN with prioritized experience replay is shown in table 6.
Table 6: hierarchical deep reinforcement learning algorithm based on D3QN with prioritized experience replay
A model diagram of the D3QN-based hierarchical deep reinforcement learning algorithm with prioritized experience replay applied to the dynamic flexible job shop scheduling problem is shown in fig. 7.
A multi-objective algorithm has n objective functions f_1(x), f_2(x), …, f_n(x), and solutions minimizing the objective function values must be found. Fig. 8 shows the solutions of two objective functions, where the axes f_1(x) and f_2(x) are the two objective function values, the black closed region is the feasible solution region, the curve between point A and point E is the Pareto-optimal front, and all Pareto-front optimal solutions form the Pareto-optimal solution set. Both objective values of the solution at point F are smaller than the corresponding values at points G and H, so solution F is said to be better than solutions G and H. Between points G and H, the abscissa of G is smaller than that of H but the ordinate of G is larger than that of H, so solution G and solution H are said to be indistinguishable (neither dominates the other).
In this example there are two optimization objectives: the total advance and total delay time penalty, and the total processing energy consumption, both to be minimized. To satisfy both objectives simultaneously, a set of optimal solutions on the Pareto front has to be found; these solutions favour the objectives to different degrees and intuitively reflect the trade-off between the two objectives. To evaluate the quality of these solutions, some metrics are needed. Common indicators include the Generational Distance (GD), the Inverted Generational Distance (IGD), and the Spread (Δ).
Generational Distance
GD is an indicator of how well the solutions generated by an algorithm converge to the Pareto-optimal front. For each solution in the generated approximate Pareto-front set, the Euclidean distance to the nearest solution on the true Pareto front is computed; the distances are combined and divided by the number of generated solutions to obtain the value of GD:

GD = (1/N) · ( Σ_{i=1}^{N} d_i^p )^{1/p}

where N is the number of approximate Pareto-front solutions in the generated solution set, p is the number of objectives (p = 2 in the bi-objective problem), and d_i is the Euclidean distance from the i-th generated approximate Pareto-front solution to the nearest solution on the Pareto front. The smaller the GD value, the closer the generated solutions are to the solutions on the Pareto front; if GD is 0, all generated solutions are Pareto-optimal.
Inverted Generational Distance
IGD is an indicator measuring how uniformly the solutions generated by the algorithm cover the Pareto-optimal front. For each solution on the Pareto front, the distance to the nearest approximate Pareto-front solution generated by the algorithm is computed; the sum of these distances is divided by the number N of solutions on the Pareto front to obtain the value of IGD:

IGD = (1/N) · Σ_{i=1}^{N} d_i

where N is the number of Pareto-front optimal solutions and d_i is the distance between the i-th Pareto-front optimal solution and the closest solution in the approximate Pareto front generated by the algorithm. The smaller the IGD, the closer the solutions on the Pareto front are to the approximate Pareto-front solutions; if IGD is 0, all generated solutions are Pareto-optimal.
Spread
The Spread measures how uniformly the approximate Pareto-front optimal solutions generated by the optimization algorithm are distributed.

Δ = ( Σ_{j=1}^{n} d_j^e + Σ_{i=1}^{N} |d_{i,A} - d̄| ) / ( Σ_{j=1}^{n} d_j^e + N·d̄ )

where P is the Pareto-front optimal solution set of the multi-objective optimization problem and A is the set of approximate Pareto-front solutions generated by the algorithm; d_j^e is the Euclidean distance between the extreme solution of the j-th objective in A and the extreme value of the j-th objective in P; d_{i,A} is the Euclidean distance between the i-th solution in A and its nearest neighbouring solution; d̄ is the mean of all d_{i,A}; N is the number of solutions in A; and n is the number of objective functions. The smaller the value of the Spread, the more uniformly the solutions generated by the algorithm are distributed on the Pareto-optimal front.
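The three indicators can be computed with a NumPy sketch such as the one below; it follows the standard definitions summarized above, and the extreme-point term of the Spread uses the per-objective maxima as a simplification:

```python
import numpy as np

def _pairwise_min_dist(src: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """For each point in src, the Euclidean distance to its nearest point in ref."""
    diffs = src[:, None, :] - ref[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def gd(approx_front: np.ndarray, true_front: np.ndarray, p: int = 2) -> float:
    """Generational Distance of the generated approximate front."""
    d = _pairwise_min_dist(approx_front, true_front)
    return (np.sum(d ** p) ** (1.0 / p)) / len(approx_front)

def igd(approx_front: np.ndarray, true_front: np.ndarray) -> float:
    """Inverted Generational Distance: distances from the true front to the approximation."""
    d = _pairwise_min_dist(true_front, approx_front)
    return d.sum() / len(true_front)

def spread(approx_front: np.ndarray, true_front: np.ndarray) -> float:
    """Generalized Spread: extreme-point gaps plus deviation of neighbour distances."""
    n_obj = approx_front.shape[1]
    # per-objective gap between the extreme values of A and of P (simplified)
    d_extreme = [abs(approx_front[:, j].max() - true_front[:, j].max())
                 for j in range(n_obj)]
    # distance from each solution in A to its nearest other solution in A
    diffs = approx_front[:, None, :] - approx_front[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    d_near = dist.min(axis=1)
    d_mean = d_near.mean()
    num = sum(d_extreme) + np.abs(d_near - d_mean).sum()
    den = sum(d_extreme) + len(approx_front) * d_mean
    return num / den
```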

Claims (7)

1. A method for multi-objective dynamic flexible job shop scheduling with respect to energy consumption based on deep reinforcement learning, characterized by comprising the steps of:
Step one, construct a high-level deep reinforcement learning network, whose agent obtains the state features s_t of the current scheduling time t from the shop scheduling environment;
Step two, evaluate the obtained state features s_t: input the state features s_t into the high-level deep reinforcement learning network for calculation, and obtain through the training and learning of the network the next state features s_{t+1} and a reward function r_t, where the reward function r_t is the temporary optimization objective of the current scheduling moment;
Step three, store the state features s_t of the current scheduling moment, the reward function r_t and the next state features s_{t+1} as an experience, and draw small batches of samples from the experience pool formed by the experiences with the prioritized experience replay method for calculation, so as to update the parameters of the high-level deep reinforcement learning network;
Step four, set up a low-level deep reinforcement learning network, and input the state features of the current high-level deep reinforcement learning network and the obtained temporary optimization objective into the low-level deep reinforcement learning network;
Step five, select the action a_t of the low-level deep reinforcement learning network with an ε-greedy algorithm, store the newly composed state features, action a_t, reward function r_t and next state as an experience, and draw small batches of samples with the prioritized experience replay method for calculation, so as to update the parameters of the low-level network.
2. The method for multi-objective dynamic flexible job shop scheduling with respect to energy consumption based on deep reinforcement learning according to claim 1, wherein in step one the state features are designed as the following parameters: the number of processing machines, the number of initial workpieces, the number of newly inserted workpieces, the workpiece due-date urgency coefficient, the rate parameter of the exponential distribution of new workpiece insertion, the average utilization of all machines, the average completion rate of all operations, the average completion rate of all workpieces, the average processing energy consumption of the machines, the estimated workpiece advance and delay rate at scheduling time t, and the actual advance and delay penalty at scheduling time t.
3. The method for multi-objective dynamic flexible job shop scheduling with respect to energy consumption based on deep reinforcement learning according to claim 1, wherein in the first step, the modeling process of the high-level deep reinforcement learning network comprises:
in the ordinary DQN algorithm, the agent selects an action a_t according to the state features s_t in the environment, obtains the reward value r_t, the next state s_{t+1} and a flag do for judging whether the episode has ended, where do means the episode has ended and not-do means it has not; s_t, a_t, r_t, s_{t+1} and do constitute an experience (s_t, a_t, r_t, s_{t+1}, do); the updated weights of the DQN algorithm are θ; in the calculation the DQN uses two networks, the evaluation network with reward value Q and the target network with reward value Q̂, whose parameters and models are initially the same; as learning proceeds, only the weights θ of the evaluation network are updated at each step within the learning interval C and the target network is not changed; every C learning steps the current weights of the evaluation network are copied to the target network, in the other steps the target network stays unchanged and its generated Q values remain fixed during that period, which avoids parameter divergence and keeps the target-network value convergent and stable, the target-network weights at each step being θ⁻;
a Double DQN algorithm, denoted DDQN, is designed on the basis of the ordinary DQN; the method constructs two action value functions, one for estimating the action and the other for estimating the value of the action; the evaluation network is used to determine the action and the target network to evaluate its value, and the Q value is calculated as in formula (2):

y_t = r_t + γ · Q̂(s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ⁻)   (2)

where γ is the discount rate;
the difference δ between the target Q value of the target network and the estimated Q value of the evaluation network during the operation of the DDQN method is called the TD error; the larger the TD error, the larger the difference between the two Q values of the sample and the larger its effect on back-propagation; the error is defined with the mean square error loss function, whose calculation formula is shown in formula (3):

L(θ) = (1/m) · Σ_{i=1}^{m} ( y_i - Q(s_i, a_i; θ) )²   (3)

where m is the number of samples over the whole training process;
the DDQN method operates with a greedy selection strategy; the agent must keep interacting with the external environment, which requires that each action be selected with the same probability as far as possible, called exploration; selecting the action with the maximum Q value so as to obtain more reward is called exploitation; during learning the relationship between exploration and exploitation must be balanced, and an ε-greedy strategy is selected: a variable τ ∈ (0,1) is generated at random during learning and compared with a predefined value ε; if τ < ε, the action corresponding to the maximum Q value is selected, otherwise an action a is randomly executed from the action set of state s; ε is defined as in formula (33):
where e is the number of iterations, OP_sum is the total number of steps, and μ is a positive constant.
4. A method of multi-objective dynamic flexible job shop scheduling with respect to energy consumption based on deep reinforcement learning as claimed in claim 3, wherein in step three and step five, the operation method of the priority experience replay method comprises:
when experience is selected each time, if experience samples are selected randomly each time, the importance degree of each sample is ignored, so that the learning effect is poor; selecting experience and calculating a corresponding estimated value, wherein the obtained estimated value has an error delta with an actual value, and when the delta is larger, the difference value between the Q estimated value and the Q actual value is larger, which indicates that the estimated value is not accurate enough; when δ is smaller, it means that the Q estimation value is closer to the Q actual value; in actual learning, it is necessary to make the value of |δ| as small as possible;
in order to make the error as small as possible, the estimated value should be as close as possible to the actual value each time an experience is selected, so a priority experience replay method is adopted for selection; the probability that each experience is selected is first calculated as

P(i) = p_i^α / Σ_k p_k^α

where p is the priority of each experience, k indexes the stored experiences up to the buffer capacity, and α is a hyperparameter;
in calculating the priority, a proportion-based priority calculation method is used, p_i = |δ_i| + ρ, where ρ is a constant slightly greater than 0; this ensures that even when |δ_i| is 0, p_i is still greater than 0 and P(i) ≠ 0, i.e. every experience always has some probability of being selected;
in order to select experiences by priority, the data are selected using a sum-tree method; the sum tree is a tree structure consisting of nodes and branches; each node stores a priority value p, each node is defined to extend only two branches, and the priority of a node is the sum of the priorities p of its two child nodes; the branches continue to extend downward, so the priority stored at the top node of the whole tree is the sum of the priorities p of all leaves;
when experiences are drawn, the total priority value is first divided by the minibatch size, i.e. the number of experiences to be drawn, to obtain several subintervals; a value is selected at random within each subinterval, and the sum tree is searched downward from the top according to that value to obtain the final sequence of sampled values and the corresponding experiences;
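A compact, array-based sum-tree sketch consistent with the description above: each parent node stores the sum of its two children, sampling splits the total priority into minibatch-sized subintervals, and a random value from each subinterval is traced down from the root; the class and method names are hypothetical.

```python
import random

class SumTree:
    """Array-based binary tree; each parent stores the sum of its two children,
    so the root (index 1) holds the total priority of all stored experiences."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)     # index 0 unused; leaves start at capacity
        self.data = [None] * capacity
        self.write = 0

    def add(self, priority: float, experience):
        leaf = self.write + self.capacity
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf: int, priority: float):
        change = priority - self.tree[leaf]
        while leaf >= 1:                       # propagate the change up to the root
            self.tree[leaf] += change
            leaf //= 2

    def get(self, value: float):
        """Walk down from the root, entering the child whose range contains value."""
        idx = 1
        while idx < self.capacity:             # stop when a leaf is reached
            left = 2 * idx
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity]

def sample(tree: SumTree, batch_size: int):
    """Split the total priority into batch_size subintervals, draw one value from each."""
    segment = tree.tree[1] / batch_size
    return [tree.get(random.uniform(i * segment, (i + 1) * segment))
            for i in range(batch_size)]
```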
the bias introduced in this way needs to be eliminated, so the loss function of the neural network is optimized; on the basis of the original formula (3), the loss function takes the experience priorities into account and applies importance-sampling weights ω_i to eliminate the influence of the introduced bias; the specific loss function is formula (5):

L(θ) = (1/m) · Σ_{i=1}^{m} ω_i · [Y_i − Q(s_i, a_i; θ)]²   (5)

the main difference between the two loss functions is the introduction of the importance-sampling weights ω_i used in priority experience replay, defined as ω_i = (1/(N · P(i)))^β, where N is the total number of experiences in the experience pool and β is a hyperparameter representing the degree to which the influence of PER on the convergence result is eliminated; when β = 0, importance sampling is not used; when β = 1, since experiences are not selected at random but by priority, all ω_i and P(i) cancel each other out, the effect of priority experience replay is lost, and the method degenerates into ordinary experience replay;
for the stability of the algorithm, the importance weights ω_i are normalized, see formula (7):

ω_i ← ω_i / max_k(ω_k)   (7)

where max_k(ω_k) denotes the largest value among all importance weights.
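A small sketch of the selection probabilities P(i) and the normalized importance-sampling weights used in the loss of formula (5), with the normalization of formula (7); NumPy is assumed, and the default α and β values are illustrative only.

```python
import numpy as np

def per_weights(priorities, alpha=0.6, beta=0.4, batch_idx=None):
    """P(i) = p_i^alpha / sum_k p_k^alpha and
    w_i = (N * P(i))^(-beta) / max_k(w_k)  (normalization of formula (7))."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    n = len(probs)
    weights = (n * probs) ** (-beta)
    weights /= weights.max()          # largest weight becomes 1
    if batch_idx is not None:
        return probs[batch_idx], weights[batch_idx]
    return probs, weights

# In training, each sample's squared TD error in formula (5) would be
# multiplied by its weight w_i before averaging.
```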
5. The method for multi-objective dynamic flexible job shop scheduling related to energy consumption based on deep reinforcement learning according to claim 3, wherein in step four, the modeling process of the set low-level deep reinforcement learning network comprises:
continuing modeling on the basis of the high-level deep reinforcement learning network, wherein the process is as follows:
the Q network is divided into two parts, one of which is related to the state features but not to the actions, called the state-value function part and denoted V(s; θ, β); the other part is related to both the state features and the actions, called the action-advantage function part and denoted A(s, a; θ, α); here θ, α and β are the parameters of the respective parts;
in the Dueling DQN (competing DQN), three convolution layers are connected after the input layer, and the resulting features are passed separately to the value function and the action-advantage function, which are then combined to output the Q value of each action; finally the Q value is obtained by integrating the two streams according to formula (8):

Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|A|) · Σ_{a'} A(s, a'; θ, α) )   (8)

where |A| denotes the total number of actions a;
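A minimal dueling-head sketch implementing the aggregation of formula (8); for brevity a small fully connected trunk is used in place of the three convolution layers mentioned above, and PyTorch and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: a shared trunk feeds a state-value stream V(s; theta, beta)
    and an advantage stream A(s, a; theta, alpha), combined as in formula (8)."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)              # V(s; theta, beta)
        self.advantage = nn.Linear(64, n_actions)  # A(s, a; theta, alpha)

    def forward(self, s):
        h = self.trunk(s)
        v = self.value(h)
        a = self.advantage(h)
        # Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))
        return v + a - a.mean(dim=1, keepdim=True)
```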
in optimizing the DQN, three kinds of optimization are adopted: Double DQN is used to optimize the calculation of the target value Q, PER is used to optimize the sampling of experiences by priority when experience values are drawn, and Dueling DQN is used to optimize the neural network structure;
in the ordinary deep reinforcement learning process, the weights of the evaluation network are copied to the target network every C steps to ensure the stability of DQN; here a soft target-network update strategy is adopted instead, see formula (9), in which the target-network weights slowly track the evaluation-network weights and are updated at every step of the learning process:

θ⁻ ← λ·θ + (1 − λ)·θ⁻   (9)

where λ is a hyperparameter with 0 < λ < 1.
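The soft target-network update of formula (9) can be sketched as follows; PyTorch is assumed and λ = 0.005 is an illustrative value only.

```python
import torch

@torch.no_grad()
def soft_update(eval_net, target_net, lam=0.005):
    """theta^- <- lam * theta + (1 - lam) * theta^-, applied at every learning step."""
    for p_eval, p_target in zip(eval_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - lam).add_(lam * p_eval)
```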
6. The method for multi-objective dynamic flexible job shop scheduling related to energy consumption based on deep reinforcement learning according to claim 3, wherein in step two, temporary optimization objectives are designed as the estimated earliness and tardiness rate, the actual earliness and tardiness penalty, the average processing energy consumption of the machines, and the average machine utilization, and one of these objectives is selected for optimization at a time by randomization.
7. The method for multi-objective dynamic flexible job shop scheduling related to energy consumption based on deep reinforcement learning according to claim 1, wherein in step five, according to the optimization objective of the high-level deep reinforcement learning network, the reward function is designed to be calculated from four optimization objectives, namely the estimated earliness and tardiness rate, the actual earliness and tardiness penalty, the average processing energy consumption of the machines, and the average machine utilization; the next state is compared with the current state, and if the objective is improved the reward is defined as a positive value, if there is no change it is defined as 0, and otherwise it is defined as a negative value.
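A hedged sketch of the sign convention described in this claim: the chosen objective's value in the next state is compared with its value in the current state, and the reward is positive, zero or negative accordingly; the helper name and the unit reward magnitudes are hypothetical.

```python
def step_reward(prev_metric: float, next_metric: float, minimize: bool = True) -> float:
    """Compare the chosen objective before and after the scheduling decision."""
    improved = next_metric < prev_metric if minimize else next_metric > prev_metric
    if improved:
        return 1.0       # objective improved -> positive reward
    if next_metric == prev_metric:
        return 0.0       # no change -> zero reward
    return -1.0          # objective worsened -> negative reward
```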
CN202310433751.1A 2023-04-21 2023-04-21 Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning Pending CN116644902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310433751.1A CN116644902A (en) 2023-04-21 2023-04-21 Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310433751.1A CN116644902A (en) 2023-04-21 2023-04-21 Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116644902A true CN116644902A (en) 2023-08-25

Family

ID=87623735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310433751.1A Pending CN116644902A (en) 2023-04-21 2023-04-21 Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116644902A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A (en) * 2023-12-26 2024-01-30 长春工业大学 Multi-AGV load balancing and task scheduling method based on lasting DQN algorithm
CN117474295B (en) * 2023-12-26 2024-04-26 长春工业大学 Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination