CN114625089B - Job shop scheduling method based on improved near-end strategy optimization algorithm - Google Patents

Job shop scheduling method based on improved near-end strategy optimization algorithm

Info

Publication number
CN114625089B
Authority
CN
China
Prior art keywords
time
state
workpiece
function
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210255402.0A
Other languages
Chinese (zh)
Other versions
CN114625089A (en)
Inventor
刘歆宁
张明会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Neusoft University of Information
Original Assignee
Dalian Neusoft University of Information
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Neusoft University of Information filed Critical Dalian Neusoft University of Information
Priority to CN202210255402.0A priority Critical patent/CN114625089B/en
Publication of CN114625089A publication Critical patent/CN114625089A/en
Application granted granted Critical
Publication of CN114625089B publication Critical patent/CN114625089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 19/00 - Programme-control systems
    • G05B 19/02 - Programme-control systems electric
    • G05B 19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B 19/41865 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by job scheduling, process planning, material flow
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 2219/00 - Program-control systems
    • G05B 2219/30 - Nc systems
    • G05B 2219/32 - Operator till task planning
    • G05B 2219/32252 - Scheduling production, machining, job shop
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 - Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a job shop scheduling method based on an improved near-end strategy optimization algorithm, which comprises the following steps: S1: defining the processing information of a job shop; S2: defining the shop work environment state information; S3: defining a reward function based on the scheduling target and a timestamp, and acquiring an initial reward function value; S4: optimizing the initial reward function value to obtain a dense reward function value; S5: establishing an improved near-end strategy optimization algorithm model, and acquiring the optimized serial number of the workpiece to be processed according to the shop processing information, the shop work environment state information and the dense reward function values. The defined shop environment state information balances learning accuracy and speed, so the scheduling method trains quickly while still using ample effective state information; at the same time, the dense reward function values derived from the scheduling target make the training results more robust.

Description

Job shop scheduling method based on improved near-end strategy optimization algorithm
Technical Field
The invention relates to the technical field of job shop scheduling, and in particular to a job shop scheduling method based on an improved near-end strategy optimization algorithm (i.e. an improved proximal policy optimization, PPO, algorithm).
Background
At present, market competition is increasingly fierce and product demands are constantly changing, so enterprises are gradually shifting from the traditional large-batch production mode to a multi-variety, small-batch, personalized production mode. Under competitive pressures such as shorter delivery times, higher reliability and faster product turnover, an efficient workshop scheduling scheme is particularly important. Efficient, low-cost shop scheduling decisions are fundamental to the operation of a production system. Therefore, how to make workshop scheduling more autonomous, predictive and intelligent has become one of the problems of greatest concern to scholars and enterprises at home and abroad.
Traditional heuristic algorithms mainly include genetic algorithms and swarm intelligence algorithms (such as ant colony, particle swarm and artificial bee colony algorithms), which generally suffer from weak local or global search capability, low accuracy in later stages and slow convergence. In traditional reinforcement learning, policy-gradient methods must interact continuously with the shop work environment, so training is slow; in addition, the learning-rate parameter is hard to tune: too small and convergence is slow, too large and training becomes unstable. DQN (deep Q-network) suffers from over-estimation of the Q value and has many hyper-parameters, which makes training costly and the results not very robust. Meanwhile, existing algorithms for job shop scheduling have the following problems: ① During reinforcement-learning training in a job shop, the action space is huge, so searching it for an optimal scheme makes training too expensive. At the same time, the definition of the state must balance learning accuracy and speed: under the same algorithm, the more effective state information there is, the better the learning effect, but more information also means more computation in the neural network and therefore lower performance. ② Because the influence of each action on the final global scheduling result is hard to estimate during training, the reward value is almost always given only after a round ends, so the agent cannot obtain enough reward signals (i.e. the reward is sparse), which makes training slow or prevents effective learning.
Disclosure of Invention
The invention provides a job shop scheduling method based on an improved near-end strategy optimization algorithm, which aims to overcome the technical problems.
In order to achieve the above object, the technical scheme of the present invention is as follows:
A job shop scheduling method based on an improved near-end policy optimization algorithm comprises the following steps:
S1: defining processing information of a job shop;
s2: defining workshop operation environment state information;
S3: defining a reward function based on a scheduling target and a time stamp, and acquiring an initial reward function value;
S4: optimizing the initial reward function value to obtain a dense reward function value;
S5: establishing an improved near-end strategy optimization algorithm model, and acquiring the optimized serial number of the workpiece to be processed according to the shop processing information, the shop work environment state information and the dense reward function values.
Further, the machining information of the job shop in S1 includes: the serial number p of the workpiece to be processed (p is less than or equal to n), the serial number q of the machine to be processed (q is less than or equal to m), and the machining time time_pq of the p-th workpiece on the q-th machine;
The processing information of the job shop is expressed in a matrix form as:
Wherein n represents the total number of workpieces to be processed; m represents the total number of machines; N_p represents the p-th workpiece to be machined; M_pq denotes the machining of the p-th workpiece on the q-th machine.
Further, the workshop operation environment state information in S2 includes:
state [0]: indicating whether the current workpiece can be executed;
wherein:
wherein: false indicates that the workpiece cannot be executed, true indicates that the workpiece can be executed;
state [1]: indicating the normalized remaining time of the current process;
wherein:
state[1]=max(0,time_left_current_op-difference)/max_time_op (3)
Wherein: time_left_current_op represents the remaining time of the current process; difference represents the time interval from the last state update under the current timestamp; max_time_op represents the maximum scheduling time in all procedures;
state [2]: representing the percentage of the current process to be performed;
wherein:
state[2]=time_step_job/machines (4)
Wherein: time_step_job represents the sequence number currently being executed; the machines represents the number of machines required for the current workpiece;
state [3]: indicating the normalized total remaining time of the current workpiece;
wherein:
state[3]=total_perform_op_time_jobs/max_time_jobs (5)
total_perform_op_time_jobs=total_perform_op_time_jobs_o+ min(difference,time_left_current_op) (6)
Wherein: total_perform_op_time_jobs represents the total remaining time of the current workpiece; max_time_jobs represents the maximum cumulative schedule time among all the workpieces; total_perform_op_time_jobs_o represents the total remaining time of the last machined workpiece;
state [4]: indicating the normalized available time of the machine required by the next process;
wherein:
state[4]=max(0,time_until_available_machine–difference)/max_time_op (7)
wherein: time_until_available_machine represents the required machine availability time;
state [5]: indicating normalized idle time after completion of the previous process
Wherein:
state[5]=(difference-time_left_current_op)/sum_op (8)
Wherein: sum_op represents the cumulative scheduling time of all working procedures of all workpieces;
state [6]: representing normalized cumulative idle time;
state[6]=old_state[6]+state[5] (9)
wherein: old_state [6] represents normalized cumulative idle time after the last process is completed.
Further, the reward function in S3 is:
R(t)=T_q-idle(t_q,t_q+1) (10)
Wherein: R(t) represents the reward function; t_q denotes the moment when the q-th process of the p-th workpiece (p ≤ n) starts to be executed; t_q+1 denotes the moment when the (q+1)-th process of the p-th workpiece starts to be executed; idle represents a function of the idle time from t_q to t_q+1; T_q represents the completion time required by the current process.
Further, in S4, the process of optimizing the initial prize function value is as follows:
R_t=γR_t-1+r_t (11)
Wherein: γ represents the discount factor; R_t-1 represents the cumulative reward value at time t-1; R_t represents the cumulative reward value at time t; r_t represents the initial reward function value output at time t;
Solving the mean and variance of the discounted initial reward function values at time t:
Wherein: A_t represents the mean of the initial reward function values at time t, and V_t represents the variance of the initial reward function values at time t;
Initializing A_t and V_t;
The standard-deviation processing of the initial reward function value r_t output at time t is performed as follows:
r_tb=r_t/sqrt(V_t) (14)
Wherein: r_tb represents the initial reward function value after standard-deviation processing; sqrt denotes the square-root operation;
Clipping the standard-deviation-processed initial reward function value to obtain the dense reward function value as follows:
r_ty=clip(r_tb,-1,1) (15)
Wherein: r_ty denotes the dense reward function value.
Further, in S5, the improved near-end strategy optimization algorithm model is established as follows:
wherein: the optimization objective is the reward value expectation under the target parameter θ; E represents the expectation over the random variable within one training round; τ is the random variable of one training round; R(τ) represents the reward value distribution of a training round; p_θ(τ) represents the probability distribution of the learning trajectory, p_θ'(τ) represents the probability distribution of the sampling trajectory, and τ~p_θ(τ) means that the distribution function of τ is p_θ(τ);
wherein: the gradient function is written in continuous form; s_t represents the state at time t and a_t represents the action at time t; π_θ represents the policy with θ as the target parameter; A^θ(s_t,a_t) is the advantage function and A^θ'(s_t,a_t) is the advantage function in continuous form; p_θ(a_t|s_t) represents the probability that the target policy takes action a_t in state s_t;
the above gradient formula is written in discrete form as follows:
wherein: the advantage function and the gradient function are written in discrete form;
p_θk(a_t|s_t) represents the strategy that interacts with the shop work environment; clip is a clamping function and ε represents a hyper-parameter.
Further, before S5, the method further includes the following steps:
If the working procedures of two workpieces are started at the same time under the same time stamp, a workpiece processing working procedure priority selection strategy is formulated; the workpiece processing procedure priority selection strategy is as follows:
Assuming that workpiece p1 and workpiece p2 are simultaneously allocated to the same machine, that workpiece p1 has 1 process remaining before completion, and that workpiece p2 has f processes remaining before completion,
If f>1, then p1.state[0]=false and p2.state[0]=true;
Otherwise: p1.state[0]=true and p2.state[0]=false;
Wherein: p1.state[0] represents the state[0] of workpiece p1, and p2.state[0] represents the state[0] of workpiece p2.
The beneficial effects are that: the invention provides a job shop scheduling method based on an improved near-end strategy optimization algorithm. The defined shop environment state information balances learning accuracy and speed, so the method trains quickly while still using ample effective state information; in addition, the dense reward function values derived from the scheduling target make the training results more robust.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of a job shop scheduling method of the present invention;
FIG. 2 is a graph comparing prize values obtained before and after the improved near-end policy optimization algorithm of the present invention;
FIG. 3 is a Gantt chart of the present invention using an improved proximal strategy optimization algorithm;
FIG. 4 is a graph of the mean value of the prize values for each round of the dense prize function of the present invention;
FIG. 5 is a graph of the overall loss value for the present invention;
FIG. 6 is a schematic diagram of an improved-based near-end optimization algorithm of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment provides a job shop scheduling method based on an improved near-end policy optimization algorithm, as shown in fig. 1 and 6, comprising the following steps:
The embodiment firstly needs to define the processing information of a job shop;
S1: defining processing information of a job shop;
The processing information of the job shop in S1 includes: the serial number p of the workpiece to be processed, the serial number q of the machine to be processed, and the processing time time_pq of the p-th workpiece on the q-th machine;
the specific information of the job shop is expressed as a matrix:
Wherein n represents the total number of workpieces to be processed; m represents the total number of machines; N_p represents the p-th workpiece to be machined; M_pq denotes the machining of the p-th workpiece on the q-th machine;
Specifically, assume there are n workpieces and each workpiece must be machined by m machines. Each machining step of a workpiece on a specific machine is called a "process", and the machining time of each process is determined in advance. The processing route of each workpiece is also predetermined, so a workpiece can be regarded as a sequence of processes. Job shop scheduling in this embodiment therefore means arranging a reasonable processing order and start time for the workpieces so as to minimize the maximum completion time. The steps for converting the mathematical job shop problem into a scheduling environment are as follows:
In the present embodiment, assume the number of workpieces n=15 and the number of machines m=15; the problem to be solved is then the job shop scheduling problem of processing 15 workpieces on 15 machines. The specific information of the job shop is expressed in text form as shown in Table 1. For the 1st workpiece N_1, M_11 represents the 1st workpiece being processed on the 1st machine and time_11 represents the corresponding processing man-hours; M_12 represents the 1st workpiece being processed on the 2nd machine and time_12 represents the corresponding processing man-hours, and so on. Concretely, row 1 indicates that workpiece 1 first requires 94 man-hours on machine 6, then 66 man-hours on machine 12, and so on, so every adjacent pair of numbers represents one process of the current workpiece, and the 15 rows define all the process information required for all the workpieces. Combining two adjacent elements of each row of Table 1 into one tuple a_pq=(M_pq, time_pq), the 15x15 job shop matrix can be represented by the 15x30 text in the table as follows:
Table 1 raw information of job shop
6 94 12 66 4 10 7 53 3 26 2 15 10 65 11 82 8 10 14 27 9 93 13 92 5 96 0 70 1 83
4 74 5 31 7 88 14 51 13 57 8 78 11 8 9 7 6 91 10 79 0 18 3 51 12 18 1 99 2 33
1 4 8 82 9 40 12 86 6 50 11 54 13 21 5 6 0 54 2 68 7 82 10 20 4 39 3 35 14 68
5 73 2 23 9 30 6 30 10 53 0 94 13 58 4 93 7 32 14 91 11 30 8 56 12 27 1 92 3 9
7 78 8 23 6 21 10 60 4 36 9 29 2 95 14 99 12 79 5 76 1 93 13 42 11 52 0 42 3 96
5 29 3 61 12 88 13 70 11 16 4 31 14 65 7 83 2 78 1 26 10 50 0 87 9 62 6 14 8 30
12 18 3 75 7 20 8 4 14 91 6 68 1 19 11 54 4 85 5 73 2 43 10 24 0 37 13 87 9 66
11 32 5 52 0 9 7 49 12 61 13 35 14 99 1 62 2 6 8 62 4 7 3 80 9 3 6 57 10 7
10 85 11 30 6 96 14 91 0 13 1 87 2 82 5 83 12 78 4 56 8 85 7 8 9 66 13 88 3 15
6 5 11 59 9 30 2 60 8 41 0 17 13 66 3 89 10 78 7 88 1 69 12 45 14 82 4 6 5 13
4 90 7 27 13 1 0 8 5 91 12 80 6 89 8 49 14 32 10 28 3 90 1 93 11 6 9 35 2 73
2 47 14 43 0 75 12 8 6 51 10 3 7 84 5 34 8 28 9 60 13 69 1 45 3 67 11 58 4 87
5 65 8 62 10 97 2 20 3 31 6 33 9 33 0 77 13 50 4 80 1 48 11 90 12 75 7 96 14 44
8 28 14 21 4 51 13 75 5 17 6 89 9 59 1 56 12 63 7 18 11 17 10 30 3 16 2 7 0 35
10 57 8 16 12 42 6 34 4 37 1 26 13 68 14 73 11 5 0 8 7 12 3 87 2 83 9 20 5 97
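To make the text form of Table 1 concrete, the following Python sketch parses such rows into the tuples a_pq=(M_pq, time_pq) described above. It is an illustration only; the names parse_instance and Operation are introduced here and do not come from the patent.

```python
from typing import List, Tuple

Operation = Tuple[int, int]  # (machine index M_pq, processing time time_pq)

def parse_instance(rows: List[str]) -> List[List[Operation]]:
    """Each row lists 'machine time machine time ...' for one workpiece,
    so adjacent number pairs form the tuple a_pq = (M_pq, time_pq)."""
    jobs = []
    for row in rows:
        numbers = [int(x) for x in row.split()]
        ops = [(numbers[i], numbers[i + 1]) for i in range(0, len(numbers), 2)]
        jobs.append(ops)
    return jobs

# First two rows of Table 1: workpiece 1 starts with 94 man-hours on machine 6.
raw = [
    "6 94 12 66 4 10 7 53 3 26 2 15 10 65 11 82 8 10 14 27 9 93 13 92 5 96 0 70 1 83",
    "4 74 5 31 7 88 14 51 13 57 8 78 11 8 9 7 6 91 10 79 0 18 3 51 12 18 1 99 2 33",
]
jobs = parse_instance(raw)
assert jobs[0][0] == (6, 94)  # first process of workpiece 1
```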
S2: defining the workshop operation environment state information;
The workshop operation environment state information can be directly input into an improved near-end optimization algorithm for training to obtain corresponding scheduling actions, so that the quality of the scheduling results is directly influenced by the quality of the workshop operation environment state information definition. If the workshop work environment state information is incomplete, good scheduling results cannot be obtained even with a better reinforcement learning algorithm. Also, if there is excessive or redundant shop environment status information, the training process may be slow.
Before the workshop operation environment state information definition is carried out, the operation workshop information is traversed firstly, and the following values are obtained:
maximum cumulative scheduling time in all workpieces: max_time_jobs
Maximum scheduling time in all procedures: max_time_op
Cumulative scheduling time for all working procedures of all workpieces: sum_op
The number of machines required for the current workpiece: MACHINES
The above values are static data and remain unchanged during the scheduling process. A set of 7-dimensional vectors is then defined representing each moment in time shop floor job environment state information, which is dynamic data that changes continuously as the scheduling process proceeds. When the state information of the workshop working environment is updated each time, a current time stamp is obtained, so that the time interval difference between the current time stamp and the last state update can be obtained.
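As an illustration of how the four static values above could be obtained in a single traversal, the sketch below assumes the jobs structure produced by the parsing sketch after Table 1; the helper name static_quantities is hypothetical.

```python
def static_quantities(jobs):
    """jobs: list of workpieces, each a list of (machine, time) operations."""
    job_totals = [sum(t for _, t in ops) for ops in jobs]
    max_time_jobs = max(job_totals)                       # maximum cumulative scheduling time over all workpieces
    max_time_op = max(t for ops in jobs for _, t in ops)  # maximum scheduling time over all processes
    sum_op = sum(job_totals)                              # cumulative time of all processes of all workpieces
    machines = [len(ops) for ops in jobs]                 # number of machines required by each workpiece
    return max_time_jobs, max_time_op, sum_op, machines
```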
The workshop operation environment state information comprises:
state [0]: indicating whether the current workpiece can be executed; state [0] can accelerate the training process of the near-end strategy optimization algorithm, because a mask can directly shield workpieces that cannot be executed and thus keeps the training process from searching aimlessly;
wherein:
wherein: false indicates that the workpiece cannot be executed, true indicates that the workpiece can be executed;
state [1]: representing and normalizing the residual time of the current process, namely normalizing the residual time of the current process to be between 0 and 1 so as to improve the subsequent optimized learning performance;
wherein:
state[1]=max(0,time_left_current_op-difference)/max_time_op (3)
Wherein: time_left_current_op represents the remaining time of the current process; difference represents the time interval from the last state update under the current timestamp; max_time_op represents the maximum scheduling time in all procedures;
state [2]: representing the percentage of the current process to be performed;
wherein:
state[2]=time_step_job/machines (4);
Wherein: time_step_job represents the sequence number currently being executed;
state [3]: representing the total residual time of the normalized current workpiece, namely normalizing the total residual time of the current workpiece to be between 0 and 1 so as to improve the subsequent optimized learning performance;
wherein:
state[3]=total_perform_op_time_jobs/max_time_jobs (5)
Wherein total_perform_op_time_jobs=total_perform_op_time_jobs_o+min(difference,time_left_current_op) (6)
Wherein: total_perform_op_time_jobs represents the total remaining time of the current workpiece; max_time_jobs represents the maximum cumulative schedule time among all the workpieces; total_perform_op_time_jobs_o represents the total remaining time of the last machined workpiece;
state [4]: representing the machine normalization available time required by the next process, namely normalizing the available time of the machine required by the next process to be between 0 and 1 so as to improve the subsequent optimized learning performance;
wherein:
state[4]=max(0,time_until_available_machine–difference)/max_time_op (7)
wherein: time_until_available_machine represents the required machine availability time;
If time_until_available_machine-difference <= 0, it indicates that the occupied machine is already idle and can be used by the next process of the current workpiece.
State [5]: indicating the normalized idle time after completion of the previous process (this is not an accumulated time: the current idle time is cleared once the next process starts); the idle time after completion of the previous process is normalized to between 0 and 1 to improve subsequent learning performance;
wherein:
state[5]=(difference-time_left_current_op)/sum_op (8)
Wherein: sum_op represents the cumulative scheduling time of all working procedures of all workpieces;
state [6]: representing normalized cumulative idle time; the state variable is actually an accumulated value of state [5], is not cleared in the process, is used for recording the accumulated idle time of the current workpiece, and normalizes the accumulated idle time to be between [0-1] so as to improve the subsequent optimized learning performance;
state[6]=old_state[6]+state[5] (9)
wherein: old_state [6] represents normalized accumulated idle time after the last process is completed;
Considering both speed and accuracy, these 7 items of shop work environment state information are selected for workpiece machining; because the dimension of the state vector is small, the computation speed of the neural network can be guaranteed. At the same time, the 7 items of state information are representative, and other common scheduling rules can easily be constructed from them: besides the minimum maximum available time rule used in this embodiment, a first-in first-out (FIFO) rule only requires maximizing state [5], and a most-work-remaining (MWKR) rule, which prefers the workpiece with the longest remaining processing time, only requires maximizing state [3]; the method can therefore be applied conveniently to different rules.
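The sketch below assembles the 7-dimensional observation from the quantities defined in equations (2)-(9). It is a simplified illustration under the naming used above, not the patented implementation; build_state and its argument list are assumptions.

```python
import numpy as np

def build_state(executable, time_left_current_op, difference, max_time_op,
                time_step_job, machines, total_perform_op_time_jobs, max_time_jobs,
                time_until_available_machine, sum_op, old_state6):
    state = np.zeros(7, dtype=np.float32)
    state[0] = 1.0 if executable else 0.0                                         # eq. (2)
    state[1] = max(0.0, time_left_current_op - difference) / max_time_op          # eq. (3)
    state[2] = time_step_job / machines                                           # eq. (4)
    state[3] = total_perform_op_time_jobs / max_time_jobs                         # eq. (5)
    state[4] = max(0.0, time_until_available_machine - difference) / max_time_op  # eq. (7)
    state[5] = (difference - time_left_current_op) / sum_op                       # eq. (8)
    state[6] = old_state6 + state[5]                                              # eq. (9)
    return state
```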
S3: defining a reward function based on the scheduling target and the timestamp.
The present embodiment defines the reward function based on the scheduling target and timestamp splitting. The scheduling target here is to minimize the maximum completion time, i.e. to minimize the time taken by the workpiece that finishes last. When the q-th process of the p-th workpiece (p ≤ n) starts to be executed, that moment is defined as t_q, the reward value is initialized to the completion time T_q required by the current process, and the moment at which the next process q+1 starts is defined as t_q+1. A reward value r_t is then returned immediately after each machine action.
The reward function is:
R(t)=T_q-idle(t_q,t_q+1) (10)
Wherein: R(t) represents the reward function; idle represents a function of the idle time from t_q to t_q+1; T_q represents the completion time required by the current process; t_q denotes the moment when the q-th process of the p-th workpiece (p ≤ n) starts to be executed; t_q+1 denotes the moment when the (q+1)-th process of the p-th workpiece starts to be executed. If the idle time between two adjacent processes is larger, r_t is smaller, the accumulated reward value is smaller, and negative feedback is obtained: the larger the idle time, i.e. the lower the machine utilization, the larger the corresponding maximum completion time.
Similarly, if the idle time is smaller, r_t is larger, the accumulated reward value is also larger, and positive feedback is obtained: the smaller the idle time, the smaller the maximum completion time. Since the reward value is then large, the agent will with high probability move in the direction of small idle time. The agent in this embodiment refers to the model based on the improved near-end strategy optimization algorithm.
However, if processes of several workpieces start at the same moment under the same timestamp, the above reward function has the drawback that, when the idle time is computed, only the idle time of the process that ends later can be counted. Therefore this embodiment splits the timestamp: tasks that start at the same moment are split onto a front timestamp and a rear timestamp. Because the sampling interval is short enough relative to the scheduling time, the precision loss can be ignored, while denser reward values are obtained and learning becomes easier.
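A minimal sketch of the per-step reward of equation (10): the completion time T_q of the current process minus the idle time between t_q and t_q+1. The busy_intervals argument is an assumed bookkeeping structure (a list of (start, end) spans in which the machine is working) and is not defined in the patent.

```python
def step_reward(T_q, t_q, t_q1, busy_intervals):
    # Time within [t_q, t_q1] during which the machine is actually busy.
    busy = sum(min(end, t_q1) - max(start, t_q)
               for start, end in busy_intervals
               if end > t_q and start < t_q1)
    idle = (t_q1 - t_q) - busy
    return T_q - idle  # equation (10): larger idle time gives a smaller reward
```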
S4: optimizing the initial reward function value to obtain the dense reward function value.
Preferably, this embodiment performs standard-deviation processing and a clipping operation on the reward value r_t output by the initial reward function at each step, where the cumulative reward value carries a sliding discount factor, so that the difficulty of fitting the agent caused by the scale of the reward values is avoided.
After the reward value r_t of the initial reward function is processed with the standard deviation of the discounted cumulative reward value, the agent obtains dense reward function values during training, and a shorter maximum completion time can therefore be reached. The algorithm comprises the following steps:
The reward value r_t output by the initial reward function at time t is read, and the cumulative reward value R_t at time t is calculated from the cumulative reward value R_t-1 at time t-1. The optimization formula for the initial reward function value is as follows:
R_t=γR_t-1+r_t (11)
Wherein: γ represents the discount factor, with γ=1 in this embodiment; R_t-1 represents the cumulative reward value at time t-1; R_t represents the cumulative reward value at time t; r_t represents the reward value output by the initial reward function at time t.
The cumulative reward value with the attached discount factor can be computed by sliding this formula forward.
Next, the mean and variance of the discounted reward values at time t are solved. The traditional way of computing the mean and variance needs to store the reward value at every time t and apply the standard formulas at the end, which wastes a large amount of memory. In this embodiment a recursive formula is used instead, as follows:
Wherein: A_t represents the mean of the initial reward function values at time t, and V_t represents the variance of the initial reward function values at time t;
Record: RS=(A_t,V_t)
In the present embodiment, A_t and V_t are initialized as A_t=0 and V_t=0;
According to formulas (12) and (13), RS is continuously updated to (A_t,V_t).
The immediate reward value r_t output by the shop work environment at time t is divided by the standard deviation before being input to the agent, as shown below:
r_tb=r_t/sqrt(V_t) (14)
Wherein: r_tb represents the reward value of the initial reward function after standard-deviation processing; sqrt denotes the square-root operation;
Finally, the standard-deviation-processed reward value of the initial reward function is clipped with a clamping function. The clipping limits r_tb to [-1,1], as follows:
r_ty=clip(r_tb,-1,1) (15)
Wherein: r_ty represents the dense reward function value.
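The sketch below strings steps (11), (14) and (15) together into one reward scaler. The recursive mean/variance update shown is a standard Welford-style recursion, used here only as one possible realization of the mean/variance step; the patent's own formulas (12)-(13) are not reproduced in the text above.

```python
class RewardScaler:
    """Keep R_t = gamma*R_{t-1} + r_t, track its mean/variance recursively,
    then divide the raw reward by the standard deviation and clip to [-1, 1]."""
    def __init__(self, gamma=1.0, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.R = 0.0      # discounted cumulative reward R_t, eq. (11)
        self.count = 0
        self.mean = 0.0   # running mean A_t
        self.var = 0.0    # running variance V_t

    def __call__(self, r_t):
        self.R = self.gamma * self.R + r_t          # eq. (11)
        self.count += 1
        delta = self.R - self.mean
        self.mean += delta / self.count             # recursive mean update
        self.var += (delta * (self.R - self.mean) - self.var) / self.count  # recursive variance update
        r_tb = r_t / (self.var ** 0.5 + self.eps)   # eq. (14)
        return max(-1.0, min(1.0, r_tb))            # eq. (15): clip to [-1, 1]
```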
In this embodiment, before step S5 is performed, a workpiece processing procedure priority selection policy is formulated:
The workpiece processing procedure priority selection strategy is as follows:
Assuming that workpiece p1 and workpiece p2 are simultaneously allocated to the same machine, that workpiece p1 has 1 process remaining before completion, and that workpiece p2 has f processes remaining before completion,
If f>1, then p1.state[0]=false and p2.state[0]=true;
Otherwise: p1.state[0]=true and p2.state[0]=false;
Wherein: p1.state[0] represents the state[0] of workpiece p1, and p2.state[0] represents the state[0] of workpiece p2.
In this embodiment, to obtain higher agent performance, the workpiece process priority selection strategy is adopted to reduce the action search space. Taking two arbitrary workpieces p1 and p2 in the scheduling process as an example, assume that p1 has f+1 processes in total and that p1 and p2 now need to be assigned to the same machine. p1 has already completed f processes and only its last process remains, while p2 still has several processes unfinished, so allocating the machine to p2 is the preferred choice. Therefore p1.state[0] is masked to false, the agent automatically ignores p1 when searching the action space, and the speed of interactive learning between the agent and the shop work environment is increased.
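A minimal sketch of this masking rule, assuming workpiece objects that expose remaining_ops and a mutable state vector (both names are introduced here for illustration):

```python
def apply_priority_mask(p1, p2):
    """p1 has exactly one process left; p2 has f processes left.
    Give the machine to the workpiece with more remaining work."""
    f = p2.remaining_ops
    if f > 1:
        p1.state[0], p2.state[0] = False, True   # mask p1 so the agent ignores it
    else:
        p1.state[0], p2.state[0] = True, False
```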
S5: establishing an improved near-end strategy optimization algorithm model, and acquiring the optimized serial number of the workpiece to be processed according to the shop processing information, the shop work environment state information and the dense reward function values.
The traditional policy gradient algorithm in reinforcement learning works in an on-policy online mode, i.e. the agent that is learning and the agent that interacts with the shop work environment are the same. This creates a problem: every time the parameters are updated after interacting with the shop work environment, their probability distribution changes, so previous data cannot be reused and training is very slow. With the near-end strategy optimization algorithm, on-policy learning can be changed to off-policy (offline) learning, so the data obtained from interaction with the shop work environment can be used several times and the training speed improves markedly. With off-policy learning, the strategy of the learning agent and that of the agent interacting with the shop work environment are different, so their probability distributions are different.
The improved near-end strategy optimization algorithm model is established as follows:
Using the importance sampling method from statistics, we obtain:
wherein: the optimization objective is the reward value expectation under the target parameter θ; E represents the expectation over the random variable within one training round; τ is the random variable of one training round; R(τ) represents the reward value distribution of a training round; p_θ(τ) represents the probability distribution of the learning trajectory, p_θ'(τ) represents the probability distribution of the sampling trajectory, and τ~p_θ(τ) means that the distribution function of the random variable τ is p_θ(τ). According to this formula, the probability distribution p_θ(τ) can be converted into the probability distribution p_θ'(τ) of the sampling trajectory, and the target parameter θ is trained from samples obtained by interaction between the sampling parameter θ' and the shop work environment. Thus, although the target parameter θ is continuously updated, the sampling parameter θ' is relatively fixed, so its samples can be used several times and data efficiency improves. In this embodiment θ and θ' must not differ too much, otherwise the two expectations differ greatly even though they are nominally equal.
In the present embodiment, the gradient ascent method is used to maximize the expected reward value, and the gradient can be written as:
wherein: the gradient function is written in continuous form; s_t represents the state at time t and a_t represents the action at time t; π_θ represents the policy with θ as the target parameter; A^θ(s_t,a_t) is the advantage function and A^θ'(s_t,a_t) is the advantage function in continuous form, indicating how advantageous it is for the agent to take action a_t in state s_t compared with other actions; p_θ(a_t|s_t) represents the probability that the target policy takes action a_t in state s_t. Replacing θ by θ' in the formula, i.e. using importance sampling to turn the target-policy probability distribution into the behaviour-policy probability distribution, leaves the meaning of the parameters unchanged.
As described above, θ and θ' in the present embodiment must not differ too much, so a clamping mechanism is added in the iteration of θ to keep θ and θ' within a certain range of each other.
The above gradient formula is written in discrete form as follows:
wherein: the advantage function and the gradient function are written in discrete form;
p_θk(a_t|s_t) represents the strategy that interacts with the shop work environment, referred to as the behaviour strategy; θ_k is kept unchanged during learning, while the target strategy p_θ(a_t|s_t) is continuously trained on the data obtained from interaction with the shop work environment, so θ is continuously updated during learning and θ_k is updated only after every several rounds. The ratio u_t(θ) is introduced into the formula to describe the difference between the target strategy and the behaviour strategy; clip is a clamping function meaning that u_t(θ) is limited to the range [1-ε, 1+ε], where ε is a hyper-parameter; this keeps the ratio of p_θ to p_θk as close to 1 as possible and ensures that the effects of θ and θ_k do not differ too much. When the advantage is positive, the action a_t at time t is encouraged, so the solid line rises slowly in the positive direction of the X axis but cannot exceed 1+ε, ensuring that θ does not drift too far; similarly, when the advantage is negative, the action a_t at time t is discouraged, so the solid line descends slowly in the negative direction of the X axis but cannot fall below 1-ε, again ensuring that θ does not drift too far.
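For reference, the sketch below shows a clipped surrogate loss of the kind described above, written with PyTorch. The ratio u_t(θ) is formed from log-probabilities of the target and behaviour policies, clamped to [1-ε, 1+ε] and combined with the advantage estimates; ε=0.2 is an assumed illustrative value, not the setting of Table 4.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective: maximize min(u_t*A, clip(u_t, 1-eps, 1+eps)*A)."""
    u_t = torch.exp(logp_new - logp_old)                                   # ratio u_t(theta)
    unclipped = u_t * advantages
    clipped = torch.clamp(u_t, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                           # negate to minimize
```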
Specifically, in this embodiment different hyper-parameters are set by a randomized grid search and training is performed, so that the reward values learned with and without the improved near-end strategy optimization algorithm can be compared, as shown in Fig. 2: the horizontal axis lists the cumulative reward values arranged from large to small, each bar represents the reward value obtained with one set of hyper-parameters, and the top 20 sets of data are shown; the vertical axis indicates the reward value. "original" indicates the reward value without the improvement enabled, and "norm_walk" indicates the reward value obtained after the dense-reward-function optimization is enabled. It can be seen that the improvement gives a clearly better training result, i.e. a shorter maximum completion time can be obtained.
For the 15×15 shop in this embodiment, after the agent is trained with the improved near-end strategy optimization algorithm, the Gantt chart obtained from an actual run is shown in Fig. 3:
The cumulative idle time (normalized) for each workpiece after the completion of the scheduling results is shown in table 2, with the minimum maximum completion time as an objective function.
Table 2 workshop per workpiece idle schedule of 15x15 in the example
Workpiece Idle time Workpiece Idle time
0 0.04215577 1 0.02990318
2 0.05466541 3 0.04361237
4 0.02767543 5 0.02698998
6 0.04763945 7 0.02964613
8 0.01585126 9 0.02690429
10 0.04404078 11 0.02656156
12 0.02176335 13 0.04335533
14 0.05869249
As can be seen from the data in table 2, the idle time is small during the operation of each process for each workpiece, the overall utilization of the machine is high, and the need to minimize the maximum usable time is met.
The running environment of the near-end optimization strategy algorithm adopted in the embodiment is shown in table 3:
table 3 list of learning algorithm running environments employed in this embodiment
Operating system macOS-11.4-x86-64bit
Processor Intel Core i7
Number of cores 6
Memory 16G
Python version 3.9.7
The super-parameter configuration based on the improved near-end policy optimization algorithm in this embodiment is shown in table 4:
Table 4 hyper-parameter configuration table in this embodiment
According to Table 4, even when the hyper-parameters are slightly adjusted, the training results converge stably; the influence on the training results is small and the robustness is high.
The whole training process of this embodiment takes 10 minutes; the specific indexes are shown in Figs. 4 and 5, where episode_reward_mean represents the average reward value of each round and total_loss represents the total loss value. It can be seen that the average reward value keeps increasing during training, the loss value keeps decreasing, and the reward value gradually converges. The loss value is already close to 0 after the 5th round, and the reward value is already close to 19 by round 23, so a good training effect is obtained.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (2)

1. The job shop scheduling method based on the improved near-end strategy optimization algorithm is characterized by comprising the following steps of:
S1: defining processing information of a job shop;
The processing information of the job shop in S1 further includes: the serial number q of the machine to be used for machining (q is less than or equal to m), and the machining time time_pq of the p-th workpiece on the q-th machine;
The processing information of the job shop is expressed in a matrix form as:
Wherein: p represents the serial number of the workpiece to be processed; n represents the total number of workpieces to be processed; m represents the total number of machines; N_p represents the p-th workpiece to be machined; M_pq denotes the machining of the p-th workpiece on the q-th machine;
s2: defining workshop operation environment state information;
the workshop operation environment state information in S2 includes:
state [0]: indicating whether the current workpiece can be executed;
wherein:
wherein: false indicates that the workpiece cannot be executed, true indicates that the workpiece can be executed;
state [1]: indicating the remaining time of the normalized current procedure;
wherein:
state[1]=max(0,time_left_current_op-difference)/max_time_op (3)
Wherein: time_left_current_op represents the remaining time of the current process; difference represents the time interval from the last state update under the current timestamp; max_time_op represents the maximum scheduling time in all procedures;
state [2]: representing the percentage of the current process to be performed;
wherein:
state[2]=time_step_job/machines (4)
Wherein: time_step_job represents the sequence number currently being executed; the machines represents the number of machines required for the current workpiece;
state [3]: indicating the total remaining time of the normalized current workpiece;
wherein:
state[3]=total_perform_op_time_jobs/max_time_jobs (5)
total_perform_op_time_jobs=total_perform_op_time_jobs_o+min(difference,time_left_current_op) (6)
Wherein: total_perform_op_time_jobs represents the total remaining time of the current workpiece; max_time_jobs represents the maximum cumulative schedule time among all the workpieces; total_perform_op_time_jobs_o represents the total remaining time of the last machined workpiece;
state [4]: indicating the machine normalization available time required by the next procedure;
wherein:
state[4]=max(0,time_until_available_machine–difference)/max_time_op(7)
wherein: time_until_available_machine represents the required machine availability time;
state [5]: indicating normalized idle time after completion of the previous process
Wherein:
state[5]=(difference-time_left_current_op)/sum_op (8)
Wherein: sum_op represents the cumulative scheduling time of all working procedures of all workpieces;
state [6]: representing normalized cumulative idle time;
state[6]=old_state[6]+state[5] (9)
wherein: old_state [6] represents normalized accumulated idle time after the last process is completed;
S3: defining a reward function based on a scheduling target and a time stamp, and acquiring an initial reward function value;
The reward function in S3 is:
R(t)=T_q-idle(t_q,t_q+1) (10)
Wherein: R(t) represents the reward function; t_q denotes the moment when the q-th process of the p-th workpiece starts to be executed; t_q+1 denotes the moment when the (q+1)-th process of the p-th workpiece starts to be executed; idle represents a function of the idle time from t_q to t_q+1; T_q represents the completion time required by the current process;
S4: optimizing the initial reward function value to obtain a dense reward function value;
In S4, the process of optimizing the initial reward function value is as follows:
R_t=γR_t-1+r_t (11)
Wherein: γ represents the discount factor; R_t-1 represents the cumulative reward value at time t-1; R_t represents the cumulative reward value at time t; r_t represents the initial reward function value output at time t;
Solving the mean and variance of the discounted initial reward function values at time t:
Wherein: A_t represents the mean of the initial reward function values at time t, and V_t represents the variance of the initial reward function values at time t;
Initializing A_t and V_t;
The standard-deviation processing of the initial reward function value r_t output at time t is performed as follows:
r_tb=r_t/sqrt(V_t) (14)
Wherein: r_tb represents the initial reward function value after standard-deviation processing; sqrt denotes the square-root operation;
Clipping the standard-deviation-processed initial reward function value to obtain the dense reward function value as follows:
r_ty=clip(r_tb,-1,1) (15)
Wherein: r_ty denotes the dense reward function value;
S5: establishing an improved near-end strategy optimization algorithm model, and acquiring the optimized serial number of the workpiece to be processed according to the shop processing information, the shop work environment state information and the dense reward function values;
In the step S5, the improved near-end strategy optimization algorithm model is established as follows:
wherein: the optimization objective is the reward value expectation under the target parameter θ; E represents the expectation over the random variable within one training round; τ is the random variable of one training round; R(τ) represents the reward value distribution of a training round; p_θ(τ) represents the probability distribution of the learning trajectory, p_θ'(τ) represents the probability distribution of the sampling trajectory, and τ~p_θ(τ) means that the distribution function of τ is p_θ(τ);
wherein: the gradient function is written in continuous form; s_t represents the state at time t and a_t represents the action at time t; π_θ represents the policy with θ as the target parameter; A^θ(s_t,a_t) is the advantage function and A^θ'(s_t,a_t) is the advantage function in continuous form; p_θ(a_t|s_t) represents the probability that the target policy takes action a_t in state s_t;
the above gradient formula is written in discrete form as follows:
wherein: the advantage function and the gradient function are written in discrete form;
p_θk(a_t|s_t) represents the strategy that interacts with the shop work environment; clip is a clamping function and ε represents a hyper-parameter.
2. The job shop scheduling method based on the improved near-end policy optimization algorithm according to claim 1, wherein: before the step S5, the method further comprises the following steps:
If the working procedures of two workpieces are started at the same time under the same time stamp, a workpiece processing working procedure priority selection strategy is formulated; the workpiece processing procedure priority selection strategy is as follows:
Assuming that workpiece p1 and workpiece p2 are simultaneously allocated to the same machine, that workpiece p1 has 1 process remaining before completion, and that workpiece p2 has f processes remaining before completion,
If f>1, then p1.state[0]=false and p2.state[0]=true;
Otherwise: p1.state[0]=true and p2.state[0]=false;
Wherein: p1.state[0] represents the state[0] of workpiece p1, and p2.state[0] represents the state[0] of workpiece p2.
CN202210255402.0A 2022-03-15 2022-03-15 Job shop scheduling method based on improved near-end strategy optimization algorithm Active CN114625089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210255402.0A CN114625089B (en) 2022-03-15 2022-03-15 Job shop scheduling method based on improved near-end strategy optimization algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210255402.0A CN114625089B (en) 2022-03-15 2022-03-15 Job shop scheduling method based on improved near-end strategy optimization algorithm

Publications (2)

Publication Number Publication Date
CN114625089A 2022-06-14
CN114625089B 2024-05-03

Family

ID=81902969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210255402.0A Active CN114625089B (en) 2022-03-15 2022-03-15 Job shop scheduling method based on improved near-end strategy optimization algorithm

Country Status (1)

Country Link
CN (1) CN114625089B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109270904A (en) * 2018-10-22 2019-01-25 中车青岛四方机车车辆股份有限公司 A kind of flexible job shop batch dynamic dispatching optimization method
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN112734172A (en) * 2020-12-25 2021-04-30 南京理工大学 Hybrid flow shop scheduling method based on time sequence difference
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113792924A (en) * 2021-09-16 2021-12-14 郑州轻工业大学 Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network


Also Published As

Publication number Publication date
CN114625089A (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN112734172B (en) Hybrid flow shop scheduling method based on time sequence difference
CN113792924A (en) Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
CN111985672B (en) Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
CN114565247B (en) Workshop scheduling method, device and system based on deep reinforcement learning
Yamazaki et al. Design method of material handling systems for lean automation—Integrating equipment for reducing wasted waiting time
CN116500986A (en) Method and system for generating priority scheduling rule of distributed job shop
CN115640898A (en) Large-scale flexible job shop scheduling method based on DDQN algorithm
CN114625089B (en) Job shop scheduling method based on improved near-end strategy optimization algorithm
CN107357267B (en) The method for solving mixed production line scheduling problem based on discrete flower pollination algorithm
CN115600504A (en) Numerical control grinding machine grinding parameter optimization method based on improved reinforcement learning algorithm
CN107437121B (en) Production process control method suitable for simultaneously processing single workpiece by multiple machines
CN111814359A (en) Discrete manufacturing-oriented integrated workshop scheduling and assembly sequence planning method
CN109034540B (en) Machine tool sequence arrangement dynamic prediction method based on work-in-process flow
CN117808246A (en) Flexible job shop scheduling method, device and system
CN113506048A (en) Flexible job shop scheduling method
CN113050644A (en) AGV (automatic guided vehicle) scheduling method based on iterative greedy evolution
CN117314055A (en) Intelligent manufacturing workshop production-transportation joint scheduling method based on reinforcement learning
CN115793583B (en) New order insertion optimization method for flow shop based on deep reinforcement learning
CN116562584A (en) Dynamic workshop scheduling method based on Conv-lasting and generalization characterization
CN114219274A (en) Workshop scheduling method adapting to machine state based on deep reinforcement learning
CN110705844A (en) Robust optimization method of job shop scheduling scheme based on non-forced idle time
CN113657742B (en) Workshop scheduling method and device, electronic equipment and storage medium
CN110826909B (en) Workflow execution method based on rule set
CN117935981B (en) Intelligent evaluation method and device for oil removal effect of fabric oil removal agent
CN110187673A (en) A kind of RGV dynamic dispatching method of Non-Linear Programming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant