CN112488315B - Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm - Google Patents


Info

Publication number
CN112488315B
CN112488315B CN202011373229.1A CN202011373229A
Authority
CN
China
Prior art keywords
network
batch
genetic algorithm
workpiece
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011373229.1A
Other languages
Chinese (zh)
Other versions
CN112488315A (en)
Inventor
谭琦
贾铖钰
余荣坤
孙晨皓
唐昊
夏田林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202011373229.1A priority Critical patent/CN112488315B/en
Publication of CN112488315A publication Critical patent/CN112488315A/en
Application granted granted Critical
Publication of CN112488315B publication Critical patent/CN112488315B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/12 Computing arrangements based on biological models using genetic models
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Development Economics (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the field of production and manufacturing scheduling and discloses a batch scheduling optimization method based on deep reinforcement learning and a genetic algorithm, comprising the following steps: establishing a mathematical model of the batch scheduling problem with non-identical workpieces; establishing a policy model of the problem using a pointer network; training the pointer network model with an actor-critic algorithm; defining and initializing the parameters of a genetic algorithm; optimizing the initial population of the genetic algorithm with the trained pointer network; further optimizing the scheduling scheme with the genetic algorithm; and using the optimal scheme obtained by the genetic algorithm as the production scheme by which a batch processor processes the workpieces. Compared with traditional heuristic algorithms, the pointer network in the invention obtains better solutions; in addition, a novel crossover mode is proposed for the crossover operation of the genetic algorithm, so that the optimization capability of the genetic algorithm can further improve the scheduling scheme obtained by the pointer network.

Description

Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm
Technical Field
The invention belongs to the field of production and manufacturing scheduling, and particularly relates to a batch scheduling optimization method based on deep reinforcement learning and a genetic algorithm.
Background
The batch scheduling problem stems from the burn-in operation used for final testing in the semiconductor manufacturing industry. In this operation, integrated circuits are placed in a high-temperature oven in batches and tested over a long period of time to detect failures that may occur early in their life. Burn-in is often a bottleneck in semiconductor manufacturing because, within final testing, its processing time is typically much longer than that of other operations. Therefore, scheduling the ovens (or machines) efficiently so as to greatly increase their utilization is important. The batch scheduling problem now exists not only in the semiconductor industry but also widely in most manufacturing industries, such as foundries, furniture manufacturing, metal processing, aviation, pharmaceuticals, and logistics and freight. For most manufacturers, a well-designed scheduling strategy is an effective way to improve production efficiency and reduce production cost. Research on the batch scheduling problem therefore has important practical significance for improving the level of production management and obtaining higher economic benefit.
In recent years, deep neural networks that learn from data have been able to discover the characteristics of a problem and can therefore be used to solve it, providing a new direction for combinatorial optimization. However, little research has applied deep neural networks to production and manufacturing scheduling, and they have not yet been applied to the batch scheduling problem with non-identical workpieces.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a batch scheduling optimization method that minimizes the total manufacturing span when the workpieces to be processed differ in size and processing time.
The invention adopts the following technical scheme for solving the technical problems:
the invention relates to a batch scheduling optimization method based on deep reinforcement learning and a genetic algorithm, which comprises the following steps:
step I, establishing a mathematical model of the batch scheduling problem with non-identical workpieces;
The batch scheduling problem with non-identical workpieces is defined as follows: a workpiece set J = {1, 2, …, n}, where the processing time of workpiece j is p_j and its size is s_j; the capacity of the batch processor is C, and the batch processor can process several workpieces simultaneously provided the capacity constraint is satisfied; the set of batches to be processed is K, where the processing time P_k of batch k equals the maximum processing time of the workpieces in batch k. X_jk is a decision variable: X_jk = 1 if workpiece j is assigned to batch k, otherwise X_jk = 0. Y_k is a decision variable: Y_k = 1 if batch k is created, otherwise Y_k = 0.
According to the above definition, the batch scheduling problem for workpieces with different sizes on a single machine can be established as the following mathematical model:
an objective function:
\min \; C_{\max} = \sum_{k \in K} P_k Y_k \qquad (1)
constraint conditions are as follows:
\sum_{k \in K} X_{jk} = 1, \quad \forall j \in J \qquad (2)
\sum_{j \in J} s_j X_{jk} \le C\,Y_k, \quad \forall k \in K \qquad (3)
P_k \ge p_j X_{jk}, \quad \forall j \in J,\ \forall k \in K \qquad (4)
X_{jk} \in \{0,1\}, \quad \forall j \in J,\ \forall k \in K \qquad (5)
Y_k \in \{0,1\}, \quad \forall k \in K \qquad (6)
P_k \ge 0, \quad \forall k \in K \qquad (7)
step II, establishing a strategy model of the problem by adopting a pointer network;
step III, training the pointer network model with an actor-critic algorithm;
step IV, defining and initializing the parameters of the genetic algorithm: population size PopNum, maximum number of iterations T_GA, and the number n of workpieces; the number of completed iterations t_GA = 0;
V, optimizing an initial population of the genetic algorithm by using the pointer network trained in the step III;
step VI, solving the problem by adopting a genetic algorithm;
and step VII, using the optimal scheme obtained by the genetic algorithm as a production scheme for processing the workpiece by a batch processor.
Preferably, the step II of establishing the policy model of the problem by using the pointer network mainly includes the following steps:
the pointer network model is defined as follows, n represents the length of the encoder and decoder, and X = { X = { (X) } 1 ,x 2 ,…,x n Denotes a coded input workpiece information sequence, where x is input arbitrarily j All have x j =[s j ,p j ] T ,s j And p j Respectively showing the size and the processing time of the jth workpiece. e = { e = 1 ,e 2 ,…,e n Denotes the encoder's hidden layer state sequence, d = { d = } 1 ,d 2 ,…,d n Denotes the implicit layer state sequence of the decoder, y = { y = } 1 ,y 2 ,…,y n Denotes the final output sequence of the pointer network.
Step i, constructing the encoding-layer network of the pointer network, which consists of a fully connected network layer and an RNN (Recurrent Neural Network) with an LSTM (Long Short-Term Memory) module;
step ii, constructing a decoding layer network of the pointer network, wherein the decoder network is composed of an RNN with an LSTM module;
step iii, introducing an attention mechanism by which the pointer network selects and orders the workpieces in the output sequence; when the t-th workpiece is to be added to the output sequence, the selection probability of the remaining workpieces is calculated as follows:
u_t^j = v^{T} \tanh\left( W_1 e_j + W_2 d_t \right), \quad j = 1, 2, \ldots, n \qquad (8)
A(e, d_t; W_1, W_2, v) = \mathrm{softmax}(u_t) \qquad (9)
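As a concrete illustration of equations (8) and (9), the following is a minimal sketch of the pointer attention step in PyTorch; the tensor shapes, the masking of already-scheduled workpieces, and the parameter names (W1, W2, v) are assumptions chosen to mirror the notation above, not the inventors' implementation.

```python
# Sketch of the pointer-network attention of equations (8)-(9) in PyTorch.
# Shapes, module names, and the masking step are illustrative assumptions.
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to encoder states e_j
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to decoder state d_t
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, enc_states, dec_state, mask):
        # enc_states: (batch, n, hidden), dec_state: (batch, hidden)
        # mask: (batch, n) boolean, True for workpieces already placed in the output sequence
        u = self.v(torch.tanh(self.W1(enc_states) + self.W2(dec_state).unsqueeze(1))).squeeze(-1)  # eq (8)
        u = u.masked_fill(mask, float("-inf"))   # forbid re-selecting scheduled workpieces (assumption)
        return torch.softmax(u, dim=-1)          # eq (9): selection probabilities

if __name__ == "__main__":
    att = PointerAttention(hidden_dim=128)
    e = torch.randn(2, 10, 128)                  # encoder states for n = 10 workpieces
    d = torch.randn(2, 128)                      # current decoder state d_t
    mask = torch.zeros(2, 10, dtype=torch.bool)
    print(att(e, d, mask).shape)                 # torch.Size([2, 10])
```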
Preferably, the training of the pointer network model with the actor-critic algorithm in step III mainly includes the following steps:
step i, an actor network model adopted by the problem;
The model of the actor network is the pointer network model established in step II.
Step ii, establishing a critic network model of the problem;
the structure of the criticic network consists of an encoder (RNN with LSTM module), an LSTM processing module and a decoder of a 2-layer ReLU fully-connected neural network.
Step iii, defining and initializing the number of samples B in one training batch, the total number D of training-set samples, the number of iterations per epoch T_PTR = D/B, the actor network parameters θ_a, the critic network parameters θ_c, and the number of training epochs E; the initial number of completed epochs epoch = 0, the initial iteration count t_PTR = 0, and the number of trained batch samples i = 0;
step iv, passing the workpiece information sequence x_i of the current batch of samples through the actor network to obtain the output sequence y_i;
step v, passing the workpiece information sequence x_i of the current batch of samples through the critic network to obtain the corresponding baseline value b_i;
Step vi, making i = i +1, judging whether i < B is true, if so, skipping to execute the step iv, otherwise, skipping to the step vii;
step vii, letting i = 0 and solving the loss value of the actor network with a Monte Carlo sampling approximation of the reinforcement learning objective, calculated as follows:
L_{actor}(\theta_a) = \frac{1}{B} \sum_{i=1}^{B} \left( R(y_i) - b_i \right) \log p_{\theta_a}(y_i \mid x_i) \qquad (10)
step viii, using the mean square error as the loss value of the critic network, calculated as follows:
L_{critic}(\theta_c) = \frac{1}{B} \sum_{i=1}^{B} \left( R(y_i) - b_i \right)^2 \qquad (11)
step ix, optimizing the parameters of the actor network and the critic network with the Adam algorithm;
step x, letting t_PTR = t_PTR + 1 and judging whether t_PTR < T_PTR holds; if so, jump to step iv; otherwise, execute step xi;
step xi, letting t_PTR = 0 and epoch = epoch + 1, and judging whether epoch < E holds; if so, execute step iv; otherwise, jump to step xii;
step xii, obtaining the trained actor network model;
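A compressed sketch of one training iteration covering steps iv to ix is given below, assuming PyTorch and treating the actor and critic as black boxes that return, respectively, the log-probability of the sampled schedule and a baseline estimate; names such as actor, critic, and reward are placeholders, and the loss expressions follow equations (10) and (11).

```python
# Sketch of one actor-critic update (steps iv-ix), with placeholder networks.
# `reward(y, x)` is assumed to return the manufacturing span of schedule y for instance x.
import torch

def train_step(actor, critic, actor_opt, critic_opt, batch_x, reward):
    # Step iv: decode schedules and their log-probabilities with the actor (pointer network).
    y, log_prob = actor(batch_x)                  # log_prob: (B,)
    # Step v: baseline values from the critic.
    b = critic(batch_x).squeeze(-1)               # (B,)
    with torch.no_grad():
        R = reward(y, batch_x)                    # (B,) manufacturing spans
    advantage = R - b.detach()
    # Step vii, eq (10): Monte Carlo policy-gradient loss for the actor.
    actor_loss = (advantage * log_prob).mean()
    # Step viii, eq (11): mean-squared-error loss for the critic.
    critic_loss = torch.nn.functional.mse_loss(b, R)
    # Step ix: Adam updates for both networks.
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```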
preferably, the step V of optimizing the initial population of the genetic algorithm by using the trained pointer network mainly includes the following steps:
step i, generating one individual in the population using real-number encoding and the LPT (Longest Processing Time) heuristic rule;
step ii, generating PopNum-1 individuals in the population by adopting a triangular fuzzy number mode;
step iii, obtaining a new population from the individuals in the population through a pointer network;
and iv, sequencing all individuals in the two populations in an ascending order according to the fitness value, and taking the front PopNum individuals as an initial population of the genetic algorithm.
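A hedged sketch of the initial-population construction in steps i to iv above follows; the LPT individual, the triangular-fuzzy perturbation, the first-fit decoding used as the fitness, and the way the trained pointer network re-orders individuals are simplified assumptions for illustration, not the exact procedure of the invention.

```python
# Sketch of steps i-iv: build the GA's initial population from an LPT individual,
# triangular-fuzzy variants, and pointer-network-improved copies.  The first-fit
# decoding and the fuzzy perturbation are illustrative assumptions.
import random

def first_fit_makespan(order, sizes, times, capacity):
    batches = []          # each batch: [residual capacity, batch processing time]
    for j in order:
        for b in batches:
            if sizes[j] <= b[0]:
                b[0] -= sizes[j]
                b[1] = max(b[1], times[j])
                break
        else:
            batches.append([capacity - sizes[j], times[j]])
    return sum(b[1] for b in batches)

def initial_population(sizes, times, capacity, pop_num, pointer_net):
    n = len(sizes)
    lpt = sorted(range(n), key=lambda j: -times[j])                      # step i: LPT individual
    pop = [lpt]
    for _ in range(pop_num - 1):                                         # step ii: fuzzy variants
        keys = [random.triangular(0.8, 1.2) * times[j] for j in range(n)]
        pop.append(sorted(range(n), key=lambda j: -keys[j]))
    improved = [pointer_net(ind) for ind in pop]                         # step iii: pointer network
    everyone = pop + improved                                            # step iv: keep best PopNum
    everyone.sort(key=lambda ind: first_fit_makespan(ind, sizes, times, capacity))
    return everyone[:pop_num]
```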
Preferably, the step VI of further solving the problem by using a genetic algorithm mainly comprises the following steps:
step i, selecting PopNum parent individuals according to a roulette mode;
step ii, combining all the parent individuals pairwise and generating child individuals with the improved multi-point crossover;
step iii, performing mutation on all the child individuals with single-point mutation;
step iv, letting t_GA = t_GA + 1 and calculating the fitness values of all individuals;
step v, judging whether t_GA < T_GA holds; if so, jump to step i; otherwise, execute step vi;
and step vi, finishing the algorithm, and outputting the optimal scheduling scheme.
Preferably, the step ii of generating children with the improved multi-point crossover mainly comprises the following steps:
step a, initializing the child to be empty, selecting Parent1 and Parent2 to be crossed, letting the currently inherited parent be parent = Parent1 and the gene-copying start position Index = 0, and randomly generating the number num of genes to copy, where num ranges from 1 to n;
step b, starting from the position with subscript Index in parent, searching to the left and to the right for num genes that are not yet in the child; genes that already exist in the child are skipped; if the search subscript reaches a boundary, the search stops in that direction; if all genes have already been copied, go directly to step d;
step c, copying the gene segments found in step b into the child in the order in which they appear in parent;
step d, judging whether the child has copied all the genes of the parents; if so, jump to step f; otherwise, execute step e;
step e, letting parent be the other parent and Index be the value of the last gene of the current child, regenerating the number num of genes to copy, and jumping to step b;
step f, obtaining the crossed child individual.
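The following is a sketch of the improved multi-point crossover under steps a to f above; the interpretation of "Index = value of the last gene of the current child", the alternating left/right search order, and the clamping of Index into range are assumptions of this sketch, and genes are taken as workpiece indices 0..n-1.

```python
# Sketch of the improved multi-point crossover (steps a-f).  Interpretation details
# (search order, Index handling) are assumptions; parents are permutations of 0..n-1.
import random

def improved_crossover(parent1, parent2):
    n = len(parent1)
    child, in_child = [], set()
    parent, index = parent1, 0                           # step a
    num = random.randint(1, n)
    while len(child) < n:                                # repeat until all genes are copied
        # Step b: search left and right from `index` for `num` genes not yet in the child.
        picked = []
        left, right = index, index + 1
        while len(picked) < num and (left >= 0 or right < n):
            if left >= 0:
                if parent[left] not in in_child:
                    picked.append((left, parent[left]))
                left -= 1
            if len(picked) < num and right < n:
                if parent[right] not in in_child:
                    picked.append((right, parent[right]))
                right += 1
        # Step c: copy the found genes in the order they appear in the parent.
        for _, gene in sorted(picked):
            child.append(gene)
            in_child.add(gene)
        if len(child) >= n:                              # step d: child complete
            break
        # Step e: switch to the other parent; restart from the last copied gene's value.
        parent = parent2 if parent is parent1 else parent1
        index = child[-1] % n
        num = random.randint(1, n)
    return child                                         # step f

if __name__ == "__main__":
    random.seed(0)
    print(improved_crossover([0, 1, 2, 3, 4, 5], [5, 3, 1, 0, 2, 4]))
```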
Intelligent manufacturing is of great significance for improving the overall competitiveness of China's manufacturing industry and for realizing its strategic transformation from large to strong. Promoting the development and application of industrial big data and giving full play to its role in manufacturing is one of the main directions of intelligent manufacturing. The invention takes the batch production scheduling problem with non-identical workpieces as its research object and designs a new optimization method using deep reinforcement learning and related technologies. Compared with the prior art, the invention has the following beneficial effects:
1. For the batch scheduling problem with non-identical workpieces, the invention takes the maximum completion time as the objective and trains the pointer network model with an actor-critic algorithm; the trained pointer network model can obtain a high-quality scheduling policy in a short time.
2. The pointer network model obtained by training has good generalization capability: a pointer network model trained on small-scale problems can also effectively solve large-scale problems. Therefore, in actual production, when the available preparation time is limited, the time spent on training can be reduced by training the pointer network model on small-scale problems; when the preparation time is sufficient, the optimization performance of the model can be further improved by training it fully.
3. Compared with existing heuristic rules, the pointer network achieves comparable solution time but better solution quality. Therefore, for production scenarios that require highly real-time scheduling policies, the pointer network trained in step III can be used directly to obtain a scheduling scheme.
4. The invention provides a novel crossover mode for the crossover operation of the genetic algorithm, so that, starting from the scheduling scheme obtained by the pointer network, the optimization capability of the genetic algorithm can further improve the performance of the scheme.
Drawings
FIG. 1 is a flow chart of pointer network model training.
FIG. 2 is a flow chart of the PTR-GA optimization algorithm.
Fig. 3 is a schematic diagram of an improved crossover mode of operation.
FIG. 4 is a graph comparing solving performance of a pointer network and a BFLPT heuristic algorithm.
FIG. 5 is a pointer network generalization capability diagram.
FIG. 6 is a comparison graph of solving performance of PTR-GA and GA.
Detailed Description
To more clearly illustrate the objects, aspects and advantages of the present invention, the present invention will be described in further detail below with reference to the accompanying drawings, in which only some specific embodiments are shown.
For workpiece sequences with non-identical sizes and processing times, the application provides a batch scheduling optimization method based on deep reinforcement learning and a genetic algorithm with the objective of minimizing the total manufacturing span; the simplified problem is described as follows:
(1) Workpiece set J = {1, 2, …, n}, where the machining time of workpiece j is p_j and its size is s_j.
(2) The batch processor has a machine capacity of C; the workpieces of set J are processed in batches, and the total size of the workpieces in each batch to be processed must not exceed C.
(3) The set of batches to be processed is K, where the processing time of batch k is P_k, the maximum processing time over all workpieces in batch k.
(4) The manufacturing span is the sum of the processing times of all processing batches.
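As an illustration of points (1) to (4), the following is a minimal sketch that checks the capacity constraint of a given batching and computes its manufacturing span; the instance data and function names are assumptions chosen for the example, not part of the patented method.

```python
# Minimal sketch: evaluate a candidate batching under the problem definition above.
# Instance values and helper names are illustrative assumptions.

def makespan(batches, sizes, times, capacity):
    """Return the total manufacturing span of a batching, or raise if infeasible.

    batches  -- list of batches, each batch a list of workpiece indices
    sizes    -- sizes s_j of the workpieces
    times    -- processing times p_j of the workpieces
    capacity -- machine capacity C of the batch processor
    """
    span = 0
    for batch in batches:
        if sum(sizes[j] for j in batch) > capacity:
            raise ValueError(f"batch {batch} violates the capacity constraint")
        # A batch takes as long as its longest workpiece (point (3)).
        span += max(times[j] for j in batch)
    return span

if __name__ == "__main__":
    sizes = [4, 5, 6, 8, 4]      # s_j
    times = [3, 7, 2, 10, 5]     # p_j
    C = 10
    batches = [[0, 1], [2, 4], [3]]
    print(makespan(batches, sizes, times, C))  # 7 + 5 + 10 = 22
```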
As shown in fig. 1,2 and 3, based on the introduction of the above problem, the batch scheduling optimization method based on deep reinforcement learning and genetic algorithm provided by this embodiment includes the following steps:
step I, establishing a mathematical model of the batch scheduling problem with non-identical workpieces;
The batch scheduling problem with non-identical workpieces is defined as follows: a workpiece set J = {1, 2, …, n}, where the processing time of workpiece j is p_j and its size is s_j; the capacity of the batch processor is C, and the batch processor can process several workpieces simultaneously provided the capacity constraint is satisfied; the set of batches to be processed is K, where the processing time P_k of batch k equals the maximum processing time of the workpieces in batch k. X_jk is a decision variable: X_jk = 1 if workpiece j is assigned to batch k, otherwise X_jk = 0. Y_k is a decision variable: Y_k = 1 if batch k is created, otherwise Y_k = 0.
According to the above definition, the batch scheduling problem for workpieces with different sizes on a single machine can be established as the following mathematical model:
an objective function:
\min \; C_{\max} = \sum_{k \in K} P_k Y_k \qquad (1)
constraint conditions are as follows:
\sum_{k \in K} X_{jk} = 1, \quad \forall j \in J \qquad (2)
\sum_{j \in J} s_j X_{jk} \le C\,Y_k, \quad \forall k \in K \qquad (3)
P_k \ge p_j X_{jk}, \quad \forall j \in J,\ \forall k \in K \qquad (4)
X_{jk} \in \{0,1\}, \quad \forall j \in J,\ \forall k \in K \qquad (5)
Y_k \in \{0,1\}, \quad \forall k \in K \qquad (6)
P_k \ge 0, \quad \forall k \in K \qquad (7)
Equation (1) states that the optimization objective of the model is to minimize the manufacturing span; equation (2) states that each workpiece is assigned to exactly one batch; equation (3) states that the total size of the workpieces processed in a batch cannot exceed the machine capacity of the batch processor; equation (4) states that the processing time of a batch is not less than the processing time of any workpiece processed in that batch; equations (5) to (7) are the basic constraints of the problem.
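For readers who wish to verify the formulation on small instances, the following is a hedged sketch of the model in PuLP, an off-the-shelf MILP library; its use here, the instance data, and the variable names are illustration assumptions, not part of the invention.

```python
# Sketch of the mathematical model (1)-(7) for a small instance using PuLP.
# Assumes `pip install pulp`; instance data are illustrative.
import pulp

p = [3, 7, 2, 10, 5]          # processing times p_j
s = [4, 5, 6, 8, 4]           # sizes s_j
C = 10                        # machine capacity
J = range(len(p))
K = range(len(p))             # at most n batches are ever needed

m = pulp.LpProblem("single_machine_batch_scheduling", pulp.LpMinimize)
X = pulp.LpVariable.dicts("X", (J, K), cat="Binary")           # X_jk
Y = pulp.LpVariable.dicts("Y", K, cat="Binary")                # Y_k
P = pulp.LpVariable.dicts("P", K, lowBound=0)                  # P_k

# Objective (1): since P_k is minimized and forced to 0 for unused batches,
# sum(P_k) equals sum(P_k * Y_k) at the optimum (a standard linearization).
m += pulp.lpSum(P[k] for k in K)
for j in J:
    m += pulp.lpSum(X[j][k] for k in K) == 1                   # constraint (2)
for k in K:
    m += pulp.lpSum(s[j] * X[j][k] for j in J) <= C * Y[k]     # constraint (3)
    for j in J:
        m += P[k] >= p[j] * X[j][k]                            # constraint (4)

m.solve(pulp.PULP_CBC_CMD(msg=False))
print("makespan:", pulp.value(m.objective))
```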
And step II, establishing a strategy model of the problem by adopting a pointer network, wherein the specific construction mode of the pointer network is as follows:
the pointer network model is defined as follows, n represents the length of the encoder and decoder, and X = { X = { (X) } 1 ,x 2 ,…,x n Denotes a coded input workpiece information sequence in which x is input arbitrarily j All have x j =[s j ,p j ] T ,s j And p j Respectively showing the size and the processing time of the jth workpiece. e = { e = 1 ,e 2 ,…,e n Denotes the encoder's hidden layer state sequence, d = { d = } 1 ,d 2 ,…,d n Denotes the implicit layer state sequence of the decoder, y = { y = } 1 ,y 2 ,…,y n Denotes the final output sequence of the pointer network.
I, constructing an encoding layer network of the pointer network, wherein the encoder network consists of a full-connection network layer and an RNN with an LSTM module;
step ii, constructing a decoding layer network of the pointer network, wherein the decoder network is composed of an RNN with an LSTM module;
and step iii, introducing an attention mechanism for selecting and sequencing the workpieces in the output sequence by the pointer network, wherein when the t-th workpiece is added to the output sequence, the selected probability of the remaining workpieces is calculated as follows:
u_t^j = v^{T} \tanh\left( W_1 e_j + W_2 d_t \right), \quad j = 1, 2, \ldots, n \qquad (8)
A(e, d_t; W_1, W_2, v) = \mathrm{softmax}(u_t) \qquad (9)
Step III, training the pointer network model with an actor-critic algorithm; the specific training steps of the algorithm are as follows:
step i, establishing the actor network model adopted for the problem;
The model of the actor network is the pointer network model established in step II.
Step ii, establishing a critic network model of the problem;
the structure of the criticic network consists of an encoder (RNN with LSTM module), an LSTM processing module and a decoder of a 2-layer ReLU fully-connected neural network.
Step iii, defining and initializing sample number B in one training, and trainingTotal number of samples D, number of iterations T PTR = D/B, actor network parameter θ a Critic network parameter θ c And training times E, wherein the initial training times epoch =0 and the initial iteration times t PTR =0, number of trained batch samples i =0;
step iv, passing the workpiece information sequence x_i of the current batch of samples through the actor network to obtain the output sequence y_i;
step v, passing the workpiece information sequence x_i of the current batch of samples through the critic network to obtain the corresponding baseline value b_i;
Step vi, letting i = i +1, judging whether i < B is true, if so, skipping to execute step iv, otherwise, skipping to step vii;
step vii, letting i = 0 and solving the loss value of the actor network with a Monte Carlo sampling approximation of the reinforcement learning objective, calculated as follows:
L_{actor}(\theta_a) = \frac{1}{B} \sum_{i=1}^{B} \left( R(y_i) - b_i \right) \log p_{\theta_a}(y_i \mid x_i) \qquad (10)
step viii, using the mean square error as the loss value of the critic network, calculated as follows:
L_{critic}(\theta_c) = \frac{1}{B} \sum_{i=1}^{B} \left( R(y_i) - b_i \right)^2 \qquad (11)
step ix, optimizing the parameters of the actor network and the critic network with the Adam algorithm;
step x, letting t_PTR = t_PTR + 1 and judging whether t_PTR < T_PTR holds; if so, jump to step iv; otherwise, execute step xi;
step xi, letting t_PTR = 0 and epoch = epoch + 1, and judging whether epoch < E holds; if so, execute step iv; otherwise, jump to step xii;
step xii, obtaining the trained actor network model;
Step IV, defining and initializing the parameters of the genetic algorithm: population size PopNum, maximum number of iterations T_GA, and the number n of workpieces; the number of completed iterations t_GA = 0;
And V, optimizing an initial population of the genetic algorithm by using the pointer network trained in the step III, wherein the generation mode of the initial population comprises the following steps:
step i, generating an individual in the population by adopting a real number coding mode and an LPT heuristic rule;
step ii, generating PopNum-1 individuals in the population by adopting a triangular fuzzy number mode;
step iii, obtaining a new population from the individuals in the population through a pointer network;
and iv, sequencing all the individuals in the two populations in an ascending order according to the fitness value, and taking the first PopNum individuals as an initial population of the genetic algorithm.
Step VI, solving the problem by adopting a genetic algorithm, wherein the specific solving steps are as follows:
step i, selecting PopNum parent individuals according to a roulette mode;
step ii, combining all parent individuals pairwise and generating child individuals with the improved multi-point crossover, whose specific content is as follows:
step a, initializing the child to be empty, selecting Parent1 and Parent2 to be crossed, letting the currently inherited parent be parent = Parent1 and the gene-copying start position Index = 0, and randomly generating the number num of genes to copy, where num ranges from 1 to n;
step b, starting from the position with subscript Index in parent, searching to the left and to the right for num genes that are not yet in the child; genes that already exist in the child are skipped; if the search subscript reaches a boundary, the search stops in that direction; if all genes have already been copied, go directly to step d;
step c, copying the gene segments found in step b into the child in the order in which they appear in parent;
step d, judging whether the child has copied all the genes of the parents; if so, jump to step f; otherwise, execute step e;
step e, letting parent be the other parent and Index be the value of the last gene of the current child, regenerating the number num of genes to copy, and jumping to step b;
step f, obtaining the crossed child individual.
step iii, performing mutation on all child individuals with single-point mutation;
step iv, letting t_GA = t_GA + 1 and calculating the fitness values of the individuals;
step v, judging whether t_GA < T_GA holds; if so, jump to step i; otherwise, execute step vi;
and step vi, finishing the algorithm, and outputting the optimal scheduling scheme.
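Below is a hedged sketch of the genetic-algorithm loop of steps i to vi (roulette-wheel selection, the improved crossover, single-point mutation); because the fitness here is a manufacturing span to be minimized, the selection weights are taken inversely proportional to fitness, and the neighbor pairing and swap mutation are conventions assumed for this sketch.

```python
# Sketch of the GA loop (steps i-vi): roulette selection, improved crossover,
# single-point mutation.  `fitness` returns the manufacturing span (lower is better);
# inverse-fitness roulette weights and swap mutation are illustrative assumptions.
import random

def roulette(pop, fitness):
    weights = [1.0 / (1e-9 + fitness(ind)) for ind in pop]
    return random.choices(pop, weights=weights, k=len(pop))     # step i

def single_point_mutation(ind):
    a, b = random.sample(range(len(ind)), 2)                    # step iii: swap two genes
    ind = ind[:]
    ind[a], ind[b] = ind[b], ind[a]
    return ind

def run_ga(init_pop, fitness, crossover, max_iter):
    pop = init_pop
    for _ in range(max_iter):                                   # steps iv-v: iterate t_GA
        parents = roulette(pop, fitness)
        children = [crossover(parents[i], parents[(i + 1) % len(parents)])
                    for i in range(len(parents))]               # step ii: pairwise crossover
        pop = [single_point_mutation(c) for c in children]
    return min(pop, key=fitness)                                # step vi: best schedule found
```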
And step VII, using the optimal scheme obtained by the genetic algorithm as a production scheme for processing the workpiece by a batch processor.
Performance verification
To verify the performance of the batch scheduling optimization method based on deep reinforcement learning and genetic algorithm proposed in this embodiment, the algorithm is compared with a GA algorithm (Damodaran P, Manjeshwar P K, Srihari K. Minimizing makespan on a batch-processing machine with non-identical job sizes using genetic algorithms [J]. International Journal of Production Economics, 2006, 103(2): 882-891) and the BFLPT heuristic algorithm (Dupont L, Ghazvini F J. Minimizing makespan on a single batch processing machine with non-identical job sizes [J], 1998, 32(4): 431-440).
To verify the performance of the algorithm, a series of problem instances with different numbers of workpieces n, workpiece sizes s_j, and workpiece processing times p_j were randomly generated for testing. In this embodiment, four workpiece numbers n = {10, 20, 50, 100} are used; the workpiece size s_j is a uniformly distributed random integer in [4, 8]; the workpiece processing time p_j is uniformly distributed in [1, 20]. For the accuracy of the test data, 1000 instances were randomly generated for each workpiece number to compare the pointer network with BFLPT, and, considering the time consumed by solving, 100 instances were randomly generated for each workpiece number to compare the genetic algorithm whose initial population is optimized by the pointer network (PTR-GA) with the original genetic algorithm (GA); the machine capacity is set to 10.
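Instances following these distributions can be reproduced, up to the random seed, with a short generator such as the sketch below; the exact generator used for the experiments is not given in the text, and treating p_j as an integer is an assumption.

```python
# Sketch of the instance generator used in the experiments: s_j is a uniform integer
# in [4, 8], p_j is uniform in [1, 20] (taken here as integers, an assumption),
# and the machine capacity is 10.
import random

def generate_instances(n, count, seed=None):
    rng = random.Random(seed)
    instances = []
    for _ in range(count):
        sizes = [rng.randint(4, 8) for _ in range(n)]
        times = [rng.randint(1, 20) for _ in range(n)]
        instances.append((sizes, times, 10))
    return instances

if __name__ == "__main__":
    for n in (10, 20, 50, 100):
        print(n, len(generate_instances(n, 1000, seed=n)))
```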
For objectivity of the results, the parameters of PTR-GA and the parameter settings of the GA used for comparison are kept consistent with the original literature, and the learning parameters used to train the pointer network are as follows: batch size B = 128; total number of training samples D = 128000; the learning rate η of the actor and critic network parameters is 0.001 when the number of workpieces is 10 or 20 and 0.0001 when the number of workpieces is 50 or 100; the number of training epochs E is 40.
Under the above instance parameter settings, FIG. 4 compares the solution performance of the pointer network with that of the BFLPT heuristic algorithm, FIG. 5 shows the generalization capability of the pointer network on other workpiece scales, and FIG. 6 compares the solution performance of PTR-GA and GA. FIG. 4 mainly shows, over the 1000 instances for each number of workpieces, how often the completion time obtained by the pointer network is better than the result obtained by the BFLPT heuristic algorithm; it can be seen that as the number of workpieces increases, the number of instances on which the pointer network obtains the better completion time gradually exceeds that of the BFLPT heuristic algorithm. At small scales, the pointer network wins on a smaller proportion of instances because the BFLPT heuristic algorithm can already find most of the optimal solutions. Therefore, the pointer network performs better than the BFLPT heuristic algorithm at large scales, while at small scales its solution quality is slightly better and its solution time slightly higher than those of the BFLPT heuristic algorithm. FIG. 5 mainly shows the generalization capability of the pointer network model trained on small-scale instances to other workpiece scales, where the ordinate is the difference between the average over 1000 instances solved by the BFLPT heuristic algorithm and the average over the same 1000 instances solved by the pointer network model, and the abscissa is the number of workpieces. It can be seen that even for a pointer network trained on small-scale instances, the tendency of its completion time to be better than that of the BFLPT heuristic algorithm becomes more pronounced as the number of workpieces increases; therefore, the pointer network in the invention has good generalization capability. FIG. 6 mainly shows the comparison of the solution performance of PTR-GA and GA when the number of workpieces is 100, where the ordinate is the average completion time of each algorithm over the 100 instances and the abscissa is the number of iterations of the algorithm.

Claims (3)

1. A batch scheduling optimization method based on deep reinforcement learning and a genetic algorithm, characterized by comprising the following steps:
step I, establishing a mathematical model of the batch scheduling problem with non-identical workpieces;
The batch scheduling problem with non-identical workpieces is defined as follows: a workpiece set J = {1, 2, …, n}, where the processing time of workpiece j is p_j and its size is s_j; the capacity of the batch processor is C, and the batch processor can process several workpieces simultaneously provided the capacity constraint is satisfied; the set of batches to be processed is K, where the processing time P_k of batch k equals the maximum processing time of the workpieces in batch k; X_jk is a decision variable: X_jk = 1 if workpiece j is assigned to batch k, otherwise X_jk = 0; Y_k is a decision variable: Y_k = 1 if batch k is created, otherwise Y_k = 0;
According to the above definition, the batch scheduling problem for workpieces with different sizes on a single machine can be established as the following mathematical model:
an objective function:
\min \; C_{\max} = \sum_{k \in K} P_k Y_k \qquad (1)
constraint conditions are as follows:
\sum_{k \in K} X_{jk} = 1, \quad \forall j \in J \qquad (2)
\sum_{j \in J} s_j X_{jk} \le C\,Y_k, \quad \forall k \in K \qquad (3)
P_k \ge p_j X_{jk}, \quad \forall j \in J,\ \forall k \in K \qquad (4)
X_{jk} \in \{0,1\}, \quad \forall j \in J,\ \forall k \in K \qquad (5)
Y_k \in \{0,1\}, \quad \forall k \in K \qquad (6)
P_k \ge 0, \quad \forall k \in K \qquad (7)
step II, establishing a strategy model of the problem by adopting a pointer network;
step III, training the pointer network model with an actor-critic algorithm;
step IV, defining and initializing the parameters of the genetic algorithm: population size PopNum, maximum number of iterations T_GA, and the number n of workpieces; the number of completed iterations t_GA = 0;
V, optimizing an initial population of the genetic algorithm by using the pointer network trained in the step III;
step VI, solving the problem by adopting a genetic algorithm;
step VII, using the optimal scheme obtained by the genetic algorithm as a production scheme for processing the workpiece by a batch processor;
Optimizing the initial population of the genetic algorithm with the trained pointer network as described in step V mainly comprises the following steps:
step i, generating an individual in the population by adopting a real number coding mode and an LPT heuristic rule;
step ii, generating PopNum-1 individuals in the population by adopting a triangular fuzzy number mode;
step iii, obtaining a new population from the individuals in the population through a pointer network;
iv, sequencing all individuals in the two populations in an ascending order according to the fitness value, and taking the front PopNum individuals as an initial population of the genetic algorithm;
step VI, further solving the problem by adopting a genetic algorithm mainly comprises the following steps:
step i, selecting PopNum parent individuals according to a roulette mode;
step ii, combining all the parent individuals pairwise and generating child individuals with the improved multi-point crossover;
step iii, performing mutation on all the child individuals with single-point mutation;
step iv, letting t_GA = t_GA + 1 and calculating the fitness values of all the child individuals;
step v, judging whether t_GA < T_GA holds; if so, jump to step i; otherwise, execute step vi;
step vi, finishing the algorithm, and outputting an optimal scheduling scheme;
In step ii of step VI, generating offspring with the improved multi-point crossover mainly includes the following steps:
step a, initializing the child to be empty, selecting Parent1 and Parent2 to be crossed, letting the currently inherited parent be parent = Parent1 and the gene-copying start position Index = 0, and randomly generating the number num of genes to copy, where num ranges from 1 to n;
step b, starting from the position with subscript Index in parent, searching to the left and to the right for num genes that are not yet in the child; genes that already exist in the child are skipped; if the search subscript reaches a boundary, the search stops in that direction; if all genes have already been copied, go directly to step d;
step c, copying the gene segments found in step b into the child in the order in which they appear in parent;
step d, judging whether the child has copied all the genes of the parents; if so, jump to step f; otherwise, execute step e;
step e, letting parent be the other parent and Index be the value of the last gene of the current child, regenerating the number num of genes to copy, and jumping to step b;
step f, obtaining the crossed child individual.
2. The batch scheduling optimization method based on deep reinforcement learning and genetic algorithm of claim 1, wherein: step II, establishing the strategy model of the problem by adopting the pointer network mainly comprises the following steps:
the pointer network model is defined as follows, n represents the length of the encoder and decoder, and X = { X = { (X) } 1 ,x 2 ,…,x n Denotes a coded input workpiece information sequence, where x is input arbitrarily j All have x j =[s j ,p j ] T ,s j And p j Respectively representing the size and the processing time of the jth workpiece; e = { e = 1 ,e 2 ,…,e n Denotes the encoder's hidden layer state sequence, d = { d = } 1 ,d 2 ,…,d n Denotes the implicit layer state sequence of the decoder, y = { y = } 1 ,y 2 ,…,y n Denotes the final output sequence of the pointer network;
i, constructing an encoding layer network of the pointer network, wherein the encoder network consists of a full-connection network layer and an RNN with an LSTM module;
step ii, constructing a decoding layer network of the pointer network, wherein the decoder network is formed by an RNN with an LSTM module;
and step iii, introducing an attention mechanism for selecting and sequencing the workpieces in the output sequence by the pointer network, wherein when the t-th workpiece is added to the output sequence, the selected probability of the remaining workpieces is calculated as follows:
u_t^j = v^{T} \tanh\left( W_1 e_j + W_2 d_t \right), \quad j = 1, 2, \ldots, n \qquad (8)
A(e, d_t; W_1, W_2, v) = \mathrm{softmax}(u_t) \qquad (9).
3. The batch scheduling optimization method based on deep reinforcement learning and genetic algorithm of claim 1, wherein training the pointer network model with the actor-critic algorithm in step III mainly comprises the following steps:
step i, an actor network model adopted by the problem;
the Actor network adopts the pointer network model established in the step II;
step ii, establishing a critic network model of the problem;
the structure of the Critic network consists of an encoder, an LSTM processing module and a decoder of a 2-layer ReLU fully-connected neural network, wherein the encoder is an RNN with the LSTM module;
step iii, defining and initializing the number of samples B in one training batch, the total number D of training-set samples, the number of iterations per epoch T_PTR = D/B, the actor network parameters θ_a, the critic network parameters θ_c, and the number of training epochs E; the initial number of completed epochs epoch = 0, the initial iteration count t_PTR = 0, and the number of trained batch samples i = 0;
step iv, passing the workpiece information sequence x_i of the current batch of samples through the actor network to obtain the output sequence y_i;
step v, passing the workpiece information sequence x_i of the current batch of samples through the critic network to obtain the corresponding baseline value b_i;
Step vi, letting i = i +1, judging whether i < B is true, if so, skipping to execute step iv, otherwise, skipping to step vii;
step vii, letting i = 0 and solving the loss value of the actor network with a Monte Carlo sampling approximation of the reinforcement learning objective, calculated as follows:
L_{actor}(\theta_a) = \frac{1}{B} \sum_{i=1}^{B} \left( R(y_i) - b_i \right) \log p_{\theta_a}(y_i \mid x_i) \qquad (10)
step viii, using the mean square error as the loss value of the critic network, calculated as follows:
L_{critic}(\theta_c) = \frac{1}{B} \sum_{i=1}^{B} \left( R(y_i) - b_i \right)^2 \qquad (11)
step ix, optimizing the parameters of the actor network and the critic network with the Adam algorithm;
step x, letting t_PTR = t_PTR + 1 and judging whether t_PTR < T_PTR holds; if so, jump to step iv; otherwise, execute step xi;
step xi, letting t_PTR = 0 and epoch = epoch + 1, and judging whether epoch < E holds; if so, execute step iv; otherwise, jump to step xii;
and step xii, obtaining the trained actor network model.
CN202011373229.1A 2020-11-30 2020-11-30 Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm Active CN112488315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011373229.1A CN112488315B (en) 2020-11-30 2020-11-30 Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011373229.1A CN112488315B (en) 2020-11-30 2020-11-30 Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm

Publications (2)

Publication Number Publication Date
CN112488315A CN112488315A (en) 2021-03-12
CN112488315B true CN112488315B (en) 2022-11-04

Family

ID=74937276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011373229.1A Active CN112488315B (en) 2020-11-30 2020-11-30 Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm

Country Status (1)

Country Link
CN (1) CN112488315B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112171B (en) * 2021-04-22 2022-10-11 合肥工业大学 Batch scheduling method based on roulette and genetic algorithm
CN113191548A (en) * 2021-04-29 2021-07-30 南京航空航天大学 Production scheduling method
CN113448687B (en) * 2021-06-24 2022-07-26 山东大学 Hyper-heuristic task scheduling method and system based on reinforcement learning in cloud environment
CN113515097B (en) * 2021-07-23 2022-08-19 合肥工业大学 Two-target single machine batch scheduling method based on deep reinforcement learning
CN113743784A (en) * 2021-09-06 2021-12-03 山东大学 Production time sequence table intelligent generation method based on deep reinforcement learning
CN114186749B (en) * 2021-12-16 2022-06-28 暨南大学 Flexible workshop scheduling method and model based on reinforcement learning and genetic algorithm
CN115471142B (en) * 2022-11-02 2023-04-07 武汉理工大学 Intelligent port tug operation scheduling method based on man-machine cooperation
CN117709683A (en) * 2024-02-02 2024-03-15 合肥喆塔科技有限公司 Semiconductor wafer dynamic scheduling method and equipment based on real-time manufacturing data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390195A (en) * 2013-05-28 2013-11-13 重庆大学 Machine workshop task scheduling energy-saving optimization system based on reinforcement learning
CN103870647A (en) * 2014-03-14 2014-06-18 西安工业大学 Operation workshop scheduling modeling method based on genetic algorithm
CN109270904A (en) * 2018-10-22 2019-01-25 中车青岛四方机车车辆股份有限公司 A kind of flexible job shop batch dynamic dispatching optimization method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040044633A1 (en) * 2002-08-29 2004-03-04 Chen Thomas W. System and method for solving an optimization problem using a neural-network-based genetic algorithm technique
CN107301473B (en) * 2017-06-12 2018-06-15 合肥工业大学 Similar parallel machine based on improved adaptive GA-IAGA batch dispatching method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390195A (en) * 2013-05-28 2013-11-13 重庆大学 Machine workshop task scheduling energy-saving optimization system based on reinforcement learning
CN103870647A (en) * 2014-03-14 2014-06-18 西安工业大学 Operation workshop scheduling modeling method based on genetic algorithm
CN109270904A (en) * 2018-10-22 2019-01-25 中车青岛四方机车车辆股份有限公司 A kind of flexible job shop batch dynamic dispatching optimization method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Multi-Objective Workflow Scheduling With Deep-Q-Network-Based Multi-Agent Reinforcement Learning"; YUANDOU WANG et al.; 《IEEE Access》; 2019-03-29; pp. 1-9 *
"Research on genetic algorithms based on intelligent reinforcement learning" (基于智能强化学习的遗传算法研究); 叶婉秋; 《电脑学习》; 2010-04-30; pp. 1-3 *

Also Published As

Publication number Publication date
CN112488315A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112488315B (en) Batch scheduling optimization method based on deep reinforcement learning and genetic algorithm
CN108694502B (en) Self-adaptive scheduling method for robot manufacturing unit based on XGboost algorithm
CN114186749B (en) Flexible workshop scheduling method and model based on reinforcement learning and genetic algorithm
CN107045569B (en) Gear reducer optimization design method based on clustering multi-target distribution estimation algorithm
CN113034026A (en) Q-learning and GA based multi-target flexible job shop scheduling self-learning method
CN107506865A (en) A kind of load forecasting method and system based on LSSVM optimizations
CN110598929B (en) Wind power nonparametric probability interval ultrashort term prediction method
CN112947300A (en) Virtual measuring method, system, medium and equipment for processing quality
CN116560313A (en) Genetic algorithm optimization scheduling method for multi-objective flexible job shop problem
Su et al. Many‐objective optimization by using an immune algorithm
Rad et al. GP-RVM: Genetic programing-based symbolic regression using relevance vector machine
CN113971517A (en) GA-LM-BP neural network-based water quality evaluation method
CN104732067A (en) Industrial process modeling forecasting method oriented at flow object
Fu et al. An improved adaptive genetic algorithm for solving 3-SAT problems based on effective restart and greedy strategy
CN114021934A (en) Method for solving workshop energy-saving scheduling problem based on improved SPEA2
Chai et al. Symmetric uncertainty based decomposition multi-objective immune algorithm for feature selection
CN113762591A (en) Short-term electric quantity prediction method and system based on GRU and multi-core SVM counterstudy
CN113705098A (en) Air duct heater modeling method based on PCA and GA-BP network
Zhu et al. An Efficient Hybrid Feature Selection Method Using the Artificial Immune Algorithm for High‐Dimensional Data
CN116151303B (en) Method for optimizing combustion chamber design by accelerating multi-objective optimization algorithm
CN117291069A (en) LSTM sewage water quality prediction method based on improved DE and attention mechanism
CN117151277A (en) Two-dimensional irregular layout method based on network migration and hybrid positioning and application
CN116613740A (en) Intelligent load prediction method based on transform and TCN combined model
CN114217580B (en) Functional fiber production scheduling method based on improved differential evolution algorithm
CN110705844A (en) Robust optimization method of job shop scheduling scheme based on non-forced idle time

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant