CN113077188A - MTO enterprise order accepting method based on average reward reinforcement learning - Google Patents

MTO enterprise order accepting method based on average reward reinforcement learning

Info

Publication number
CN113077188A
CN113077188A CN202110468897.0A CN202110468897A
Authority
CN
China
Prior art keywords
order
enterprise
mto
reinforcement learning
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110468897.0A
Other languages
Chinese (zh)
Other versions
CN113077188B (en)
Inventor
吴克宇
钱静
陈超
刘忠
黄金才
程光权
胡星辰
杜航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110468897.0A
Publication of CN113077188A
Application granted
Publication of CN113077188B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06316Sequencing of tasks or work
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067Enterprise or organisation modelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0206Price or cost determination based on market factors

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Accounting & Taxation (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Factory Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an MTO enterprise order accepting method based on average reward reinforcement learning, comprising the following steps: making assumptions on order information, determining the system state set, determining the system action set, determining the immediate return function, constructing the order acceptance model and solving the order acceptance model. On top of the factors considered in the traditional MTO enterprise order acceptance problem, the invention adds order inventory cost and multiple customer priority factors, constructs the order acceptance model as a semi-Markov decision process, solves it with the SMART algorithm, and on that basis uses a greedy algorithm to sequence accepted orders for production so as to maximize the enterprise's long-term average profit.

Description

MTO enterprise order accepting method based on average reward reinforcement learning
Technical Field
The invention relates to the technical field of enterprise order acceptance and selection, and in particular to an MTO enterprise order acceptance method based on average reward reinforcement learning.
Background
An MTO (make-to-order) enterprise produces according to customer orders, and different customers have different requirements on order types; the enterprise organizes production according to the order requirements raised by its customers. Normally the enterprise's capacity is limited, and various cost factors prevent it from accepting every customer's order, so the MTO enterprise must formulate a corresponding order acceptance method. The success of an MTO enterprise depends to a great extent on how selectively it accepts orders, and a good order acceptance method contributes greatly to the enterprise's long-term profit.
Existing research on order acceptance decision methods has produced some results. However, with the rapid development of electronic commerce, consumers' personalized requirements have become increasingly prominent; traditional production enterprises usually have no direct contact with end customers during production, so diversified customer requirements are difficult to satisfy. Moreover, existing order acceptance methods do not consider a comprehensive set of factors in the modeling process, and therefore cannot effectively determine an order acceptance strategy from the enterprise's production capacity and order states.
Disclosure of Invention
In view of these problems, the invention aims to provide an MTO enterprise order acceptance method based on average reward reinforcement learning that adds order inventory cost and multiple customer priority factors to the factors considered in the traditional MTO enterprise order acceptance problem, constructs an order acceptance model as a semi-Markov decision process, solves it with the SMART algorithm, and on that basis uses a greedy algorithm to sequence accepted orders for production so as to maximize the enterprise's long-term average profit.
In order to achieve the purpose of the invention, the invention is realized by the following technical scheme: an MTO enterprise order accepting method based on average reward reinforcement learning comprises the following steps:
Step one: assumption of order information
Suppose that an MTO enterprise produces on a single production line and that n types of customer orders exist in the market; the order information comprises customer priority μ, price p, quantity Q, unit production cost c, lead time LT and latest delivery time DT;
step two: determining a set of system states
According to step one, if there are n order types in the system, the system state can be represented by the vector S = (μ, p, Q, LT, DT, T), where T denotes the production time still required by orders accepted before the current decision stage;
step three: determining a set of system actions
According to step one, when a customer order arrives, a decision must be made to accept or reject it; the action set of the model can be represented by the vector A = (a_1, a_2), where a_1 denotes accepting the order and a_2 denotes rejecting it;
step four: determining an immediate reward function
After the MTO enterprise makes a decision whether to accept an order, the obtained immediate return function is as follows:
r(s,a) =
I − C − μ·Y − N, if Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan;
−(I − C − μ·Y − N), if Q(s,a_1) > Q(s,a_2) and the order cannot be inserted into the current production plan;
−μ·J, if Q(s,a_1) < Q(s,a_2)
where I = p × Q denotes the revenue obtained from the order, C = c × Q the production cost consumed, Y the enterprise's delay penalty cost, N the inventory (holding) cost incurred, and J the rejection cost of the order;
step five: building order acceptance model
An order acceptance model in the semi-Markov decision process is constructed from the system state set, the system action set and the immediate return function, and the real MTO enterprise order acceptance problem is simulated based on the average reward reinforcement learning idea; according to the Bellman optimality principle, the corresponding optimal policy in the semi-Markov decision process problem satisfies:
Q*(s,a) = r̄(s,a) − ρ*·t̄(s,a) + Σ_{s′∈S} p(s′|s,a)·max_{a′} Q*(s′,a′)
wherein
r̄(s,a) = Σ_{s′∈S} p(s′|s,a)·r(s,a,s′) denotes the expected immediate return of taking action a in state s,
t̄(s,a) = Σ_{s′∈S} p(s′|s,a)·t(s,a,s′) denotes the expected transition time out of state s under action a, and
ρ* = lim_{M→∞} (Σ_{m=1}^{M} r_m) / (Σ_{m=1}^{M} t_m)
denotes the average reward achieved over the decision periods; t_m denotes the time for decision period m to transition from state s to state s′;
step six: order acceptance model solution
Taking the reinforcement learning average reward as the evaluation target, the order acceptance model in the semi-Markov decision process is solved by the average reward reinforcement learning SMART algorithm, and within the SMART algorithm a greedy algorithm is used to sequence the accepted orders, yielding the optimal order acceptance decision; the update formula of the average reward reinforcement learning SMART algorithm is:
Q_{m+1}(s,a) ← (1 − α_m)·Q_m(s,a) + α_m·[r_m(s,a,s′) − ρ_m·t_m(s,a,s′) + max_{a′} Q_m(s′,a′)]
where α_m denotes the learning rate, m the current iteration index, r_m(s,a,s′) the immediate return obtained after taking action a in state s, t_m(s,a,s′) the transition time from state s to s′, R_m the cumulative return of the m-th decision period, ρ_m the average return of the m-th decision period, and t_m the cumulative time of the m-th decision period.
A further improvement is that in step one, customer orders arrive according to a Poisson distribution with parameter λ, and order prices and required quantities are uniformly distributed.
A further improvement is that in step two, for a capacity-constrained MTO enterprise, T has a maximum upper bound; with n order types, the system state set S contains n × T states in total.
A further improvement is that in step four, the three cases of r(s,a), from top to bottom, are: when Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan in the current state, the immediate return equals the net profit obtained by accepting the order; when Q(s,a_1) > Q(s,a_2) but the order cannot be inserted into the current production plan in the current state, the immediate return equals the negative of that net profit; and when Q(s,a_1) < Q(s,a_2), the immediate return equals the rejection cost.
A further improvement is that in step four, the delay penalty cost Y = μ·u·{(T + Q/b) − LT}, where u denotes the delay penalty cost per unit time and b denotes the unit production capacity of the enterprise.
A further improvement is that in step four, because the customer does not take delivery of products completed before the lead time, products temporarily stored in the MTO enterprise's warehouse incur an inventory cost N = Q·h·{LT − (T + Q/b)}, where h denotes the storage cost per unit product per unit time.
A further improvement is that in step six, an exploration probability e that decreases as the number of simulation iterations grows is used to guarantee convergence of the average reward reinforcement learning SMART algorithm, with α and e decaying according to the DCM scheme:
α_m = α_0 / (1 + m²/(χ + m))
e_m = e_0 / (1 + m²/(χ + m))
where χ represents an arbitrarily large real number.
The invention has the following beneficial effects: on top of the factors considered in the traditional MTO enterprise order acceptance problem, the invention adds order inventory cost and multiple customer priority factors, constructs the order acceptance model as a semi-Markov decision process, solves it with the SMART algorithm, and on that basis uses a greedy algorithm to sequence accepted orders for production so as to maximize the enterprise's long-term average profit.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of an order acceptance method of the present invention;
FIG. 2 is a diagram of a reinforcement learning order decision interaction of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," "third," "fourth," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Referring to fig. 1 and 2, the embodiment provides an MTO enterprise order acceptance method based on average reward reinforcement learning, including the following steps:
Step one: assumption of order information
Suppose that an MTO enterprise produces on a single production line and that n types of customer orders exist in the market. The order information comprises customer priority μ, price p, quantity Q, unit production cost c, lead time LT and latest delivery time DT; customer orders arrive according to a Poisson distribution with parameter λ, and order prices and required quantities are uniformly distributed;
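For illustration, the following Python sketch samples one simulated order under the step-one assumptions; the arrival rate λ = 0.5 and all price, quantity and lead-time ranges are hypothetical placeholders, not values fixed by the invention:

import random

def sample_order(lam=0.5, rng=random):
    # One simulated customer order: exponential inter-arrival gap (Poisson
    # arrivals) and uniformly distributed price and quantity, per step one.
    lt = rng.uniform(5.0, 15.0)                # lead time LT (range assumed)
    return {
        "gap": rng.expovariate(lam),           # time since the previous arrival
        "mu": rng.choice([1, 2, 3]),           # customer priority class (assumed set)
        "p": rng.uniform(8.0, 12.0),           # unit price, uniform (range assumed)
        "Q": rng.randint(5, 20),               # quantity, uniform (range assumed)
        "LT": lt,
        "DT": lt + rng.uniform(5.0, 10.0),     # latest delivery time, after LT
    }

print(sample_order())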
step two: determining a set of system states
According to step one, if there are n order types in the system, the system state can be represented by the vector S = (μ, p, Q, LT, DT, T), where T denotes the production time still required by orders accepted before the current decision stage; for a capacity-constrained MTO enterprise, T has a maximum upper bound, so with n order types the system state set S contains n × T states in total;
step three: determining a set of system actions
According to step one, when a customer order arrives, a decision must be made to accept or reject it; the action set of the model can be represented by the vector A = (a_1, a_2), where a_1 denotes accepting the order and a_2 denotes rejecting it;
step four: determining an immediate reward function
After the MTO enterprise makes a decision whether to accept an order, the obtained immediate return function is as follows:
r(s,a) =
I − C − μ·Y − N, if Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan;
−(I − C − μ·Y − N), if Q(s,a_1) > Q(s,a_2) and the order cannot be inserted into the current production plan;
−μ·J, if Q(s,a_1) < Q(s,a_2)
where I = p × Q denotes the revenue obtained from the order, C = c × Q the production cost consumed, Y the enterprise's delay penalty cost, N the inventory (holding) cost incurred, and J the rejection cost. The three cases of r(s,a), from top to bottom, are: when Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan in the current state, the immediate return equals the net profit obtained by accepting the order; when Q(s,a_1) > Q(s,a_2) but the order cannot be inserted into the current production plan in the current state, the immediate return equals the negative of that net profit; and when Q(s,a_1) < Q(s,a_2), the immediate return equals the rejection cost. The enterprise's delay penalty cost is Y = μ·u·{(T + Q/b) − LT}, where u denotes the delay penalty cost per unit time and b the unit production capacity of the enterprise. Because the customer does not take delivery of products completed before the lead time, products temporarily stored in the MTO enterprise's warehouse incur an inventory cost N = Q·h·{LT − (T + Q/b)}, where h denotes the storage cost per unit product per unit time;
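To make the cost terms concrete, here is a small Python sketch of the net profit of accepting a single order, following the formulas above; the numeric inputs in the example call are illustrative assumptions, and the priority weight μ is applied once (inside Y) rather than both inside Y and in front of it:

def net_profit(p, Q, c, mu, LT, T, b, u, h):
    # Net profit I - C - Y - N of accepting one order, per step four.
    # At most one of Y (late finish) and N (early finish) is nonzero.
    I = p * Q                               # revenue from the order
    C = c * Q                               # production cost consumed
    finish = T + Q / b                      # completion time of this order
    Y = mu * u * max(finish - LT, 0.0)      # priority-weighted delay penalty
    N = Q * h * max(LT - finish, 0.0)       # holding cost for early completion
    return I - C - Y - N

# Example: 10 units at price 12 and unit cost 8, finishing at T + Q/b = 7,
# one period after the lead time of 6, so the return is 120 - 80 - 3 = 37.
print(net_profit(p=12, Q=10, c=8, mu=2, LT=6.0, T=2.0, b=2.0, u=1.5, h=0.2))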
step five: building order acceptance model
An order acceptance model in the semi-Markov decision process is constructed from the system state set, the system action set and the immediate return function, and the real MTO enterprise order acceptance problem is simulated based on the average reward reinforcement learning idea; according to the Bellman optimality principle, the corresponding optimal policy in the semi-Markov decision process problem satisfies:
Q*(s,a) = r̄(s,a) − ρ*·t̄(s,a) + Σ_{s′∈S} p(s′|s,a)·max_{a′} Q*(s′,a′)
wherein
r̄(s,a) = Σ_{s′∈S} p(s′|s,a)·r(s,a,s′) denotes the expected immediate return of taking action a in state s,
t̄(s,a) = Σ_{s′∈S} p(s′|s,a)·t(s,a,s′) denotes the expected transition time out of state s under action a, and
ρ* = lim_{M→∞} (Σ_{m=1}^{M} r_m) / (Σ_{m=1}^{M} t_m)
denotes the average reward achieved over the decision periods; t_m denotes the time for decision period m to transition from state s to state s′;
step six: order acceptance model solution
Taking the reinforcement learning average reward as the evaluation target, the order acceptance model in the semi-Markov decision process is solved by the average reward reinforcement learning SMART algorithm, and within the SMART algorithm a greedy algorithm is used to sequence the accepted orders, yielding the optimal order acceptance decision; the update formula of the average reward reinforcement learning SMART algorithm is:
Q_{m+1}(s,a) ← (1 − α_m)·Q_m(s,a) + α_m·[r_m(s,a,s′) − ρ_m·t_m(s,a,s′) + max_{a′} Q_m(s′,a′)]
where α_m denotes the learning rate, m the current iteration index, r_m(s,a,s′) the immediate return obtained after taking action a in state s, t_m(s,a,s′) the transition time from state s to s′, R_m the cumulative return of the m-th decision period, ρ_m the average return of the m-th decision period, and t_m the cumulative time of the m-th decision period. Convergence of the average reward reinforcement learning SMART algorithm is guaranteed by an exploration probability e that decreases as the number of simulation iterations grows, with α and e decaying according to the DCM scheme:
α_m = α_0 / (1 + m²/(χ + m))
e_m = e_0 / (1 + m²/(χ + m))
where χ represents an arbitrarily large real number.
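As a sketch of the decay behavior, the schedule below uses the Darken-Chang-Moody search-then-converge form base/(1 + m²/(χ + m)); this functional form is an assumption made here for illustration, since the patent's equation images are not reproduced in this text:

def dcm(base, m, chi=1e6):
    # Stays near `base` while m*m is small relative to chi, then decays
    # roughly like 1/m, preserving early exploration and later convergence.
    return base / (1.0 + m * m / (chi + m))

for m in (1, 1_000, 100_000, 1_000_000):
    print(m, round(dcm(0.1, m), 7), round(dcm(0.2, m), 7))   # alpha_m, e_m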
The SMART algorithm flow is as follows:
1. Initialize m, Q_m(s,a), t_m, R_m and ρ_m to 0; set e = 0.2 and α = 0.1; order_list = []
2. While m < Maxsteps do
3. Compute e_m and α_m according to the DCM mechanism
4. Randomly generate a number e_random; if e_m < e_random, select the action a with the largest state-action value function; if e_m > e_random, randomly select an action a from the action set
5. If a = a_1, Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan in the current state, then r = I − C − μ·Y − N and the order is added to the to-be-produced list order_list; if a = a_1, Q(s,a_1) > Q(s,a_2) and the order cannot be inserted into the current production plan in the current state, then r = −(I − C − μ·Y − N); if a = a_2 and Q(s,a_1) < Q(s,a_2), then r = −μ·J
6. Execute action a to obtain the next state s′, r_m(s,a,s′) and t_m(s,a,s′)
7. Update the state-action value function:
Q_{m+1}(s,a) ← (1 − α_m)·Q_m(s,a) + α_m·[r_m(s,a,s′) − ρ_m·t_m(s,a,s′) + max_{a′} Q_m(s′,a′)]
8. If a non-exploratory action was taken, update t_{m+1} ← t_m + t_m(s,a,s′), R_{m+1} ← R_m + r_m(s,a,s′), ρ_{m+1} ← R_{m+1}/t_{m+1}; otherwise t_{m+1} ← t_m, R_{m+1} ← R_m, ρ_{m+1} ← ρ_m
9. When production is scheduled, select the order to be produced next from order_list using the greedy algorithm, and delete the selected order from the to-be-produced queue order_list
10. Update the decision stage: m ← m + 1
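Pulling steps 1 to 10 together, the following self-contained Python sketch runs the SMART loop on a toy simulator; every numeric parameter, the coarsened state encoding, and the earliest-due-date greedy rule are illustrative assumptions rather than the patented implementation:

import random
from collections import defaultdict

random.seed(0)
B, C_UNIT = 2.0, 8.0                     # unit capacity and unit production cost (assumed)
U, H, J = 1.5, 0.2, 10.0                 # delay, holding and rejection cost rates (assumed)
CHI, ALPHA0, E0, LAM = 1e6, 0.1, 0.2, 0.5

def dcm(base, m):                        # DCM decay (assumed form)
    return base / (1.0 + m * m / (CHI + m))

def sample_order():                      # step one: Poisson arrivals, uniform price/quantity
    mu, p, q = random.choice([1, 2, 3]), random.uniform(8, 12), random.randint(5, 20)
    lt = random.uniform(5, 15)
    return mu, p, q, lt, lt + random.uniform(5, 10)   # (mu, p, Q, LT, DT)

def state(order, load):                  # coarse (priority, size class, load class) state
    mu, _, q, _, _ = order
    return (mu, q // 5, min(int(load), 20))

Q = defaultdict(float)                   # tabular Q[(state, action)]; action 1 = accept
rho = R_tot = T_tot = t_busy = 0.0
backlog, order = [], sample_order()

for m in range(1, 200_001):
    s = state(order, t_busy)
    explore = random.random() < dcm(E0, m)              # step 4: epsilon-greedy choice
    a = random.randint(0, 1) if explore else int(Q[(s, 1)] >= Q[(s, 0)])

    mu, p, q, lt, dt = order
    if a == 1:                                          # step 5: acceptance returns
        finish = t_busy + q / B
        Y = mu * U * max(finish - lt, 0.0)              # delay penalty
        N = q * H * max(lt - finish, 0.0)               # holding cost
        r = p * q - C_UNIT * q - Y - N                  # net profit of acceptance
        if finish > dt:                                 # misses the latest delivery time:
            r = -r                                      # the "cannot be inserted" case
        else:
            backlog.append((dt, q))
            t_busy = finish
    else:
        r = -mu * J                                     # rejection cost, priority weighted

    gap = random.expovariate(LAM)                       # step 6: time to the next arrival
    t_busy = max(t_busy - gap, 0.0)                     # the line works down its backlog
    order = sample_order()
    s2 = state(order, t_busy)

    target = r - rho * gap + max(Q[(s2, 0)], Q[(s2, 1)])        # step 7: SMART update
    Q[(s, a)] += dcm(ALPHA0, m) * (target - Q[(s, a)])

    if not explore:                                     # step 8: rho on greedy steps only
        R_tot += r
        T_tot += gap
        rho = R_tot / max(T_tot, 1e-9)

    if backlog:                                         # step 9: greedy earliest-due-date rule
        backlog.sort()
        backlog.pop(0)

print(f"estimated long-run average reward rho = {rho:.2f}")

In this sketch the backlog is drained one order per decision epoch purely to exercise the greedy rule; a faithful simulator would track actual production completion times.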
The MTO enterprise order acceptance method based on average reward reinforcement learning adds order inventory cost and multiple customer priority factors to the factors considered in the traditional MTO enterprise order acceptance problem, constructs the order acceptance model as a semi-Markov decision process, solves it with the SMART algorithm, and on that basis uses a greedy algorithm to sequence accepted orders for production so as to maximize the enterprise's long-term average profit. The method therefore has a strong order selection capability and good adaptability to environmental change; it can balance order profits against the various costs to bring higher returns to the MTO enterprise, while also meeting customers' personalized demands and maintaining close contact with customers.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. An MTO enterprise order receiving method based on average reward reinforcement learning is characterized in that: the method comprises the following steps:
Step one: assumption of order information
Suppose that an MTO enterprise produces on a single production line and that n types of customer orders exist in the market; the order information comprises customer priority μ, price p, quantity Q, unit production cost c, lead time LT and latest delivery time DT;
step two: determining a set of system states
According to step one, if there are n order types in the system, the system state can be represented by the vector S = (μ, p, Q, LT, DT, T), where T denotes the production time still required by orders accepted before the current decision stage;
step three: determining a set of system actions
According to step one, when a customer order arrives, a decision must be made to accept or reject it; the action set of the model can be represented by the vector A = (a_1, a_2), where a_1 denotes accepting the order and a_2 denotes rejecting it;
step four: determining an immediate reward function
After the MTO enterprise makes a decision whether to accept an order, the obtained immediate return function is as follows:
r(s,a) =
I − C − μ·Y − N, if Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan;
−(I − C − μ·Y − N), if Q(s,a_1) > Q(s,a_2) and the order cannot be inserted into the current production plan;
−μ·J, if Q(s,a_1) < Q(s,a_2)
where I = p × Q denotes the revenue obtained from the order, C = c × Q the production cost consumed, Y the enterprise's delay penalty cost, N the inventory (holding) cost incurred, and J the rejection cost of the order;
step five: building order acceptance model
An order acceptance model in the semi-Markov decision process is constructed from the system state set, the system action set and the immediate return function, and the real MTO enterprise order acceptance problem is simulated based on the average reward reinforcement learning idea; according to the Bellman optimality principle, the corresponding optimal policy in the semi-Markov decision process problem satisfies:
Q*(s,a) = r̄(s,a) − ρ*·t̄(s,a) + Σ_{s′∈S} p(s′|s,a)·max_{a′} Q*(s′,a′)
wherein
r̄(s,a) = Σ_{s′∈S} p(s′|s,a)·r(s,a,s′) denotes the expected immediate return of taking action a in state s,
t̄(s,a) = Σ_{s′∈S} p(s′|s,a)·t(s,a,s′) denotes the expected transition time out of state s under action a, and
ρ* = lim_{M→∞} (Σ_{m=1}^{M} r_m) / (Σ_{m=1}^{M} t_m)
denotes the average reward achieved over the decision periods; t_m denotes the time for decision period m to transition from state s to state s′;
step six: order acceptance model solution
Taking the reinforcement learning average reward as the evaluation target, the order acceptance model in the semi-Markov decision process is solved by the average reward reinforcement learning SMART algorithm, and within the SMART algorithm a greedy algorithm is used to sequence the accepted orders, yielding the optimal order acceptance decision; the update formula of the average reward reinforcement learning SMART algorithm is:
Q_{m+1}(s,a) ← (1 − α_m)·Q_m(s,a) + α_m·[r_m(s,a,s′) − ρ_m·t_m(s,a,s′) + max_{a′} Q_m(s′,a′)]
where α_m denotes the learning rate, m the current iteration index, r_m(s,a,s′) the immediate return obtained after taking action a in state s, t_m(s,a,s′) the transition time from state s to s′, R_m the cumulative return of the m-th decision period, ρ_m the average return of the m-th decision period, and t_m the cumulative time of the m-th decision period.
2. The MTO enterprise order acceptance method based on average reward reinforcement learning according to claim 1, wherein: in step one, customer orders arrive according to a Poisson distribution with parameter λ, and order prices and required quantities are uniformly distributed.
3. The MTO enterprise order acceptance method based on average reward reinforcement learning according to claim 1, wherein: in step two, for a capacity-constrained MTO enterprise, T has a maximum upper bound; with n order types, the system state set S contains n × T states in total.
4. The MTO enterprise order acceptance method based on average reward reinforcement learning according to claim 1, wherein: in step four, the three cases of r(s,a), from top to bottom, are: when Q(s,a_1) > Q(s,a_2) and the order can be inserted into the current production plan in the current state, the immediate return equals the net profit obtained by accepting the order; when Q(s,a_1) > Q(s,a_2) but the order cannot be inserted into the current production plan in the current state, the immediate return equals the negative of that net profit; and when Q(s,a_1) < Q(s,a_2), the immediate return equals the rejection cost.
5. The MTO enterprise order acceptance method based on average reward reinforcement learning according to claim 1, wherein: in step four, the delay penalty cost Y = μ·u·{(T + Q/b) − LT}, where u denotes the delay penalty cost per unit time and b denotes the unit production capacity of the enterprise.
6. The MTO enterprise order acceptance method based on average reward reinforcement learning according to claim 1, wherein: in step four, because the customer does not take delivery of products completed before the lead time, products temporarily stored in the MTO enterprise's warehouse incur an inventory cost N = Q·h·{LT − (T + Q/b)}, where h denotes the storage cost per unit product per unit time.
7. The MTO enterprise order acceptance method based on average reward reinforcement learning according to claim 1, wherein: in step six, an exploration probability e that decreases as the number of simulation iterations grows is used to guarantee convergence of the average reward reinforcement learning SMART algorithm, with α and e decaying according to the DCM scheme:
α_m = α_0 / (1 + m²/(χ + m))
e_m = e_0 / (1 + m²/(χ + m))
where χ represents an arbitrarily large real number.
CN202110468897.0A 2021-04-28 2021-04-28 MTO enterprise order accepting method based on average reward reinforcement learning Active CN113077188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468897.0A CN113077188B (en) 2021-04-28 2021-04-28 MTO enterprise order accepting method based on average reward reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110468897.0A CN113077188B (en) 2021-04-28 2021-04-28 MTO enterprise order accepting method based on average reward reinforcement learning

Publications (2)

Publication Number Publication Date
CN113077188A true CN113077188A (en) 2021-07-06
CN113077188B CN113077188B (en) 2022-11-08

Family

ID=76619029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468897.0A Active CN113077188B (en) 2021-04-28 2021-04-28 MTO enterprise order accepting method based on average reward reinforcement learning

Country Status (1)

Country Link
CN (1) CN113077188B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103246950A (en) * 2012-10-30 2013-08-14 中国科学院沈阳自动化研究所 Method for promising order of semiconductor assembly and test enterprise
CN103927628A (en) * 2011-08-16 2014-07-16 上海交通大学 Order management system and order management method oriented to customer commitments
CN110517002A (en) * 2019-08-29 2019-11-29 烟台大学 Production control method based on intensified learning
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning
CN111126905A (en) * 2019-12-16 2020-05-08 武汉理工大学 Casting enterprise raw material inventory management control method based on Markov decision theory

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927628A (en) * 2011-08-16 2014-07-16 上海交通大学 Order management system and order management method oriented to customer commitments
CN103246950A (en) * 2012-10-30 2013-08-14 中国科学院沈阳自动化研究所 Method for promising order of semiconductor assembly and test enterprise
CN110517002A (en) * 2019-08-29 2019-11-29 烟台大学 Production control method based on intensified learning
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning
CN111126905A (en) * 2019-12-16 2020-05-08 武汉理工大学 Casting enterprise raw material inventory management control method based on Markov decision theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG Xiaohuan et al.: "Order acceptance strategy of make-to-order enterprises based on reinforcement learning", Systems Engineering - Theory & Practice *
HAO Juan et al.: "Order acceptance strategy of make-to-order enterprises based on average reinforcement learning", Journal of Computer Applications *

Also Published As

Publication number Publication date
CN113077188B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN109636011A (en) A kind of multishift operation plan scheduling method based on improved change neighborhood genetic algorithm
US10628791B2 (en) System and method of simultaneous computation of optimal order point and optimal order quantity
CN109816315A (en) Path planning method and device, electronic equipment and readable storage medium
CN108550090A (en) A kind of processing method and system of determining source of houses pricing information
WO2018161908A1 (en) Product object processing method and device, storage medium and electronic device
CN110555578B (en) Sales prediction method and device
CN110046761A (en) A kind of ethyl alcohol inventory&#39;s Replenishment Policy based on multi-objective particle
CN110310057A (en) Kinds of goods sequence and goods yard processing method, device, equipment and its storage medium
CN109961198A (en) Related information generation method and device
CN109741083A (en) A kind of material requirement weight predicting method based on enterprise MRP
CN116207739B (en) Optimal scheduling method and device for power distribution network, computer equipment and storage medium
KR20230070779A (en) Demand response management method for discrete industrial manufacturing system based on constrained reinforcement learning
CN115334106A (en) Microgrid transaction consensus method and system based on Q method and power grid detection and evaluation
CN113077188B (en) MTO enterprise order accepting method based on average reward reinforcement learning
CN113592240A (en) Order processing method and system for MTO enterprise
CN117113608A (en) Cold-chain logistics network node layout method and equipment
CN110533485A (en) A kind of method, apparatus of object select, storage medium and electronic equipment
CN102542432B (en) Inventory management system and method
CN110047001A (en) A kind of futures data artificial intelligence analysis method and system
CN110210885A (en) Excavate method, apparatus, equipment and the readable storage medium storing program for executing of potential customers
CN110414875A (en) Capacity data processing method, device, electronic equipment and computer-readable medium
CN114677183A (en) New product sales prediction method and device, computer equipment and storage medium
CN112579721A (en) Method and system for constructing crowd distribution map, terminal device and storage medium
CN113127167A (en) Heterogeneous resource intelligent parallel scheduling method based on improved genetic algorithm
CN112950033A (en) Reservoir dispatching decision method and system based on reservoir dispatching rule synthesis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant