CN114625091A - Optimization control method and device, storage medium and electronic equipment - Google Patents

Optimization control method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN114625091A
CN114625091A
Authority
CN
China
Prior art keywords
control
optimization
model
action
constructed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210277509.5A
Other languages
Chinese (zh)
Inventor
朱翔宇
殷宏磊
徐浩然
郑宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd filed Critical Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202210277509.5A priority Critical patent/CN114625091A/en
Publication of CN114625091A publication Critical patent/CN114625091A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 Programme-control systems
    • G05B19/02 Programme-control systems electric
    • G05B19/418 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41885 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by modeling, simulation of the manufacturing system
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/30 Nc systems
    • G05B2219/32 Operator till task planning
    • G05B2219/32339 Object oriented modeling, design, analysis, implementation, simulation language
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses an optimization control method and device, a storage medium, and an electronic device. The optimization model is trained and learned on an offline data set, and optimization control is performed through the optimization model even in the face of a complex objective function and a nonlinear dynamics model, which improves the efficiency of data use and the generality of the optimization control. Moreover, because the optimization model is constructed within a model predictive control framework, it does not need to be retrained even in the face of a new control task objective or a control task with additional constraints, which improves the adaptability and control flexibility of the optimization control method.

Description

Optimization control method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of automatic control technologies, and in particular, to an optimization control method, apparatus, storage medium, and electronic device.
Background
Production processes across industries involve a great number of system operation control tasks, such as robot control, automatic operation systems for agricultural machinery, automatic control in intelligent manufacturing, and various operation control systems in industrial fields such as energy, chemical engineering, and metallurgy. Optimizing the control of these systems improves resource utilization; reduces waste of time, materials, and energy; strengthens industrial competitiveness; and serves the green development goals of energy conservation and emission reduction, which is of great significance to the development and progress of these industries.
Conventional optimization control methods in industrial control applications include the Proportional-Integral-Derivative (PID) controller and the Model Predictive Control (MPC) controller.
Conventional optimization control methods perform poorly on the optimization problems of complex control systems. On the one hand, their limited solving capability restricts the optimization effect achievable on increasingly complex control systems. On the other hand, they make no effective use of the massive data accumulated in a control system and, at the model design stage, depend heavily on human experience, theoretical derivation, or a simulation environment consistent with the real environment; in the face of complex objectives and nonlinear dynamics models, solving is difficult and inefficient, so these methods lack generality. In addition, offline policy learning for a control system requires a large amount of computing resources, which limits its application in scenarios where computing resources are scarce; such methods also adapt poorly to changes in the control task objective, giving poor control flexibility.
Therefore, the conventional optimization control method has poor generality and poor control flexibility.
Disclosure of Invention
In view of this, the present application discloses an optimization control method, an optimization control device, a storage medium, and an electronic apparatus, which aim to improve the versatility, the adaptability, and the control flexibility of the optimization control method.
To achieve this purpose, the technical solutions of the present application are as follows:
the first aspect of the present application discloses an optimization control method, including:
acquiring the current system state of a control system;
inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and executing corresponding optimized control operation based on the optimized control action recommended quantity.
Preferably, the inputting the current state of the system into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain a recommended amount of the optimization control action includes:
carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer greater than or equal to 1;
acquiring an original track sequence of the N control tracks;
selecting a target track sequence set which meets a preset condition from the original track sequence;
carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence;
and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
Preferably, the performing the trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence includes:
summing the reward values of all tracks in the target track sequence set to obtain an accumulated reward value;
and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
Preferably, the process of constructing the dynamic characteristics model includes:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the dynamic characteristic model through a deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on the off-line data set.
Preferably, the construction process of the behavior strategy model includes:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the behavior strategy model through a deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
Preferably, the process of constructing the action value function model includes:
acquiring an offline data set of a control system; the off-line data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
the action value function model is constructed from the offline data set and fitted Q evaluation (FQE).
Preferably, the method further comprises the following steps:
and if the monitored control task of the control system changes, adjusting the recommended quantity of the optimized control action based on a target adaptive control strategy and/or a constraint control strategy.
A second aspect of the present application discloses an optimization control apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the current system state of the control system;
the processing unit is used for inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain the recommended quantity of the optimization control action; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and the execution unit is used for executing corresponding optimized control operation based on the optimized control action recommended quantity.
A third aspect of the present application discloses a storage medium, which includes stored instructions, wherein when the instructions are executed, a device in which the storage medium is located is controlled to execute the optimization control method according to any one of the first aspect.
A fourth aspect of the present application discloses an electronic device, comprising a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the optimization control method according to any one of the first aspect.
According to the technical solutions above, the optimization control method and device, storage medium, and electronic device obtain the current system state of the control system; input the current system state into the pre-constructed optimization model for processing based on the preset optimization strategy to obtain the optimized control action recommendation, the optimization model being obtained by jointly constructing the pre-constructed dynamic characteristics model, the pre-constructed behavior strategy model, and the pre-constructed action value function model through a model predictive control framework; and execute the corresponding optimized control operation based on the optimized control action recommendation. Based on this scheme, the optimization model is trained and learned on the offline data set, and optimization control is performed through the optimization model even in the face of a complex objective function and a nonlinear dynamics model, which improves the efficiency of data use and the generality of the optimization control. Moreover, because the optimization model is constructed with the model predictive control framework, it does not need to be retrained even in the face of a new control task objective or a control task with added constraints, which improves the adaptability and control flexibility of the optimization control.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of an optimization control method disclosed in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating obtaining recommended optimal control actions according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an optimized action sequence disclosed in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an optimization control device disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As can be seen from the background art, the existing optimization control method has poor versatility and poor control flexibility.
In order to solve the above problems, embodiments of the present application disclose an optimization control method, an optimization control device, a storage medium, and an electronic device, which achieve the purpose of improving the versatility, the adaptability, and the control flexibility of the optimization control. The specific implementation is specifically illustrated by the following examples.
Referring to fig. 1, a schematic flow chart of an optimization control method disclosed in an embodiment of the present application is shown, where the optimization control method mainly includes the following steps:
s101: and acquiring the current system state of the control system.
In S101, the control system includes a robot control system, an agricultural machine automatic operation control system, a thermal power generation boiler control system, a wind power generation control system, and the like.
The current system state characterizes the current operating condition of the control system.
The current system state differs across control systems. If the control system is a thermal power generation control system, the current system state includes the air temperature state, the air pressure state, the water temperature state, the water pressure state, and the like.
S102: inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; and the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework.
In S102, the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model, and a pre-constructed action value function model through a finite time domain model predictive control framework.
The specific dynamic characteristic model is constructed as follows:
firstly, acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The offline data sets corresponding to different control systems are different. If the control system is a thermal power generation control system, the offline data set includes historical air temperature data, historical air pressure data, historical water temperature data, historical water pressure data, and the like.
The system characteristic data corresponding to different control systems are also different. If the control system is a thermal power generation control system, the system characteristic data includes air temperature data, air pressure data, water temperature data, water pressure data, and the like.
The preset history period is a period of time during which a history data record exists before the current time. The predetermined historical period may be 1 to 2 years or several months, determined primarily by the time of accumulation of historical data in the control system.
And then, constructing a dynamic characteristic model through the deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on an off-line data set.
The dynamic characteristics model is constructed with a deep neural network and trained on the collected offline data set of the control system; its expression is shown in formula (1):

$(r_t, s_{t+1}) = f_m(s_t, a_t)$  (1)

where $r_t$ is the currently obtained reward value, $s_{t+1}$ is the system state at the next moment, $s_t$ is the current system state, $a_t$ is the executed control action, and $f_m$ is the dynamic characteristics model.

The inputs of the dynamic characteristics model are the current system state $s_t$ and the executed control action $a_t$; its outputs are the currently obtained reward value $r_t$ and the next-moment system state $s_{t+1}$. By initializing the dynamic characteristics model with different model parameters, K dynamic characteristics models are obtained through training, where K is an integer greater than or equal to 1. The expressiveness of the dynamic characteristics model is further enhanced by integrating the outputs of the K models (averaging the outputs of all the dynamic characteristics models, or randomly selecting the output of one model).
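For illustration only (not part of the patent text), the following is a minimal sketch of such a K-member dynamics-model ensemble in Python with PyTorch; the names DynamicsModel and ensemble_predict, the layer sizes, and the state/action dimensions are assumptions, not identifiers from the application.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """One ensemble member f_m: (s_t, a_t) -> (r_t, s_{t+1})."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1 + state_dim),  # reward + next state
        )

    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :1], out[..., 1:]  # (r_t, s_{t+1})

# The K members differ only in their random initialization, as in the text.
K = 5
ensemble = [DynamicsModel(state_dim=8, action_dim=2) for _ in range(K)]

def ensemble_predict(s, a):
    """Average the K outputs (the first of the two integration options above)."""
    preds = [m(s, a) for m in ensemble]
    r = torch.stack([p[0] for p in preds]).mean(dim=0)
    s_next = torch.stack([p[1] for p in preds]).mean(dim=0)
    return r, s_next
```

Randomly indexing into `ensemble` instead of averaging would implement the second integration option named above.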
The construction process of the specific behavior strategy model is as follows:
acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
And constructing a behavior strategy model through the deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on an off-line data set.
The behavior strategy model is constructed with a deep neural network and trained on the collected offline data set of the control system; its expression is shown in formula (2):

$a_t = f_b(s_t)$  (2)

where $a_t$ is the control action to be executed and $f_b$ is the behavior strategy model.

The input of the behavior strategy model is the current system state $s_t$, and its output is the executed action $a_t$. By initializing the behavior strategy model with different model parameters, K behavior strategy models are obtained through training, and the expressiveness of the behavior strategy model is further enhanced by integrating the outputs of the K models (averaging the outputs of all the behavior strategy models, or randomly selecting the output of one model).
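Analogously, and again only as an illustrative assumption rather than disclosed code, the behavior strategy model can be sketched as a network fitted by regression to the offline (state, action) pairs; the class name, layer sizes, and mean-squared-error objective are assumptions:

```python
import torch
import torch.nn as nn

class BehaviorPolicy(nn.Module):
    """f_b: s_t -> a_t, fitted to the offline data by behavior cloning."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, s):
        return self.net(s)

def train_behavior_policy(policy, loader, epochs=10):
    """Offline training on batches of (state, action) pairs from the data set."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        for s, a in loader:
            loss = nn.functional.mse_loss(policy(s), a)
            opt.zero_grad(); loss.backward(); opt.step()
```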
The construction process of the action value function model is as follows:
acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The action value function model is constructed from the offline data set through Fitted Q Evaluation (FQE); its expression is shown in formula (3):

$\hat{Q} = \arg\min_{f \in F} \frac{1}{N} \sum_{i=1}^{N} \left( f(s_i, a_i) - y_i \right)^2$  (3)

where $\hat{Q}$ is the action value function; $f(s_i, a_i)$ is the action value function model being trained; F is the selected class of action value function models; $y_i$ is the training target; N is the data volume of the training data set; and i is an integer from 1 to N.

$y_i$ is given by formula (4):

$y_i = r_i + \gamma' \hat{Q}_{k-1}(s_{i+1}, a_{i+1}), \quad (s_i, a_i, r_i, s_{i+1}, a_{i+1}) \in B$  (4)

where $r_i$ is the reward value; $\gamma'$ is the reward discount factor, with $\gamma' < 1$; $\hat{Q}_{k-1}(s_{i+1}, a_{i+1})$ is the value, at $(s_{i+1}, a_{i+1})$, of the action value function obtained in the last training iteration; $s_i$ is the state at time i; $a_i$ is the action at time i; $s_{i+1}$ is the state at time i+1; $a_{i+1}$ is the action at time i+1; and B is the offline data set.

A value function can further be estimated from the action value function, as shown in formula (5):

$V_b(s_t) = \mathbb{E}_{a \sim f_b(s_t)} \left[ Q(s_t, a) \right]$  (5)

where $V_b(s_t)$ is the value function, the expectation is taken over actions sampled according to the behavior strategy, and $Q(s_t, a)$ is the action value function.
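The FQE construction of formulas (3) and (4) amounts to repeated supervised regression toward a bootstrapped target. The sketch below is one plausible reading of that procedure, assuming a generic Q-network that takes concatenated (s, a) input; the iteration count, learning rate, and network interface are assumptions:

```python
import copy
import torch
import torch.nn as nn

def fitted_q_evaluation(q_net, dataset, gamma=0.99, iterations=50):
    """Each iteration regresses Q_k toward y_i = r_i + gamma' * Q_{k-1}(s_{i+1}, a_{i+1})."""
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    for _ in range(iterations):
        q_prev = copy.deepcopy(q_net)  # frozen Q_{k-1} from the last iteration
        for s, a, r, s_next, a_next in dataset:  # transitions from offline set B
            with torch.no_grad():
                y = r + gamma * q_prev(torch.cat([s_next, a_next], dim=-1))
            loss = nn.functional.mse_loss(q_net(torch.cat([s, a], dim=-1)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return q_net
```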
The preset optimization strategies include a trajectory sampling method, a trajectory pruning method, and a trajectory optimization method. The current system state is input into the pre-constructed optimization model and optimized through the trajectory sampling method, the trajectory pruning method, and the trajectory optimization method to obtain an optimal control action sequence $(a_0^*, a_1^*, \ldots, a_{H-1}^*)$; the action $a_0^*$ at the current moment is then selected from the optimal control action sequence as the optimized control action recommendation.
Construction of the optimization model:

The optimization model is obtained by jointly constructing the dynamic characteristics model, the pre-constructed behavior strategy model, and the pre-constructed action value function model through a finite-horizon model predictive control framework; its expression is shown in formula (6):

$\max_{a_0, \ldots, a_{H-1}} \; \sum_{t=0}^{H-1} r_t + V_b(s_H) \quad \text{s.t.} \; (r_t, s_{t+1}) = f_m(s_t, a_t), \; s_0 = s_{init}$  (6)

where $r_t$ is the reward value at time t and $V_b(s_H)$ is the value of the value function in state $s_H$. The constraint of the optimization problem is the dynamic characteristics model $(r_t, s_{t+1}) = f_m(s_t, a_t)$.

At each step, given the current system state $s_0$, an optimal control action sequence of length H can be obtained by solving the above finite-horizon optimization problem, where $s_0 = s_{init}$ and $s_{init}$ is the characteristic value of the current system state.
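To make the objective of formula (6) concrete, the sketch below scores one candidate action sequence by rolling it out through the learned dynamics and adding the terminal value; ensemble_predict is the assumed helper from the dynamics sketch above, and v_b stands for an assumed callable estimating $V_b$:

```python
def rollout_return(s0, actions, v_b):
    """Objective of the finite-horizon problem in formula (6):
    sum of predicted rewards over the horizon plus terminal value V_b(s_H)."""
    s, total = s0, 0.0
    for a in actions:                  # H candidate actions a_0 .. a_{H-1}
        r, s = ensemble_predict(s, a)  # constraint: (r_t, s_{t+1}) = f_m(s_t, a_t)
        total = total + r
    return total + v_b(s)              # s is now s_H
```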
Specifically, the process of inputting the current system state into the pre-constructed optimization model for processing based on the preset optimization strategy, to obtain the optimized control action recommendation, is shown in A1-A5.
A1: carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
Wherein, the state s of the system at the current time is used as a starting point, and a dynamic characteristic model f is utilizedmAnd behavior strategy model fbAnd carrying out track sampling, and simulating to generate N control tracks with the length of H, wherein the value of H is an integer greater than 1, and the sampling method is a track sampling method under the guidance of an action value function.
Taking the generation of one of the trajectories as an example: at the t-th time step on the control trajectory, the current system state predicted by the dynamic characteristics model $f_m$ is $s_t$. Define $\mu_a(s_t)$ and $\sigma_a(s_t)$ as the mean and standard deviation of the action distribution generated by the behavior strategy $f_b(s_t)$ at this time.

The mean $\mu_a(s_t)$ of the action distribution generated by the behavior strategy $f_b(s_t)$ is shown in formula (7):

$\mu_a(s_t) = \left[ \mu_a^{(1)}(s_t), \ldots, \mu_a^{(|A|)}(s_t) \right]^T$  (7)

where $\mu_a(s_t)$ contains the mean of each dimension of the action distribution, A is the action space, and T is the transpose symbol.

The standard deviation $\sigma_a(s_t)$ of the action distribution generated by the behavior strategy $f_b(s_t)$ is shown in formula (8):

$\sigma_a(s_t) = \left[ \sigma_a^{(1)}(s_t), \ldots, \sigma_a^{(|A|)}(s_t) \right]^T$  (8)

where $\sigma_a(s_t)$ contains the standard deviation of each dimension of the action distribution.
at time t, the control action is sampled by equations 9, 10 and 11.
Figure BDA0003556154490000091
Wherein the content of the first and second substances,
Figure BDA0003556154490000092
for the action obtained by sampling, N represents a normal distribution and a constant σMScaling the standard deviation of the distribution of each dimension of motion by a standard deviation scaling factor to adjust the aggressiveness, σ, of the motion sampleM>0。
Figure BDA0003556154490000093
Wherein M istA set of m sampled actions; t is the time; h is the length of the control track.
Figure BDA0003556154490000094
Wherein Q isb(stA) is a function of action value;
Figure BDA0003556154490000095
to be MtSubstituting each action into action value function Qb(stA) corresponding actions when the maximum result is obtained after calculation respectively; h is the length of the control track.
Selecting an action value function Qb(stA) maximum action
Figure BDA0003556154490000096
Is transported last timeCarrying out the optimal control sequence obtained by solving the solving process
Figure BDA0003556154490000097
Control action of corresponding time
Figure BDA0003556154490000098
And mixing according to the mixing coefficient beta to obtain sampling action
Figure BDA0003556154490000099
Figure BDA00035561544900000910
The formula (2) is shown in the formula (12).
Figure BDA00035561544900000911
Wherein the content of the first and second substances,
Figure BDA00035561544900000912
is a sampling action; beta is a mixing coefficient;
Figure BDA00035561544900000913
a control action at time t + 1;
Figure BDA00035561544900000914
the control action at the time H;
Figure BDA00035561544900000915
the control action at the time H-1.
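A hedged sketch of the sampling step in formulas (9) to (12) follows; m, sigma_M, and beta are the quantities named in the text, prev_plan is an assumed tensor holding the last solving process's optimal sequence, and the batched q_b interface is an assumption:

```python
import torch

def sample_action(s_t, t, mu, sigma, q_b, prev_plan, m=64, sigma_M=1.5, beta=0.7):
    """Value-guided sampling: draw m candidates (formulas 9 and 10), take the
    Q_b-argmax over M_t (formula 11), and mix with the shifted previous
    solution (formula 12)."""
    dist = torch.distributions.Normal(mu(s_t), sigma_M * sigma(s_t))
    candidates = dist.sample((m,))                 # the set M_t of m actions
    q_vals = q_b(s_t.expand(m, -1), candidates)    # Q_b(s_t, a) for each a in M_t
    a_tilde = candidates[q_vals.argmax()]          # argmax action from M_t
    # Shift the previous plan by one step; at t = H-1 reuse its last action.
    a_prev = prev_plan[min(t + 1, len(prev_plan) - 1)]
    return beta * a_tilde + (1.0 - beta) * a_prev  # formula (12)
```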
A2: an original trajectory sequence of N control trajectories is obtained.
For each of the N control trajectories, the action $a_t$ sampled in system state $s_t$ is obtained; in state $s_t$, the reward $r_t$ and the system state $s_{t+1}$ at the next moment of the control trajectory can be calculated with the dynamic characteristics model $f_m(s_t, a_t)$. Following this sampling method for a single control trajectory, and taking the current system state s as the starting point, the original trajectory sequences $T = \{T_1, \ldots, T_N\}$ of the N control trajectories are obtained by sampling. Each original trajectory sequence $T_n$ contains H state-action pairs, as shown in formula (13):

$T_n = \left\{ (s_0^n, a_0^n), (s_1^n, a_1^n), \ldots, (s_{H-1}^n, a_{H-1}^n) \right\}$  (13)

where $T_n$ is the n-th trajectory sequence; $(s_t^n, a_t^n)$ is the state-action pair at time t in the n-th trajectory sequence; H is the length of the control trajectory and is an integer greater than 1; t is the time, an integer from 0 to H-1; and N is the number of original trajectory sequences, an integer greater than or equal to 1.
A3: and selecting a target track sequence set meeting preset conditions from the original track sequences.
The trajectory pruning method is based on estimating dynamics uncertainty. For the original trajectory sequences $T = \{T_1, \ldots, T_N\}$ obtained by the trajectory sampling method above, undesirable trajectory sequences are deleted with a dynamics-uncertainty measurement (the trajectory pruning method), yielding a set of target trajectory sequences that meets the preset condition.

The dynamics uncertainty is measured as shown in formula (14):

$\mathrm{disc}(s, a) = \max_{i, j \in \{1, \ldots, K\}} \left\| f_m^i(s, a) - f_m^j(s, a) \right\|$  (14)

where disc(s, a) is the dynamics uncertainty; $f_m^l$, $l \in \{1, \ldots, K\}$, are the K dynamic characteristics models obtained by training; and i and j each range over the integers from 1 to K.

The dynamics uncertainty of every state-action pair in the original trajectory sequences $T = \{T_1, \ldots, T_N\}$ is calculated with this measurement. For any trajectory sequence, if the uncertainty value of any contained state-action pair is greater than a given uncertainty threshold L, the whole trajectory sequence is removed from $T = \{T_1, \ldots, T_N\}$, yielding the set of target trajectory sequences $T_f$.

The value of the uncertainty threshold L is set by a technician according to the actual situation and is not specifically limited in the present application.

If fewer than $N_m$ trajectory sequences remain after this operation ($N_m$ is a constant smaller than N), the uncertainty values within each trajectory sequence are summed, and the first $N_m$ trajectory sequences with the smallest uncertainty sums are taken as the set of target trajectory sequences $T_f$.
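A sketch of the disagreement measurement and pruning rule of formula (14), assuming the K-member ensemble from the earlier dynamics sketch; L and N_m are the threshold and fallback count named in the text:

```python
import itertools
import torch

def disagreement(s, a, ensemble):
    """disc(s, a): maximum pairwise gap between ensemble predictions (formula 14)."""
    preds = [torch.cat(m(s, a), dim=-1) for m in ensemble]  # concat (r, s') per member
    return max(torch.norm(p_i - p_j).item()
               for p_i, p_j in itertools.combinations(preds, 2))

def prune(trajectories, ensemble, L, N_m):
    """Drop any trajectory containing a pair with disc > L; if fewer than N_m
    survive, keep the N_m trajectories with the smallest summed disc."""
    scored = [(traj, [disagreement(s, a, ensemble) for s, a in traj])
              for traj in trajectories]
    kept = [traj for traj, d in scored if max(d) <= L]
    if len(kept) < N_m:
        kept = [traj for traj, _ in sorted(scored, key=lambda x: sum(x[1]))[:N_m]]
    return kept
```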
A4: and carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
Specifically, the process of performing trajectory optimization on each trajectory in the set of target trajectory sequences to obtain the optimized action sequence is as follows.

First, the reward values of each trajectory in the set of target trajectory sequences are summed to obtain the cumulative reward values.

With the cumulative-reward trajectory optimization method, the per-step rewards r in each trajectory of the set of target trajectory sequences $T_f$ are accumulated and summed to obtain the cumulative reward of each trajectory sequence, as shown in formula (15):

$R_f = \left\{ R_1, \ldots, R_{|T_f|} \right\}$  (15)

where $R_f$ contains the cumulative reward of each trajectory sequence, $R_1$ is the cumulative reward of the first trajectory sequence in the set of target trajectory sequences $T_f$, and $R_{|T_f|}$ is the cumulative reward of the last trajectory sequence in $T_f$.

The cumulative sum of the per-step rewards r in each trajectory of the set of target trajectory sequences $T_f$ is calculated as shown in formula (16):

$R_n = \sum_{t=0}^{H-1} r_t + V_b(s_H)$  (16)

where $R_n$ is the cumulative reward sum of trajectory $T_n$ in $T_f$; $r_t$ is the reward value at time t in the trajectory; H is the trajectory length; t ranges from 0 to H-1; and $V_b(s_H)$ is the value of the value function in state $s_H$.

Then, the actions of each trajectory in the set of target trajectory sequences are weighted by the cumulative reward values to obtain the optimized action sequence, as shown in formula (17):

$a_t^* = \frac{\sum_{T_n \in T_f} e^{\kappa R_n} \cdot a_t^n}{\sum_{T_n \in T_f} e^{\kappa R_n}}, \quad t = 0, \ldots, H-1$  (17)

where $(a_0^*, \ldots, a_{H-1}^*)$ is the optimized action sequence; κ is a weighting factor; $R_n$ is the cumulative reward sum of trajectory $T_n$; $a_t^n$ is the action at the t-th time step of trajectory sequence $T_n \in T_f$; $T_f$ is the set of target trajectory sequences; H is the length of the control trajectory; and n is an integer from 1 to the total number of target trajectory sequences.
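A sketch of the reward-weighted averaging of formula (17); the exponential (softmax) weights follow the reconstruction above, with kappa the weighting factor:

```python
import torch

def optimize_actions(action_seqs, returns, kappa=1.0):
    """action_seqs: [n_traj, H, action_dim] actions of the trajectories in T_f;
    returns: [n_traj] cumulative rewards R_n from formula (16).
    Returns the optimized action sequence a*_{0:H-1} of formula (17)."""
    w = torch.softmax(kappa * returns, dim=0)           # e^{kappa R_n} / sum e^{kappa R_n}
    return (w[:, None, None] * action_seqs).sum(dim=0)  # per-step weighted average
```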
A5: and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
The optimal control action sequence $(a_0^*, \ldots, a_{H-1}^*)$ of the system in the current state s is obtained through formula (17), and the action at the current moment in the optimal control action sequence (the first action, $a_0^*$) is returned as the optimized control action recommendation.
At the next moment of the system, the optimized control action recommendation can again be obtained according to the calculation flow A1-A5.
S103: and executing the corresponding optimized control operation based on the optimized control action recommended quantity.
Different optimized control action recommendations correspond to different optimized control operations being executed.
For example, if the control system is a wind power generation control system and the optimized control action recommendation is a fan rotation speed of 1300 r/min, the executed optimized control operation is to control the fan rotation speed to 1300 r/min.
Model training is carried out through the collected offline data set of the control system, an optimization model is built based on a finite time domain model prediction control framework, the optimization model is solved by using a track optimization method to obtain an optimized control action, and optimization control of the system is achieved.
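Assembling the sketches above, one receding-horizon step (S101 to S103 via A1 to A5) might look as follows; every helper refers to the illustrative sketches earlier in this description, and all horizon and threshold values are placeholder assumptions:

```python
import torch

def control_step(s0, mu, sigma, q_b, v_b, prev_plan, ensemble,
                 H=20, N=100, L=1.0, N_m=10):
    """One optimization-control step: sample N trajectories of length H (A1, A2),
    prune by dynamics disagreement (A3), reward-weight the survivors (A4),
    and return the first action of the plan (A5)."""
    trajs, rets = [], []
    for _ in range(N):
        s, traj, ret = s0, [], 0.0
        for t in range(H):
            a = sample_action(s, t, mu, sigma, q_b, prev_plan)
            r, s_next = ensemble_predict(s, a)
            traj.append((s, a)); ret = ret + r; s = s_next
        trajs.append(traj); rets.append(ret + v_b(s))   # formula (16), with V_b(s_H)
    discs = [[disagreement(s, a, ensemble) for s, a in tr] for tr in trajs]
    idx = [i for i in range(N) if max(discs[i]) <= L]   # threshold pruning
    if len(idx) < N_m:                                  # fallback: smallest summed disc
        idx = sorted(range(N), key=lambda i: sum(discs[i]))[:N_m]
    acts = torch.stack([torch.stack([a for _, a in trajs[i]]) for i in idx])
    R = torch.stack([rets[i].reshape(()) for i in idx])
    plan = optimize_actions(acts, R)                    # formula (17)
    return plan[0], plan                                # execute a*_0; keep the plan
```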
Optionally, if the control task of the monitored control system changes, the recommended amount of the optimal control action is adjusted based on the target adaptive control strategy and/or the constraint control strategy.
For a given control task objective, a new scenario or a new control task objective may arise; that is, the new objective differs from the objective optimized by the real behavior strategy in the stage that generated the offline data. Even in this case, the present technical solution retains strong adaptability to the new objective.
To facilitate understanding of the process of the new target changing compared to the target optimized by the real behavior strategy in the generation of the offline data phase, the following examples are given:
for example, the initial control task goal is to improve the power generation efficiency of the wind power generation control system, and when the new goal is changed from the goal (initial control task goal) optimized by the real behavior strategy in the stage of generating the offline data, the temperature of the fan of the wind power generation control system is controlled within a certain temperature range while the power generation efficiency of the control system is improved.
Control-optimization task objective changes fall mainly into two types: first, a target-adaptive control strategy, in which the original task objective reward is adjusted so that the control policy is optimized toward the new objective; second, a constrained control strategy, in which limiting conditions on the system state are added during control to constrain the control actions while the original task reward function remains unchanged.
Target adaptive control strategy:
for new control objectives, a new reward function r is definednew=fobj(s, a) calculating a jackpot for the sampled sequence of tracks using the new reward function. Specifically, R 'is modified through calculation of track cumulative rewards'n=∑tfobj(st,at) And changes the calculation of the optimal action sequence as shown in equation (18).
Figure BDA0003556154490000121
And (3) constraint control strategy:
for a control task added with system state constraint, not only the original optimization target needs to be ensured, but also a control strategy capable of meeting the constraint condition needs to be searched. The technical scheme has two ways to adapt to the control situation:
adding a penalty based on the constraint condition to the original reward function. The penalty function defining the constraint is fp(s), then the new track sequence jackpot calculation is R'n=∑t(α·rt+(1-α)·fp(st)),rtThe original goal is awarded and the calculation of the optimal sequence of actions is varied as shown in equation (19).
Figure BDA0003556154490000131
Solving equation (19) results in a control strategy that is adaptive to the constraints.
The second is to add a constraint penalty in the trajectory pruning stage. A constraint-based penalty function $f'_p(s)$ is defined, and the uncertainty estimate is adjusted as shown in formula (20):

$\mathrm{disc}'(s, a) = \mathrm{disc}(s, a) + f'_p(s)$  (20)
By taking constraint conditions into consideration in the trajectory pruning stage, adaptation to constrained control is achieved without changing the subsequent calculation process.
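Finally, a brief sketch of the two adaptation mechanisms of formulas (18) to (20); f_obj, f_p, alpha, and f_p2 (standing for $f'_p$) are the quantities named in the text with bodies to be supplied per task, and disagreement is the assumed helper from the pruning sketch:

```python
def adapted_return(traj, f_obj):
    """Target-adaptive strategy: re-score a sampled trajectory with the new
    reward r_new = f_obj(s, a), giving R'_n for formula (18)."""
    return sum(f_obj(s, a) for s, a in traj)

def penalized_return(traj, rewards, f_p, alpha=0.5):
    """Constraint strategy, first variant: blend the original rewards with a
    state-constraint penalty, giving R'_n for formula (19)."""
    return sum(alpha * r + (1.0 - alpha) * f_p(s)
               for (s, _), r in zip(traj, rewards))

def penalized_disagreement(s, a, ensemble, f_p2):
    """Constraint strategy, second variant: fold the penalty f'_p into the
    pruning statistic, per the reconstruction of formula (20)."""
    return disagreement(s, a, ensemble) + f_p2(s)
```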
In the embodiment of the application, the training and learning of the optimization model are carried out through the off-line data set, and when a complex objective function and a nonlinear dynamical model are faced, the optimization control is carried out through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. And an optimization model is constructed by adopting a model predictive control framework, so that even if a new control task target or a control task with additional constraint is faced, retraining and learning of the optimization model are not needed, and the adaptability and the control flexibility of optimization control are improved.
Referring to fig. 2, a process involved in S102 that the current state of the system is input to a pre-constructed optimization model based on a preset optimization strategy to perform optimization processing to obtain a recommended amount of an optimization control action mainly includes the following steps:
s201: carrying out track sampling based on a dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
S202: an original trajectory sequence of N control trajectories is obtained.
S203: and selecting a target track sequence set meeting preset conditions from the original track sequence.
S204: and carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
S205: and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
The execution principle of S201-S205 is consistent with the execution principle of S102, and may be referred to herein, and is not described herein again.
In the embodiment of the application, each track in the target track sequence set is optimized through a track optimization method, so that the purpose of obtaining an optimized action sequence is achieved.
Referring to fig. 3, a process of performing trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence in S204 mainly includes the following steps:
s301: and summing the reward values of all tracks in the target track sequence set to obtain a cumulative reward value.
S302: and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
The execution principle of S301-S302 is consistent with the execution principle of S204, and it can be referred to here, which is not described again.
In the embodiment of the application, the reward values of all tracks in the target track sequence set are summed to obtain the accumulated reward value, and the action of all tracks in the target track sequence set is weighted and calculated through the accumulated reward value to achieve the purpose of obtaining the optimized action sequence.
Based on the optimization control method disclosed in fig. 1 in the foregoing embodiment, an optimization control apparatus is correspondingly disclosed in the embodiment of the present application, and as shown in fig. 4, the optimization control apparatus includes an obtaining unit 401, a processing unit 402, and an executing unit 403.
An obtaining unit 401 is configured to obtain a system current state of the control system.
The processing unit 402 is configured to input a current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy, so as to obtain an optimization control action recommendation amount; and the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework.
An executing unit 403, configured to execute a corresponding optimal control operation based on the optimal control action recommendation amount.
Further, the processing unit 402 includes a sampling module, a first obtaining module, a first selecting module, an optimizing module, and a second selecting module.
The sampling module is used for carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
The first acquisition module is used for acquiring an original track sequence of the N control tracks.
The first selection module is used for selecting a target track sequence set which meets preset conditions from the original track sequence.
And the optimization module is used for carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
And the second selection module is used for selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
Further, the optimization module includes a summation sub-module and a calculation sub-module.
And the summation submodule is used for summing the reward values of all the tracks in the target track sequence set to obtain the accumulated reward value.
And the calculation submodule is used for performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward values to obtain an optimized action sequence.
Further, the processing unit 402 of the building process of the dynamic characteristics model includes a second obtaining module and a first building module.
The second acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The first construction module is used for constructing a dynamic characteristic model through the deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on an off-line data set.
Further, the processing unit 402 of the construction process of the behavior policy model includes a third obtaining module and a second constructing module.
The third acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system in a preset history.
And the second construction module is used for constructing the behavior strategy model through the deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
Further, the processing unit 402 of the construction process of the action value function model includes a fourth obtaining module and a third constructing module.
The fourth acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system in a preset history.
A third building module for building the action value function model from the offline dataset and the fitted cost function assessment FQE.
Further, the optimization control device further comprises an adjusting unit.
And the adjusting unit is used for adjusting the recommended quantity of the optimized control action based on the target adaptive control strategy and/or the constraint control strategy if the control task of the monitored control system changes.
In the embodiment of the application, the training and learning of the optimization model are carried out through the off-line data set, and when a complex objective function and a nonlinear dynamical model are faced, the optimization control is carried out through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. And moreover, the model predictive control framework is adopted to construct the optimization model, so that even if a new control task target or a control task with additional constraint is faced, retraining and learning of the optimization model are not needed, and the adaptability and the control flexibility of the optimization control method are improved.
The embodiment of the application also provides a storage medium, wherein the storage medium comprises stored instructions, and when the instructions are executed, the equipment where the storage medium is located is controlled to execute the optimization control method.
The embodiment of the present application further provides an electronic device, whose schematic structural diagram is shown in fig. 5. It specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and configured to be executed by one or more processors 503 to perform the above-mentioned optimization control method.
The specific implementation procedures and derivatives thereof of the above embodiments are within the scope of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that, as will be apparent to those skilled in the art, numerous modifications and adaptations can be made without departing from the principles of the present application and such modifications and adaptations are intended to be considered within the scope of the present application.

Claims (10)

1. An optimization control method, characterized in that the method comprises:
acquiring the current system state of a control system;
inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and executing corresponding optimized control operation based on the optimized control action recommended quantity.
2. The method according to claim 1, wherein the inputting the current state of the system into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain a recommended amount of optimization control actions comprises:
carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer greater than or equal to 1;
acquiring an original track sequence of the N control tracks;
selecting a target track sequence set which meets a preset condition from the original track sequence;
carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence;
and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
3. The method according to claim 2, wherein the performing trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence comprises:
summing the reward values of all tracks in the target track sequence set to obtain a cumulative reward value;
and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
4. The method of claim 1, wherein the dynamic characteristics model is constructed by a process comprising:
acquiring an offline data set of a control system; the off-line data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the dynamic characteristic model through a deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on the off-line data set.
5. The method of claim 1, wherein the behavior strategy model is constructed by a process comprising:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the behavior strategy model through a deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
6. The method of claim 1, wherein the act-value function model is constructed by:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
the action value function model is constructed from the offline data set and fitted Q evaluation (FQE).
7. The method of claim 1, further comprising:
and if the monitored control task of the control system changes, adjusting the recommended quantity of the optimized control action based on a target adaptive control strategy and/or a constraint control strategy.
8. An optimization control apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire the current system state of a control system;
a processing unit configured to input the current system state into a pre-constructed optimization model for optimization based on a preset optimization strategy to obtain an optimized control action recommendation, wherein the optimization model is constructed jointly, within a model predictive control framework, from a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action-value function model;
and an execution unit configured to execute the corresponding optimized control operation based on the optimized control action recommendation.
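Read as software, the three units map onto a simple acquisition, processing and execution pipeline; a structural sketch with all names illustrative:

```python
class OptimizationController:
    """Acquisition -> processing -> execution pipeline of the apparatus."""
    def __init__(self, optimizer, actuator):
        self.optimizer = optimizer   # wraps the pre-constructed optimization model
        self.actuator = actuator     # interface to the controlled system

    def step(self, system):
        state = system.read_state()                # acquisition unit
        action = self.optimizer.recommend(state)   # processing unit
        self.actuator.apply(action)                # execution unit
```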
9. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform the optimization control method of any one of claims 1 to 7.
10. An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the optimization control method of any of claims 1-7.
CN202210277509.5A 2022-03-21 2022-03-21 Optimization control method and device, storage medium and electronic equipment Pending CN114625091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210277509.5A CN114625091A (en) 2022-03-21 2022-03-21 Optimization control method and device, storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN114625091A 2022-06-14

Family

ID=81904254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210277509.5A Pending CN114625091A (en) 2022-03-21 2022-03-21 Optimization control method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114625091A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950690A (en) * 2019-05-15 2020-11-17 天津科技大学 Efficient reinforcement learning strategy model with self-adaptive capacity
CN112884130A (en) * 2021-03-16 2021-06-01 浙江工业大学 SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113363997A (en) * 2021-05-28 2021-09-07 浙江大学 Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113759708A (en) * 2021-02-09 2021-12-07 京东城市(北京)数字科技有限公司 System optimization control method and device and electronic equipment
CN113885607A (en) * 2021-10-20 2022-01-04 京东城市(北京)数字科技有限公司 Steam temperature control method and device, electronic equipment and computer storage medium
CN113935463A (en) * 2021-09-30 2022-01-14 南方电网数字电网研究院有限公司 Microgrid controller based on artificial intelligence control method
CN114011564A (en) * 2021-10-22 2022-02-08 内蒙古京能康巴什热电有限公司 Coal mill control optimization method based on model offline planning
CN114065452A (en) * 2021-11-17 2022-02-18 国家电网有限公司华东分部 Power grid topology optimization and power flow control method based on deep reinforcement learning


Similar Documents

Publication Publication Date Title
Kumar et al. A deep learning architecture for predictive control
US9298172B2 (en) Method and apparatus for improved reward-based learning using adaptive distance metrics
JP5448841B2 (en) Method for computer-aided closed-loop control and / or open-loop control of technical systems, in particular gas turbines
Arif et al. Incorporation of experience in iterative learning controllers using locally weighted learning
Goulart et al. Autonomous pH control by reinforcement learning for electroplating industry wastewater
CN111260124A (en) Chaos time sequence prediction method based on attention mechanism deep learning
EP3704550B1 (en) Generation of a control system for a target system
Reyes-Reyes et al. Bounded neuro-control position regulation for a geared DC motor
CN112445136B (en) Thickener prediction control method and system based on continuous time neural network
Marwala Finite-element-model Updating Using the Response-surface Method
CN111930010A (en) LSTM network-based general MFA controller design method
Marusak A numerically efficient fuzzy MPC algorithm with fast generation of the control signal
Agyeman et al. LSTM-based model predictive control with discrete actuators for irrigation scheduling
Serrano-Pérez et al. Offline robust tuning of the motion control for omnidirectional mobile robots
Saber et al. Real-time optimization for an AVR system using enhanced Harris Hawk and IIoT
CN110454322B (en) Water turbine speed regulation control method, device and system based on multivariable dynamic matrix
CN114625091A (en) Optimization control method and device, storage medium and electronic equipment
Kordon Hybrid intelligent systems for industrial data analysis
CN111356959B (en) Method for computer-aided control of a technical system
CN114839861A (en) Intelligent PID controller online optimization method and system
JP2023106043A (en) Driving assist system, driving assist method, and program
Yang et al. Intelligent Forecasting System Using Grey Model Combined with Neural Network.
Kocijan et al. Application of Gaussian processes to the modelling and control in process engineering
Kumar et al. Architecture, performance and stability analysis of a formula-based fuzzy I− fuzzy P− fuzzy D controller
Hachiya et al. Efficient sample reuse in EM-based policy search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination