CN114625091A - Optimization control method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN114625091A (application number CN202210277509.5A)
- Authority
- CN
- China
- Prior art keywords
- control
- optimization
- model
- action
- constructed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B19/00—Programme-control systems
- G05B19/02—Programme-control systems electric
- G05B19/418—Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
- G05B19/41885—Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by modeling, simulation of the manufacturing system
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/32—Operator till task planning
- G05B2219/32339—Object oriented modeling, design, analysis, implementation, simulation language
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Landscapes
- Engineering & Computer Science (AREA)
- Manufacturing & Machinery (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The application discloses an optimization control method, an optimization control device, a storage medium and electronic equipment. The method acquires the current system state of a control system, inputs it into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain an optimization control action recommendation, and executes the corresponding optimization control operation based on that recommendation. On this basis, training and learning of the optimization model are performed through an offline data set, and when a complex objective function and a nonlinear dynamics model are faced, optimization control is performed through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. Moreover, because the optimization model is constructed within a model predictive control framework, even when a new control task target or a control task with additional constraints is faced, the optimization model does not need to be retrained, improving the adaptability and control flexibility of the optimization control method.
Description
Technical Field
The present application relates to the field of automatic control technologies, and in particular, to an optimization control method, apparatus, storage medium, and electronic device.
Background
In the production links of various industries, a large number of links of system operation control exist: such as robot control, automatic operation systems of agricultural machinery, automatic control links of intelligent manufacturing, various operation control systems in the fields of energy, chemical engineering, metallurgy and the like in the industrial industry, and the like. By optimizing and controlling the control systems, the utilization efficiency of resources can be improved, the waste of time, materials, energy and the like is reduced, the competitiveness of the industrial industry is improved, the green development target of energy conservation and emission reduction is realized, and the method has great significance for the development and progress of the industry.
Conventional optimization control methods in industrial control applications include, for example, Proportional-Integral-Derivative (PID) controllers and Model Predictive Control (MPC) controllers.
In the optimization problem of a complex control system, the traditional optimization control method has poor effect. On one hand, the solving capability of the traditional optimization control method limits the optimization effect of the traditional optimization control method when the traditional optimization control method faces increasingly complex control systems; on the other hand, the traditional optimization control method lacks effective utilization of mass data precipitated in a control system, and seriously depends on human experience, theoretical derivation or simulation environment consistent with real environment during model design, and in the face of complex targets and nonlinear dynamic models, the solution is difficult and inefficient, so that the control method lacks universality. In addition, a large amount of computing resources are needed in the learning process of the control system offline strategy, and the application is limited in the scene of limited computing resources; the method is lack of adaptability to the change of control task targets and poor in control flexibility.
Therefore, the conventional optimization control method has poor generality and poor control flexibility.
Disclosure of Invention
In view of this, the present application discloses an optimization control method, an optimization control device, a storage medium, and an electronic apparatus, which aim to improve the versatility, the adaptability, and the control flexibility of the optimization control method.
In order to achieve the purpose, the technical scheme is as follows:
the first aspect of the present application discloses an optimization control method, including:
acquiring the current system state of a control system;
inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and executing corresponding optimized control operation based on the optimized control action recommended quantity.
Preferably, the inputting the current state of the system into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain a recommended amount of the optimization control action includes:
carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer greater than or equal to 1;
acquiring an original track sequence of the N control tracks;
selecting a target track sequence set which meets a preset condition from the original track sequence;
carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence;
and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
Preferably, the performing the trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence includes:
summing the reward values of all tracks in the target track sequence set to obtain an accumulated reward value;
and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
Preferably, the process of constructing the dynamic characteristics model includes:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the dynamic characteristic model through a deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on the off-line data set.
Preferably, the construction process of the behavior strategy model includes:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the behavior strategy model through a deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
Preferably, the process of constructing the action value function model includes:
acquiring an offline data set of a control system; the off-line data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
the action value function model is constructed from the offline data set through fitted Q evaluation (FQE).
Preferably, the method further comprises the following steps:
and if the monitored control task of the control system changes, adjusting the recommended quantity of the optimized control action based on a target adaptive control strategy and/or a constraint control strategy.
A second aspect of the present application discloses an optimization control apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the current system state of the control system;
the processing unit is used for inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain the recommended quantity of the optimization control action; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and the execution unit is used for executing corresponding optimized control operation based on the optimized control action recommended quantity.
A third aspect of the present application discloses a storage medium, which includes stored instructions, wherein when the instructions are executed, a device in which the storage medium is located is controlled to execute the optimization control method according to any one of the first aspect.
A fourth aspect of the present application discloses an electronic device, comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the optimization control method according to any one of the first aspect.
According to the technical scheme, the optimization control method, the optimization control device, the storage medium and the electronic equipment are used for obtaining the current system state of the control system, inputting the current system state to the pre-constructed optimization model based on the preset optimization strategy for processing to obtain the optimization control action recommendation quantity, jointly constructing the pre-constructed dynamic characteristic model, the pre-constructed behavior strategy model and the pre-constructed action value function model through the model prediction control framework by the optimization model, and executing the corresponding optimization control operation based on the optimization control action recommendation quantity. Based on the scheme, the training and learning of the optimization model are carried out through the off-line data set, and when a complex objective function and a nonlinear dynamic model are faced, the optimization control is carried out through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. And moreover, the model predictive control framework is adopted to construct the optimization model, and even if a new control task target or a control task with added constraint is faced, the optimization model does not need to be retrained and learned, so that the adaptability and the control flexibility of the optimization control are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of an optimization control method disclosed in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating obtaining recommended optimal control actions according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an optimized action sequence disclosed in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an optimization control device disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As can be seen from the background art, the existing optimization control method has poor versatility and poor control flexibility.
In order to solve the above problems, embodiments of the present application disclose an optimization control method, an optimization control device, a storage medium, and an electronic device, which achieve the purpose of improving the versatility, the adaptability, and the control flexibility of the optimization control. The specific implementation is specifically illustrated by the following examples.
Referring to fig. 1, a schematic flow chart of an optimization control method disclosed in an embodiment of the present application is shown, where the optimization control method mainly includes the following steps:
s101: and acquiring the current system state of the control system.
In S101, the control system includes a robot control system, an agricultural machine automatic operation control system, a thermal power generation boiler control system, a wind power generation control system, and the like.
The current system state is used to characterize the current operating condition of the control system.
The current system state differs across control systems. If the control system is a thermal power generation control system, the current system state includes the air temperature state, the air pressure state, the water temperature state, the water pressure state, and the like.
S102: inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; and the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework.
In S102, the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model, and a pre-constructed action value function model through a finite time domain model predictive control framework.
The specific dynamic characteristic model is constructed as follows:
firstly, acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The offline data sets corresponding to different control systems differ. If the control system is a thermal power generation control system, the offline data set includes historical air temperature data, historical air pressure data, historical water temperature data, historical water pressure data, and the like.
The system characteristic data corresponding to different control systems also differ. For a thermal power generation control system, the system characteristic data includes air temperature data, air pressure data, water temperature data, water pressure data, and the like.
The preset history period is a period of time during which a history data record exists before the current time. The predetermined historical period may be 1 to 2 years or several months, determined primarily by the time of accumulation of historical data in the control system.
And then, constructing a dynamic characteristic model through the deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on an off-line data set.
The dynamic characteristic model is constructed as a deep neural network and trained on the collected offline data set of the control system; its expression is shown in formula (1).

(r_t, s_{t+1}) = f_m(s_t, a_t)    (1)

where r_t is the currently obtained reward value; s_{t+1} is the system state at the next moment; s_t is the current system state; a_t is the executed control action; and f_m is the dynamic characteristic model.

The input of the dynamic characteristic model is the current system state s_t and the executed control action a_t, and the output is the currently obtained reward value r_t and the next-moment system state s_{t+1}. By initializing the dynamic characteristic model with K different sets of model parameters, K dynamic characteristic models are obtained through training, where K is an integer greater than or equal to 1. The expressiveness of the dynamic characteristic model is further enhanced by integrating the outputs of the K models (averaging the outputs of all K dynamic characteristic models, or randomly selecting one model's output).
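As an illustration of the K-model ensemble described above, the following is a minimal Python sketch. Linear maps stand in for the patent's deep neural networks so the code stays self-contained, and all class and parameter names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class DynamicsEnsemble:
    """Ensemble of K dynamics models f_m: (s_t, a_t) -> (r_t, s_{t+1}).

    Each member gets its own random initialization, mirroring the
    'different initial model parameters' idea; a real implementation
    would train each member on the offline data set.
    """
    def __init__(self, state_dim, action_dim, K=5):
        # each member maps [s; a] -> [s_next; r]
        self.W = [rng.normal(scale=0.1, size=(state_dim + 1, state_dim + action_dim))
                  for _ in range(K)]

    def predict(self, s, a, mode="mean"):
        x = np.concatenate([s, a])
        outs = np.stack([W @ x for W in self.W])   # (K, state_dim + 1)
        if mode == "mean":                         # average all member outputs
            out = outs.mean(axis=0)
        else:                                      # or pick one member at random
            out = outs[rng.integers(len(outs))]
        return out[-1], out[:-1]                   # (r_t, s_{t+1})

ens = DynamicsEnsemble(state_dim=3, action_dim=2)
r, s_next = ens.predict(np.zeros(3), np.ones(2))
```

The averaging mode smooths prediction noise, while the random-member mode preserves the ensemble's diversity for uncertainty-aware sampling.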
The construction process of the specific behavior strategy model is as follows:
acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
And constructing a behavior strategy model through the deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on an off-line data set.
The behavior strategy model is constructed as a deep neural network and trained on the collected offline data set of the control system; its expression is shown in formula (2).

a_t = f_b(s_t)    (2)

where a_t is the control action to be executed and f_b is the behavior strategy model.

The input of the behavior strategy model is the current system state s_t, and the output is the executed action a_t. By initializing the behavior strategy model with K different sets of model parameters, K behavior strategy models are obtained through training, and the expressiveness of the behavior strategy model is further enhanced by integrating the outputs of the K models (averaging the outputs of all K behavior strategy models, or randomly selecting one model's output).
The construction process of the action value function model is as follows:
acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The action value function model is constructed from the offline data set through fitted Q evaluation (FQE); its expression is shown in formula (3).

Q̂ = argmin_{f ∈ F} (1/N) Σ_{i=1}^{N} ( f(s_i, a_i) − y_i )²    (3)

where Q̂ is the action value function; f(s_i, a_i) is the action value function model being trained; F is the selected class of action value function models; y_i is the training target; N is the amount of data in the training data set; and i is an integer from 1 to N.

The training target y_i is shown in formula (4).

y_i = r_i + γ′ · Q̂_prev(s_{i+1}, a_{i+1}),  (s_i, a_i, r_i, s_{i+1}, a_{i+1}) ∈ B    (4)

where r_i is the reward value; γ′ is the reward discount factor, with γ′ < 1; Q̂_prev is the action value function obtained in the previous training iteration, evaluated at (s_{i+1}, a_{i+1}); s_i and a_i are the state and action at time i; s_{i+1} and a_{i+1} are the state and action at time i+1; and B is the offline data set.

The value function can further be estimated from the action value function, as shown in formula (5).

V_b(s_t) = E_{a ~ f_b(s_t)} [ Q(s_t, a) ]    (5)

where V_b(s_t) is the value function, the expectation is taken over actions sampled according to the behavior strategy, and Q(s_t, a) is the action value function.
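A minimal sketch of the FQE regression target in formula (4); the transition tuple layout and the `q_prev` callable are illustrative assumptions, not from the patent:

```python
import numpy as np

def fqe_targets(batch, q_prev, gamma=0.99):
    """One FQE iteration's regression targets:
    y_i = r_i + gamma * Q_prev(s_{i+1}, a_{i+1}).

    batch: iterable of offline transitions (s, a, r, s_next, a_next).
    q_prev: action value function from the previous training iteration.
    """
    return np.array([r + gamma * q_prev(s_next, a_next)
                     for (_, _, r, s_next, a_next) in batch])

# toy check with Q_prev == 0: the targets reduce to the immediate rewards
batch = [(0.0, 0.0, 1.0, 0.0, 0.0),
         (0.0, 0.0, 2.0, 0.0, 0.0)]
ys = fqe_targets(batch, q_prev=lambda s, a: 0.0)
```

A full FQE loop would alternate computing these targets and refitting the Q model (the least-squares step of formula (3)) until convergence.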
The preset optimization strategy comprises a trajectory sampling method, a trajectory pruning method, and a trajectory optimization method. The current system state is input into the pre-constructed optimization model for optimization processing through these three methods to obtain an optimal control action sequence (a*_0, …, a*_{H−1}), and the first action a*_0 of that sequence is then taken as the optimal control action recommendation.
And (3) constructing an optimization model:
and (3) jointly constructing a dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a finite time domain model predictive control framework to obtain an optimization model, wherein the expression of the optimization model is shown as a formula (6).
Wherein r istThe reward value at the moment t; vb(sH) Is a state sHThe value of the following value function. The constraint condition of the optimization problem is the dynamic characteristic model (r)t,st+1)=fm(st,at)。
At each step, the current state s of the system is given0By solving the above finite time domain optimization problem, an optimal control action sequence of length H can be obtained, where s0=sinit,sinitIs given as the characteristic value of the current state of the system.
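The finite-horizon objective in formula (6) can be evaluated for one candidate action sequence by rolling the learned dynamics forward; the following sketch uses hypothetical function signatures for the learned models:

```python
def score_sequence(s0, actions, f_m, v_b):
    """Evaluate the MPC objective sum_t r_t + V_b(s_H) for one
    candidate action sequence under the learned dynamics.

    f_m(s, a) -> (r, s_next): learned dynamic characteristic model.
    v_b(s)    -> float:       terminal value function estimate.
    """
    s, total = s0, 0.0
    for a in actions:
        r, s = f_m(s, a)       # one-step rollout through the model
        total += r             # accumulate predicted reward
    return total + v_b(s)      # add terminal value V_b(s_H)

# toy dynamics: reward equals the action, state accumulates the actions
f_m = lambda s, a: (a, s + a)
score = score_sequence(0.0, [1.0, 2.0, 3.0], f_m, v_b=lambda s: 0.5 * s)
```

Solving formula (6) then amounts to searching over action sequences for the one with the highest score, which is what the trajectory sampling, pruning, and optimization steps below implement.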
Specifically, the current state of the system is input to a pre-constructed optimization model for processing based on a preset optimization strategy, and a process of optimizing the recommended quantity of the control action is obtained, as shown in A1-A5.
A1: carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
Taking the system state s at the current moment as the starting point, trajectory sampling is carried out using the dynamic characteristic model f_m and the behavior strategy model f_b, and N control trajectories of length H are generated by simulation, where H is an integer greater than 1. The sampling method is a trajectory sampling method guided by the action value function.
Taking the generation of one trajectory as an example, at the t-th time step on the control trajectory, the current system state predicted by the dynamic characteristic model f_m is s_t. Define the mean μ_a(s_t) and standard deviation σ_a(s_t) of the action distribution generated by the behavior strategy f_b(s_t) at this point.
The mean μ_a(s_t) of the action distribution generated by the behavior strategy f_b(s_t) is shown in formula (7).

μ_a(s_t) = [μ_1(s_t), …, μ_{|A|}(s_t)]^T ∈ R^{|A|}    (7)

where μ_a(s_t) contains the mean of each dimension of the action distribution, A is the action space, and T denotes the transpose.

The standard deviation σ_a(s_t) of the action distribution generated by the behavior strategy f_b(s_t) is shown in formula (8).

σ_a(s_t) = [σ_1(s_t), …, σ_{|A|}(s_t)]^T ∈ R^{|A|}    (8)

where σ_a(s_t) contains the standard deviation of each dimension of the action distribution.
at time t, the control action is sampled by equations 9, 10 and 11.
Wherein the content of the first and second substances,for the action obtained by sampling, N represents a normal distribution and a constant σMScaling the standard deviation of the distribution of each dimension of motion by a standard deviation scaling factor to adjust the aggressiveness, σ, of the motion sampleM>0。
Wherein M istA set of m sampled actions; t is the time; h is the length of the control track.
Wherein Q isb(stA) is a function of action value;to be MtSubstituting each action into action value function Qb(stA) corresponding actions when the maximum result is obtained after calculation respectively; h is the length of the control track.
The action â_t that maximizes the action value function Q_b(s_t, a) is then mixed, according to a mixing coefficient β, with the control action a*_{t+1} at the corresponding time in the optimal control sequence (a*_0, …, a*_{H−1}) obtained in the previous solving process, giving the sampling action ā_t as shown in formula (12).

ā_t = β · â_t + (1 − β) · a*_{t+1}    (12)

where ā_t is the sampling action; β is the mixing coefficient; a*_{t+1} is the control action at time t+1 in the previous optimal sequence; and for the last step, the control action at time H is taken to be the previous sequence's control action at time H−1.
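The value-guided sampling and mixing steps of formulas (9) through (12) can be sketched as one function; the parameter names and the toy Q function below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_action(mu, sigma, q_fn, s_t, prev_plan_action,
                  m=16, sigma_scale=1.0, beta=0.7):
    """Draw m candidates from N(mu, sigma_scale * sigma), keep the one
    with the highest Q_b(s_t, a), then blend it with the previous
    solve's action for this step using the mixing coefficient beta."""
    cands = rng.normal(mu, sigma_scale * sigma, size=(m, len(mu)))  # M_t
    best = cands[np.argmax([q_fn(s_t, a) for a in cands])]          # eq. (11)
    return beta * best + (1.0 - beta) * prev_plan_action            # eq. (12)

a = sample_action(mu=np.zeros(2), sigma=np.ones(2),
                  q_fn=lambda s, a: -np.sum(a ** 2),  # toy Q favoring small actions
                  s_t=None, prev_plan_action=np.zeros(2))
```

A larger σ_M widens the search around the behavior policy, while β trades off the newly sampled action against warm-starting from the previous plan.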
A2: an original trajectory sequence of N control trajectories is obtained.
The action ā_t sampled in system state s_t is obtained for each of the N control trajectories; in state s_t, the reward r_t and the next-moment system state s_{t+1} of the same control trajectory can be calculated using the dynamic characteristic model f_m(s_t, a_t). Following this single-trajectory sampling method, with the current system state s as the starting point, the original trajectory sequences T = {T_1, …, T_N} of the N control trajectories are obtained by sampling. Each original trajectory sequence T_n contains H state-action pairs, as shown in formula (13).

T_n = { (s_0^n, a_0^n), (s_1^n, a_1^n), …, (s_{H−1}^n, a_{H−1}^n) }    (13)

where T_n is the n-th trajectory sequence; (s_t^n, a_t^n) is the state-action pair at time t in the n-th trajectory sequence; H is the length of the control trajectory, an integer greater than 1; t is the time, an integer from 0 to H−1; and N is the number of original trajectory sequences, an integer greater than or equal to 1.
A3: and selecting a target track sequence set meeting preset conditions from the original track sequences.
The trajectory pruning method is based on estimating the dynamic uncertainty. Undesirable trajectory sequences in the original trajectory sequences T = {T_1, …, T_N} obtained by the trajectory sampling method above are deleted using a dynamic-uncertainty measurement method (the trajectory pruning method), yielding a target trajectory sequence set that meets the preset conditions.
The dynamic uncertainty is measured as shown in formula (14).
Here disc(s, a) is the dynamic uncertainty, and f_m^l, l ∈ {1, …, K}, are the K dynamic characteristic models obtained by training; the indices i and j, like l, may each take any integer value from 1 to K. Formula (14) measures the largest pairwise disagreement among the ensemble models:

disc(s, a) = max_{i,j ∈ {1,…,K}} ‖ f_m^i(s, a) − f_m^j(s, a) ‖.
The dynamic uncertainty is computed for the original trajectory sequences T = {T_1, …, T_N}. For any trajectory sequence, if the uncertainty computed for any of its state-action pairs exceeds a given uncertainty threshold L, the entire trajectory sequence is removed from T = {T_1, …, T_N}, yielding the target trajectory sequence set T_f.
The value of the uncertainty threshold L is set by a technician according to an actual situation, and is not specifically limited in the present application.
If fewer than N_m trajectory sequences remain after this operation (N_m being a constant less than N), the uncertainty values within each trajectory sequence are summed, and the N_m trajectory sequences with the smallest uncertainty sums are taken as the target trajectory sequence set T_f.
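Step A3's two pruning rules (threshold the per-pair uncertainty, then fall back to the N_m least-uncertain trajectories if too few survive) can be sketched as follows, assuming an ensemble of K callable models; all names are illustrative:

```python
import numpy as np

def disc(models, s, a):
    """disc(s, a): largest pairwise disagreement among ensemble predictions,
    as in formula (14)."""
    preds = [m(s, a) for m in models]
    return max(np.linalg.norm(pi - pj)
               for i, pi in enumerate(preds) for pj in preds[i + 1:])

def prune(trajectories, models, L, N_m):
    """Keep trajectories whose every state-action pair stays under the
    uncertainty threshold L; if fewer than N_m survive, keep instead the
    N_m trajectories with the smallest summed uncertainty."""
    uncert = [[disc(models, s, a) for (s, a) in traj] for traj in trajectories]
    keep = [i for i, u in enumerate(uncert) if max(u) <= L]
    if len(keep) < N_m:
        keep = sorted(range(len(trajectories)),
                      key=lambda i: sum(uncert[i]))[:N_m]
    return [trajectories[i] for i in keep]
```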
A4: and carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
Specifically, the process of optimizing the track of each track in the target track sequence set to obtain the optimized action sequence is as follows:
firstly, the reward values of all tracks in the target track sequence set are summed to obtain a cumulative reward value.
In the cumulative-reward trajectory optimization method, the rewards r of each step in each trajectory of the target trajectory sequence set T_f obtained above are cumulatively summed to obtain the cumulative reward of each trajectory sequence; the expression is shown in formula (15).
Here R_f contains the cumulative reward of each trajectory sequence; R_1 is the cumulative reward of the first trajectory sequence in the target trajectory sequence set T_f, and R_{N_f} is the cumulative reward of the last one, with N_f the number of trajectory sequences in T_f. Formula (15): R_f = {R_1, …, R_{N_f}}.
For the target trajectory sequence set T_f obtained above, the formula for cumulatively summing the rewards r of each step in each trajectory is shown in formula (16).
Here R_n is the cumulative sum of the rewards r of each step in the nth trajectory of the target trajectory sequence set T_f; r_t is the reward value at time t in the trajectory; H is the trajectory length; t runs from 0 to H − 1; and V_b(s_H) is the value of the value function in state s_H. Formula (16):

R_n = Σ_{t=0}^{H−1} r_t + V_b(s_H).
Then, the actions of the trajectories in the target trajectory sequence set are weighted by the cumulative reward values to obtain the optimized action sequence; the expression of the optimized action sequence is shown in formula (17).
Here ã is the optimized action sequence; κ is the weighting factor; R_n is the cumulative sum of the rewards r of each step in the nth trajectory of T_f; a_t^n is the action at the tth time step of trajectory sequence T_n ∈ T_f; T_f is the target trajectory sequence set; H is the length of the control trajectory; and n takes integer values from 1 to the total number of target trajectory sequences. Formula (17):

ã_t = ( Σ_n e^{κ·R_n} · a_t^n ) / ( Σ_n e^{κ·R_n} ),  t = 0, …, H − 1.
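The reward-weighted averaging of formulas (15)-(17) can be sketched as below. The exponential weighting e^{κ·R_n} is an assumption consistent with the weighting factor κ described above; names are illustrative:

```python
import numpy as np

def optimize_actions(actions, returns, kappa=1.0):
    """actions: (N, H, action_dim) array of trajectory actions;
    returns: (N,) cumulative rewards R_n. Returns the optimized
    (H, action_dim) action sequence of formula (17)."""
    returns = np.asarray(returns, dtype=float)
    # subtract the max before exponentiating, for numerical stability
    w = np.exp(kappa * (returns - returns.max()))
    w /= w.sum()
    # weighted average over the trajectory axis
    return np.tensordot(w, np.asarray(actions), axes=(0, 0))
```

Larger κ concentrates the average on the highest-return trajectories; κ → 0 recovers a plain mean.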
A5: and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
The optimal control action sequence of the system in the current state s is obtained through formula (17), and the action at the current time in that sequence (the first action, ã_0) is returned as the optimal control action recommendation.
At the next system time step, the recommended optimized control action can again be obtained according to the calculation flow A1-A5.
S103: and executing the corresponding optimized control operation based on the optimized control action recommended quantity.
Different recommended optimized control actions lead to different optimized control operations being executed.
For example, if the control system is a wind power generation control system and the recommended optimal control action is a fan rotation speed of 1300 r/min, the executed optimal control operation controls the fan rotation speed to 1300 r/min.
Model training is carried out on the collected offline data set of the control system, an optimization model is built on a finite-horizon model predictive control framework, and the optimization model is solved with the trajectory optimization method to obtain the optimized control action, thereby realizing optimized control of the system.
Optionally, if the control task of the monitored control system changes, the recommended amount of the optimal control action is adjusted based on the target adaptive control strategy and/or the constraint control strategy.
For a given control task target, a new scenario or a new control task target may arise; that is, the new target differs from the target optimized by the real behavior strategy in the stage of generating the offline data. Even in this case, the technical scheme remains strongly adaptable to the new target.
To facilitate understanding of how the new target changes relative to the target optimized by the real behavior strategy in the offline-data generation stage, an example follows:
for example, suppose the initial control task target is to improve the power generation efficiency of a wind power generation control system. A changed target might be to keep the fan temperature of the wind power generation control system within a certain temperature range while still improving the power generation efficiency; this new target differs from the target (the initial control task target) optimized by the real behavior strategy in the offline-data generation stage.
Changes to the control optimization task target mainly fall into two types. The first is handled by the target adaptation control strategy: the original task target reward is adjusted so that the control strategy is optimized according to the new target. The second is handled by the constraint control strategy: limiting conditions on the system state are added in the control process to constrain the control action, while the original task reward function is kept unchanged.
Target adaptive control strategy:
for new control objectives, a new reward function r is definednew=fobj(s, a) calculating a jackpot for the sampled sequence of tracks using the new reward function. Specifically, R 'is modified through calculation of track cumulative rewards'n=∑tfobj(st,at) And changes the calculation of the optimal action sequence as shown in equation (18).
And (3) constraint control strategy:
for a control task added with system state constraint, not only the original optimization target needs to be ensured, but also a control strategy capable of meeting the constraint condition needs to be searched. The technical scheme has two ways to adapt to the control situation:
adding a penalty based on the constraint condition to the original reward function. The penalty function defining the constraint is fp(s), then the new track sequence jackpot calculation is R'n=∑t(α·rt+(1-α)·fp(st)),rtThe original goal is awarded and the calculation of the optimal sequence of actions is varied as shown in equation (19).
Solving equation (19) results in a control strategy that is adaptive to the constraints.
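The penalty-augmented return R'_n = Σ_t (α·r_t + (1 − α)·f_p(s_t)) of formula (19) can be sketched as follows; α and the penalty function f_p are illustrative assumptions:

```python
def penalized_return(states, rewards, f_p, alpha=0.8):
    """Mix the original per-step rewards with a constraint penalty on
    the visited states, per formula (19); alpha trades off the original
    task reward against the penalty."""
    return sum(alpha * r + (1.0 - alpha) * f_p(s)
               for s, r in zip(states, rewards))
```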
The second way adds a constraint penalty in the trajectory pruning stage. With f'_p(s) a penalty function defined by the constraint conditions, the uncertainty estimate is adjusted by this penalty term, as shown in formula (20).
Constraint conditions are taken into consideration in the track pruning stage, and the purpose of adapting to constraint control can be achieved without changing the subsequent calculation process.
In the embodiment of the application, training and learning of the optimization model are carried out on an offline data set; when facing a complex objective function and a nonlinear dynamical model, optimization control is carried out through the optimization model, improving the use efficiency of the data and the universality of the optimization control. Moreover, because the optimization model is constructed with a model predictive control framework, it does not need to be retrained even when facing a new control task target or a control task with additional constraints, which improves the adaptability and flexibility of the optimization control.
Referring to fig. 2, the process involved in S102, in which the current system state is input to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain a recommended optimized control action, mainly includes the following steps:
S201: Carrying out track sampling based on a dynamic characteristic model and the behavior strategy model to obtain N control tracks; N is an integer of 1 or more.
S202: an original trajectory sequence of N control trajectories is obtained.
S203: and selecting a target track sequence set meeting preset conditions from the original track sequence.
S204: and carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
S205: and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
The execution principle of S201-S205 is consistent with that of S102; reference may be made to the description above, which is not repeated here.
In the embodiment of the application, each track in the target track sequence set is optimized through a track optimization method, so that the purpose of obtaining an optimized action sequence is achieved.
Referring to fig. 3, a process of performing trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence in S204 mainly includes the following steps:
S301: And summing the reward values of all tracks in the target track sequence set to obtain a cumulative reward value.
S302: and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
The execution principle of S301-S302 is consistent with that of S204; reference may be made to the description above, which is not repeated here.
In the embodiment of the application, the reward values of all tracks in the target track sequence set are summed to obtain the accumulated reward value, and the action of all tracks in the target track sequence set is weighted and calculated through the accumulated reward value to achieve the purpose of obtaining the optimized action sequence.
Based on the optimization control method disclosed in fig. 1 in the foregoing embodiment, an optimization control apparatus is correspondingly disclosed in the embodiment of the present application, and as shown in fig. 4, the optimization control apparatus includes an obtaining unit 401, a processing unit 402, and an executing unit 403.
An obtaining unit 401 is configured to obtain a system current state of the control system.
The processing unit 402 is configured to input a current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy, so as to obtain an optimization control action recommendation amount; and the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework.
An executing unit 403, configured to execute a corresponding optimal control operation based on the optimal control action recommendation amount.
Further, the processing unit 402 includes a sampling module, a first obtaining module, a first selecting module, an optimizing module, and a second selecting module.
The sampling module is used for carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
The first acquisition module is used for acquiring an original track sequence of the N control tracks.
The first selection module is used for selecting a target track sequence set which meets preset conditions from the original track sequence.
And the optimization module is used for carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
And the second selection module is used for selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
Further, the optimization module includes a summation sub-module and a calculation sub-module.
And the summation submodule is used for summing the reward values of all the tracks in the target track sequence set to obtain the accumulated reward value.
And the calculation submodule is used for performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward values to obtain an optimized action sequence.
Further, for the construction process of the dynamic characteristic model, the processing unit 402 includes a second obtaining module and a first construction module.
The second acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The first construction module is used for constructing a dynamic characteristic model through the deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on an off-line data set.
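A minimal sketch of fitting such a dynamic characteristic model f_m(s, a) → (s_{t+1}, r_t) to an offline data set of transitions follows; a real implementation would use a deep-learning framework, and the one-hidden-layer network, learning rate, and epoch count here are illustrative assumptions:

```python
import numpy as np

def train_dynamics(S, A, R, S_next, hidden=32, lr=1e-2, epochs=500, seed=0):
    """Fit a one-hidden-layer tanh network mapping (s_t, a_t) to
    (s_{t+1}, r_t) by full-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    X = np.hstack([S, A])                      # inputs: (s_t, a_t)
    Y = np.hstack([S_next, R[:, None]])        # targets: (s_{t+1}, r_t)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, Y.shape[1])); b2 = np.zeros(Y.shape[1])
    n = len(X)
    for _ in range(epochs):
        Z = np.tanh(X @ W1 + b1)               # hidden activations
        err = (Z @ W2 + b2) - Y                # prediction error
        dZ = err @ W2.T * (1 - Z ** 2)         # backprop through tanh
        W2 -= lr * Z.T @ err / n; b2 -= lr * err.mean(0)
        W1 -= lr * X.T @ dZ / n;  b1 -= lr * dZ.mean(0)

    def f_m(s, a):
        z = np.tanh(np.hstack([s, a]) @ W1 + b1)
        out = z @ W2 + b2
        return out[:-1], float(out[-1])        # (s_next, r)
    return f_m
```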
Further, for the construction process of the behavior policy model, the processing unit 402 includes a third obtaining module and a second construction module.
The third acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system in a preset history.
And the second construction module is used for constructing the behavior strategy model through the deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
Further, for the construction process of the action value function model, the processing unit 402 includes a fourth obtaining module and a third construction module.
The fourth acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system in a preset history.
A third construction module, configured to construct the action value function model from the offline data set via fitted Q evaluation (FQE).
Further, the optimization control device further comprises an adjusting unit.
And the adjusting unit is used for adjusting the recommended quantity of the optimized control action based on the target adaptive control strategy and/or the constraint control strategy if the control task of the monitored control system changes.
In the embodiment of the application, the training and learning of the optimization model are carried out through the off-line data set, and when a complex objective function and a nonlinear dynamical model are faced, the optimization control is carried out through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. And moreover, the model predictive control framework is adopted to construct the optimization model, so that even if a new control task target or a control task with additional constraint is faced, retraining and learning of the optimization model are not needed, and the adaptability and the control flexibility of the optimization control method are improved.
The embodiment of the application also provides a storage medium, wherein the storage medium comprises stored instructions, and when the instructions are executed, the equipment where the storage medium is located is controlled to execute the optimization control method.
The embodiment of the present application further provides an electronic device, a schematic structural diagram of which is shown in fig. 5. It specifically includes a memory 501 storing one or more instructions 502, which are configured to be executed by one or more processors 503 so as to perform the above optimization control method.
The specific implementation procedures of the above embodiments, and derivatives thereof, are within the scope of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that, as will be apparent to those skilled in the art, numerous modifications and adaptations can be made without departing from the principles of the present application and such modifications and adaptations are intended to be considered within the scope of the present application.
Claims (10)
1. An optimization control method, characterized in that the method comprises:
acquiring the current system state of a control system;
inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and executing corresponding optimized control operation based on the optimized control action recommended quantity.
2. The method according to claim 1, wherein the inputting the current state of the system into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain a recommended amount of optimization control actions comprises:
carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer greater than or equal to 1;
acquiring an original track sequence of the N control tracks;
selecting a target track sequence set which meets a preset condition from the original track sequence;
carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence;
and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
3. The method according to claim 2, wherein the performing trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence comprises:
summing the reward values of all tracks in the target track sequence set to obtain a cumulative reward value;
and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
4. The method of claim 1, wherein the dynamic characteristics model is constructed by a process comprising:
acquiring an offline data set of a control system; the off-line data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the dynamic characteristic model through a deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on the off-line data set.
5. The method of claim 1, wherein the behavior strategy model is constructed by a process comprising:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the behavior strategy model through a deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
6. The method of claim 1, wherein the act-value function model is constructed by:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
the action value function model is constructed from the offline data set by fitted Q evaluation (FQE).
7. The method of claim 1, further comprising:
and if the monitored control task of the control system changes, adjusting the recommended quantity of the optimized control action based on a target adaptive control strategy and/or a constraint control strategy.
8. An optimization control apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring the current system state of the control system;
the processing unit is used for inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain the recommended quantity of the optimization control action; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and the execution unit is used for executing corresponding optimized control operation based on the optimized control action recommended quantity.
9. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform the optimization control method of any one of claims 1 to 7.
10. An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the optimization control method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210277509.5A CN114625091A (en) | 2022-03-21 | 2022-03-21 | Optimization control method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210277509.5A CN114625091A (en) | 2022-03-21 | 2022-03-21 | Optimization control method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114625091A true CN114625091A (en) | 2022-06-14 |
Family
ID=81904254
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210277509.5A Pending CN114625091A (en) | 2022-03-21 | 2022-03-21 | Optimization control method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114625091A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950690A (en) * | 2019-05-15 | 2020-11-17 | 天津科技大学 | Efficient reinforcement learning strategy model with self-adaptive capacity |
CN112884130A (en) * | 2021-03-16 | 2021-06-01 | 浙江工业大学 | SeqGAN-based deep reinforcement learning data enhanced defense method and device |
CN113363997A (en) * | 2021-05-28 | 2021-09-07 | 浙江大学 | Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning |
CN113392935A (en) * | 2021-07-09 | 2021-09-14 | 浙江工业大学 | Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism |
CN113759708A (en) * | 2021-02-09 | 2021-12-07 | 京东城市(北京)数字科技有限公司 | System optimization control method and device and electronic equipment |
CN113885607A (en) * | 2021-10-20 | 2022-01-04 | 京东城市(北京)数字科技有限公司 | Steam temperature control method and device, electronic equipment and computer storage medium |
CN113935463A (en) * | 2021-09-30 | 2022-01-14 | 南方电网数字电网研究院有限公司 | Microgrid controller based on artificial intelligence control method |
CN114011564A (en) * | 2021-10-22 | 2022-02-08 | 内蒙古京能康巴什热电有限公司 | Coal mill control optimization method based on model offline planning |
CN114065452A (en) * | 2021-11-17 | 2022-02-18 | 国家电网有限公司华东分部 | Power grid topology optimization and power flow control method based on deep reinforcement learning |
-
2022
- 2022-03-21 CN CN202210277509.5A patent/CN114625091A/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950690A (en) * | 2019-05-15 | 2020-11-17 | 天津科技大学 | Efficient reinforcement learning strategy model with self-adaptive capacity |
CN113759708A (en) * | 2021-02-09 | 2021-12-07 | 京东城市(北京)数字科技有限公司 | System optimization control method and device and electronic equipment |
CN112884130A (en) * | 2021-03-16 | 2021-06-01 | 浙江工业大学 | SeqGAN-based deep reinforcement learning data enhanced defense method and device |
CN113363997A (en) * | 2021-05-28 | 2021-09-07 | 浙江大学 | Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning |
CN113392935A (en) * | 2021-07-09 | 2021-09-14 | 浙江工业大学 | Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism |
CN113935463A (en) * | 2021-09-30 | 2022-01-14 | 南方电网数字电网研究院有限公司 | Microgrid controller based on artificial intelligence control method |
CN113885607A (en) * | 2021-10-20 | 2022-01-04 | 京东城市(北京)数字科技有限公司 | Steam temperature control method and device, electronic equipment and computer storage medium |
CN114011564A (en) * | 2021-10-22 | 2022-02-08 | 内蒙古京能康巴什热电有限公司 | Coal mill control optimization method based on model offline planning |
CN114065452A (en) * | 2021-11-17 | 2022-02-18 | 国家电网有限公司华东分部 | Power grid topology optimization and power flow control method based on deep reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kumar et al. | A deep learning architecture for predictive control | |
US9298172B2 (en) | Method and apparatus for improved reward-based learning using adaptive distance metrics | |
JP5448841B2 (en) | Method for computer-aided closed-loop control and / or open-loop control of technical systems, in particular gas turbines | |
Arif et al. | Incorporation of experience in iterative learning controllers using locally weighted learning | |
Goulart et al. | Autonomous pH control by reinforcement learning for electroplating industry wastewater | |
CN111260124A (en) | Chaos time sequence prediction method based on attention mechanism deep learning | |
EP3704550B1 (en) | Generation of a control system for a target system | |
Reyes-Reyes et al. | Bounded neuro-control position regulation for a geared DC motor | |
CN112445136B (en) | Thickener prediction control method and system based on continuous time neural network | |
Marwala | Finite-element-model Updating Using the Response-surface Method | |
CN111930010A (en) | LSTM network-based general MFA controller design method | |
Marusak | A numerically efficient fuzzy MPC algorithm with fast generation of the control signal | |
Agyeman et al. | LSTM-based model predictive control with discrete actuators for irrigation scheduling | |
Serrano-Pérez et al. | Offline robust tuning of the motion control for omnidirectional mobile robots | |
Saber et al. | Real-time optimization for an AVR system using enhanced Harris Hawk and IIoT | |
CN110454322B (en) | Water turbine speed regulation control method, device and system based on multivariable dynamic matrix | |
CN114625091A (en) | Optimization control method and device, storage medium and electronic equipment | |
Kordon | Hybrid intelligent systems for industrial data analysis | |
CN111356959B (en) | Method for computer-aided control of a technical system | |
CN114839861A (en) | Intelligent PID controller online optimization method and system | |
JP2023106043A (en) | Driving assist system, driving assist method, and program | |
Yang et al. | Intelligent Forecasting System Using Grey Model Combined with Neural Network. | |
Kocijan et al. | Application of Gaussian processes to the modelling and control in process engineering | |
Kumar et al. | Architecture, performance and stability analysis of a formula-based fuzzy I− fuzzy P− fuzzy D controller | |
Hachiya et al. | Efficient sample reuse in EM-based policy search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |