CN114625091A - Optimization control method and device, storage medium and electronic equipment - Google Patents

Optimization control method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN114625091A
CN114625091A
Authority
CN
China
Prior art keywords
control
optimization
model
action
constructed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210277509.5A
Other languages
Chinese (zh)
Inventor
朱翔宇
殷宏磊
徐浩然
郑宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd filed Critical Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202210277509.5A priority Critical patent/CN114625091A/en
Publication of CN114625091A publication Critical patent/CN114625091A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 Programme-control systems
    • G05B19/02 Programme-control systems electric
    • G05B19/418 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41885 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by modeling, simulation of the manufacturing system
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/30 Nc systems
    • G05B2219/32 Operator till task planning
    • G05B2219/32339 Object oriented modeling, design, analysis, implementation, simulation language
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses an optimization control method and device, a storage medium, and an electronic device. The optimization model is trained and learned on an offline data set, and optimization control is performed through the optimization model even in the face of a complex objective function and a nonlinear dynamics model, which improves the efficiency of data use and the generality of the optimization control. Moreover, because the optimization model is constructed within a model predictive control framework, it does not need to be retrained even in the face of a new control task objective or a control task with additional constraints, which improves the adaptability and control flexibility of the optimization control method.

Description

Optimization control method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of automatic control technologies, and in particular, to an optimization control method, apparatus, storage medium, and electronic device.
Background
Production processes across industries involve a great number of system operation control tasks, such as robot control, automatic operation systems for agricultural machinery, automatic control in intelligent manufacturing, and various operation control systems in industrial fields such as energy, chemical engineering, and metallurgy. Optimizing the control of these systems improves resource utilization; reduces waste of time, materials, and energy; strengthens industrial competitiveness; and serves the green development goals of energy conservation and emission reduction, which is of great significance to the development and progress of these industries.
Conventional optimization control methods in industrial control applications include the Proportional-Integral-Derivative (PID) controller and the Model Predictive Control (MPC) controller.
Conventional optimization control methods perform poorly on the optimization problems of complex control systems. On the one hand, their limited solving capability restricts the optimization effect achievable on increasingly complex control systems. On the other hand, they make no effective use of the massive data accumulated in a control system and, at the model design stage, depend heavily on human experience, theoretical derivation, or a simulation environment consistent with the real environment; in the face of complex objectives and nonlinear dynamics models, solving is difficult and inefficient, so these methods lack generality. In addition, offline policy learning for a control system requires a large amount of computing resources, which limits its application in scenarios where computing resources are scarce; such methods also adapt poorly to changes in the control task objective, giving poor control flexibility.
Therefore, the conventional optimization control method has poor generality and poor control flexibility.
Disclosure of Invention
In view of this, the present application discloses an optimization control method, an optimization control device, a storage medium, and an electronic apparatus, which aim to improve the versatility, the adaptability, and the control flexibility of the optimization control method.
To achieve this purpose, the technical solutions of the present application are as follows:
the first aspect of the present application discloses an optimization control method, including:
acquiring the current system state of a control system;
inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and executing corresponding optimized control operation based on the optimized control action recommended quantity.
Preferably, the inputting the current state of the system into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain a recommended amount of the optimization control action includes:
carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer greater than or equal to 1;
acquiring an original track sequence of the N control tracks;
selecting a target track sequence set which meets a preset condition from the original track sequence;
carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence;
and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
Preferably, the performing the trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence includes:
summing the reward values of all tracks in the target track sequence set to obtain an accumulated reward value;
and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
Preferably, the process of constructing the dynamic characteristics model includes:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the dynamic characteristic model through a deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on the off-line data set.
Preferably, the construction process of the behavior strategy model includes:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the behavior strategy model through a deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
Preferably, the process of constructing the action value function model includes:
acquiring an offline data set of a control system; the off-line data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
the action value function model is constructed from the offline data set and fitted Q evaluation (FQE).
Preferably, the method further comprises the following steps:
and if the monitored control task of the control system changes, adjusting the recommended quantity of the optimized control action based on a target adaptive control strategy and/or a constraint control strategy.
A second aspect of the present application discloses an optimization control apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the current system state of the control system;
the processing unit is used for inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain the recommended quantity of the optimization control action; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and the execution unit is used for executing corresponding optimized control operation based on the optimized control action recommended quantity.
A third aspect of the present application discloses a storage medium, which includes stored instructions, wherein when the instructions are executed, a device in which the storage medium is located is controlled to execute the optimization control method according to any one of the first aspect.
A fourth aspect of the present application discloses an electronic device, comprising a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the optimization control method according to any one of the first aspect.
According to the technical solutions above, the optimization control method and device, storage medium, and electronic device obtain the current system state of the control system; input the current system state into the pre-constructed optimization model for processing based on the preset optimization strategy to obtain the optimized control action recommendation, the optimization model being obtained by jointly constructing the pre-constructed dynamic characteristics model, the pre-constructed behavior strategy model, and the pre-constructed action value function model through a model predictive control framework; and execute the corresponding optimized control operation based on the optimized control action recommendation. Based on this scheme, the optimization model is trained and learned on the offline data set, and optimization control is performed through the optimization model even in the face of a complex objective function and a nonlinear dynamics model, which improves the efficiency of data use and the generality of the optimization control. Moreover, because the optimization model is constructed with the model predictive control framework, it does not need to be retrained even in the face of a new control task objective or a control task with added constraints, which improves the adaptability and control flexibility of the optimization control.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of an optimization control method disclosed in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating obtaining recommended optimal control actions according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an optimized action sequence disclosed in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an optimization control device disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As can be seen from the background art, the existing optimization control method has poor versatility and poor control flexibility.
In order to solve the above problems, embodiments of the present application disclose an optimization control method, an optimization control device, a storage medium, and an electronic device, which achieve the purpose of improving the versatility, the adaptability, and the control flexibility of the optimization control. The specific implementation is specifically illustrated by the following examples.
Referring to fig. 1, a schematic flow chart of an optimization control method disclosed in an embodiment of the present application is shown, where the optimization control method mainly includes the following steps:
s101: and acquiring the current system state of the control system.
In S101, the control system includes a robot control system, an agricultural machine automatic operation control system, a thermal power generation boiler control system, a wind power generation control system, and the like.
The current system state characterizes the current operating condition of the control system.
The current system state differs across control systems. If the control system is a thermal power generation control system, the current system state includes the air temperature state, the air pressure state, the water temperature state, the water pressure state, and the like.
S102: inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; and the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework.
In S102, the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model, and a pre-constructed action value function model through a finite time domain model predictive control framework.
The specific dynamic characteristic model is constructed as follows:
firstly, acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The offline data sets corresponding to different control systems are different. If the control system is a thermal power generation control system, the offline data set includes historical air temperature data, historical air pressure data, historical water temperature data, historical water pressure data, and the like.
The system characteristic data corresponding to different control systems are also different. If the control system is a thermal power generation control system, the system characteristic data includes air temperature data, air pressure data, water temperature data, water pressure data, and the like.
The preset history period is a period of time during which a history data record exists before the current time. The predetermined historical period may be 1 to 2 years or several months, determined primarily by the time of accumulation of historical data in the control system.
And then, constructing a dynamic characteristic model through the deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on an off-line data set.
The dynamic characteristics model is constructed with a deep neural network and trained on the collected offline data set of the control system; its expression is shown in formula (1):

$(r_t, s_{t+1}) = f_m(s_t, a_t)$  (1)

where $r_t$ is the currently obtained reward value, $s_{t+1}$ is the system state at the next moment, $s_t$ is the current system state, $a_t$ is the executed control action, and $f_m$ is the dynamic characteristics model.

The inputs of the dynamic characteristics model are the current system state $s_t$ and the executed control action $a_t$; its outputs are the currently obtained reward value $r_t$ and the next-moment system state $s_{t+1}$. By initializing the dynamic characteristics model with different model parameters, K dynamic characteristics models are obtained through training, where K is an integer greater than or equal to 1. The expressiveness of the dynamic characteristics model is further enhanced by integrating the outputs of the K models (averaging the outputs of all the dynamic characteristics models, or randomly selecting the output of one model).
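For illustration only (not part of the patent text), the following is a minimal sketch of such a K-member dynamics-model ensemble in Python with PyTorch; the names DynamicsModel and ensemble_predict, the layer sizes, and the state/action dimensions are assumptions, not identifiers from the application.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """One ensemble member f_m: (s_t, a_t) -> (r_t, s_{t+1})."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1 + state_dim),  # reward + next state
        )

    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :1], out[..., 1:]  # (r_t, s_{t+1})

# The K members differ only in their random initialization, as in the text.
K = 5
ensemble = [DynamicsModel(state_dim=8, action_dim=2) for _ in range(K)]

def ensemble_predict(s, a):
    """Average the K outputs (the first of the two integration options above)."""
    preds = [m(s, a) for m in ensemble]
    r = torch.stack([p[0] for p in preds]).mean(dim=0)
    s_next = torch.stack([p[1] for p in preds]).mean(dim=0)
    return r, s_next
```

Randomly indexing into `ensemble` instead of averaging would implement the second integration option named above.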
The construction process of the specific behavior strategy model is as follows:
acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
And constructing a behavior strategy model through the deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on an off-line data set.
The behavior strategy model is constructed with a deep neural network and trained on the collected offline data set of the control system; its expression is shown in formula (2):

$a_t = f_b(s_t)$  (2)

where $a_t$ is the control action to be executed and $f_b$ is the behavior strategy model.

The input of the behavior strategy model is the current system state $s_t$, and its output is the executed action $a_t$. By initializing the behavior strategy model with different model parameters, K behavior strategy models are obtained through training, and the expressiveness of the behavior strategy model is further enhanced by integrating the outputs of the K models (averaging the outputs of all the behavior strategy models, or randomly selecting the output of one model).
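Analogously, and again only as an illustrative assumption rather than disclosed code, the behavior strategy model can be sketched as a network fitted by regression to the offline (state, action) pairs; the class name, layer sizes, and mean-squared-error objective are assumptions:

```python
import torch
import torch.nn as nn

class BehaviorPolicy(nn.Module):
    """f_b: s_t -> a_t, fitted to the offline data by behavior cloning."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, s):
        return self.net(s)

def train_behavior_policy(policy, loader, epochs=10):
    """Offline training on batches of (state, action) pairs from the data set."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        for s, a in loader:
            loss = nn.functional.mse_loss(policy(s), a)
            opt.zero_grad(); loss.backward(); opt.step()
```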
The construction process of the action value function model is as follows:
acquiring an offline data set of a control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The action value function model is constructed from the offline data set through Fitted Q Evaluation (FQE); its expression is shown in formula (3):

$\hat{Q} = \arg\min_{f \in F} \frac{1}{N} \sum_{i=1}^{N} \left( f(s_i, a_i) - y_i \right)^2$  (3)

where $\hat{Q}$ is the action value function; $f(s_i, a_i)$ is the action value function model being trained; F is the selected class of action value function models; $y_i$ is the training target; N is the data volume of the training data set; and i is an integer from 1 to N.

$y_i$ is given by formula (4):

$y_i = r_i + \gamma' \hat{Q}_{k-1}(s_{i+1}, a_{i+1}), \quad (s_i, a_i, r_i, s_{i+1}, a_{i+1}) \in B$  (4)

where $r_i$ is the reward value; $\gamma'$ is the reward discount factor, with $\gamma' < 1$; $\hat{Q}_{k-1}(s_{i+1}, a_{i+1})$ is the value, at $(s_{i+1}, a_{i+1})$, of the action value function obtained in the last training iteration; $s_i$ is the state at time i; $a_i$ is the action at time i; $s_{i+1}$ is the state at time i+1; $a_{i+1}$ is the action at time i+1; and B is the offline data set.

A value function can further be estimated from the action value function, as shown in formula (5):

$V_b(s_t) = \mathbb{E}_{a \sim f_b(s_t)} \left[ Q(s_t, a) \right]$  (5)

where $V_b(s_t)$ is the value function, the expectation is taken over actions sampled according to the behavior strategy, and $Q(s_t, a)$ is the action value function.
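The FQE construction of formulas (3) and (4) amounts to repeated supervised regression toward a bootstrapped target. The sketch below is one plausible reading of that procedure, assuming a generic Q-network that takes concatenated (s, a) input; the iteration count, learning rate, and network interface are assumptions:

```python
import copy
import torch
import torch.nn as nn

def fitted_q_evaluation(q_net, dataset, gamma=0.99, iterations=50):
    """Each iteration regresses Q_k toward y_i = r_i + gamma' * Q_{k-1}(s_{i+1}, a_{i+1})."""
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    for _ in range(iterations):
        q_prev = copy.deepcopy(q_net)  # frozen Q_{k-1} from the last iteration
        for s, a, r, s_next, a_next in dataset:  # transitions from offline set B
            with torch.no_grad():
                y = r + gamma * q_prev(torch.cat([s_next, a_next], dim=-1))
            loss = nn.functional.mse_loss(q_net(torch.cat([s, a], dim=-1)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return q_net
```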
The preset optimization strategies include a trajectory sampling method, a trajectory pruning method, and a trajectory optimization method. The current system state is input into the pre-constructed optimization model and optimized through the trajectory sampling method, the trajectory pruning method, and the trajectory optimization method to obtain an optimal control action sequence $(a_0^*, a_1^*, \ldots, a_{H-1}^*)$; the action $a_0^*$ at the current moment is then selected from the optimal control action sequence as the optimized control action recommendation.
Construction of the optimization model:

The optimization model is obtained by jointly constructing the dynamic characteristics model, the pre-constructed behavior strategy model, and the pre-constructed action value function model through a finite-horizon model predictive control framework; its expression is shown in formula (6):

$\max_{a_0, \ldots, a_{H-1}} \; \sum_{t=0}^{H-1} r_t + V_b(s_H) \quad \text{s.t.} \; (r_t, s_{t+1}) = f_m(s_t, a_t), \; s_0 = s_{init}$  (6)

where $r_t$ is the reward value at time t and $V_b(s_H)$ is the value of the value function in state $s_H$. The constraint of the optimization problem is the dynamic characteristics model $(r_t, s_{t+1}) = f_m(s_t, a_t)$.

At each step, given the current system state $s_0$, an optimal control action sequence of length H can be obtained by solving the above finite-horizon optimization problem, where $s_0 = s_{init}$ and $s_{init}$ is the characteristic value of the current system state.
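To make the objective of formula (6) concrete, the sketch below scores one candidate action sequence by rolling it out through the learned dynamics and adding the terminal value; ensemble_predict is the assumed helper from the dynamics sketch above, and v_b stands for an assumed callable estimating $V_b$:

```python
def rollout_return(s0, actions, v_b):
    """Objective of the finite-horizon problem in formula (6):
    sum of predicted rewards over the horizon plus terminal value V_b(s_H)."""
    s, total = s0, 0.0
    for a in actions:                  # H candidate actions a_0 .. a_{H-1}
        r, s = ensemble_predict(s, a)  # constraint: (r_t, s_{t+1}) = f_m(s_t, a_t)
        total = total + r
    return total + v_b(s)              # s is now s_H
```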
Specifically, the process of inputting the current system state into the pre-constructed optimization model for processing based on the preset optimization strategy, to obtain the optimized control action recommendation, is shown in A1-A5.
A1: carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
Wherein, the state s of the system at the current time is used as a starting point, and a dynamic characteristic model f is utilizedmAnd behavior strategy model fbAnd carrying out track sampling, and simulating to generate N control tracks with the length of H, wherein the value of H is an integer greater than 1, and the sampling method is a track sampling method under the guidance of an action value function.
Taking the generation of one of the trajectories as an example: at the t-th time step on the control trajectory, the current system state predicted by the dynamic characteristics model $f_m$ is $s_t$. Define $\mu_a(s_t)$ and $\sigma_a(s_t)$ as the mean and standard deviation of the action distribution generated by the behavior strategy $f_b(s_t)$ at this time.

The mean $\mu_a(s_t)$ of the action distribution generated by the behavior strategy $f_b(s_t)$ is shown in formula (7):

$\mu_a(s_t) = \left[ \mu_a^{(1)}(s_t), \ldots, \mu_a^{(|A|)}(s_t) \right]^T$  (7)

where $\mu_a(s_t)$ contains the mean of each dimension of the action distribution, A is the action space, and T is the transpose symbol.

The standard deviation $\sigma_a(s_t)$ of the action distribution generated by the behavior strategy $f_b(s_t)$ is shown in formula (8):

$\sigma_a(s_t) = \left[ \sigma_a^{(1)}(s_t), \ldots, \sigma_a^{(|A|)}(s_t) \right]^T$  (8)

where $\sigma_a(s_t)$ contains the standard deviation of each dimension of the action distribution.
at time t, the control action is sampled by equations 9, 10 and 11.
Figure BDA0003556154490000091
Wherein the content of the first and second substances,
Figure BDA0003556154490000092
for the action obtained by sampling, N represents a normal distribution and a constant σMScaling the standard deviation of the distribution of each dimension of motion by a standard deviation scaling factor to adjust the aggressiveness, σ, of the motion sampleM>0。
Figure BDA0003556154490000093
Wherein M istA set of m sampled actions; t is the time; h is the length of the control track.
Figure BDA0003556154490000094
Wherein Q isb(stA) is a function of action value;
Figure BDA0003556154490000095
to be MtSubstituting each action into action value function Qb(stA) corresponding actions when the maximum result is obtained after calculation respectively; h is the length of the control track.
Selecting an action value function Qb(stA) maximum action
Figure BDA0003556154490000096
Is transported last timeCarrying out the optimal control sequence obtained by solving the solving process
Figure BDA0003556154490000097
Control action of corresponding time
Figure BDA0003556154490000098
And mixing according to the mixing coefficient beta to obtain sampling action
Figure BDA0003556154490000099
Figure BDA00035561544900000910
The formula (2) is shown in the formula (12).
Figure BDA00035561544900000911
Wherein the content of the first and second substances,
Figure BDA00035561544900000912
is a sampling action; beta is a mixing coefficient;
Figure BDA00035561544900000913
a control action at time t + 1;
Figure BDA00035561544900000914
the control action at the time H;
Figure BDA00035561544900000915
the control action at the time H-1.
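A hedged sketch of the sampling step in formulas (9) to (12) follows; m, sigma_M, and beta are the quantities named in the text, prev_plan is an assumed tensor holding the last solving process's optimal sequence, and the batched q_b interface is an assumption:

```python
import torch

def sample_action(s_t, t, mu, sigma, q_b, prev_plan, m=64, sigma_M=1.5, beta=0.7):
    """Value-guided sampling: draw m candidates (formulas 9 and 10), take the
    Q_b-argmax over M_t (formula 11), and mix with the shifted previous
    solution (formula 12)."""
    dist = torch.distributions.Normal(mu(s_t), sigma_M * sigma(s_t))
    candidates = dist.sample((m,))                 # the set M_t of m actions
    q_vals = q_b(s_t.expand(m, -1), candidates)    # Q_b(s_t, a) for each a in M_t
    a_tilde = candidates[q_vals.argmax()]          # argmax action from M_t
    # Shift the previous plan by one step; at t = H-1 reuse its last action.
    a_prev = prev_plan[min(t + 1, len(prev_plan) - 1)]
    return beta * a_tilde + (1.0 - beta) * a_prev  # formula (12)
```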
A2: an original trajectory sequence of N control trajectories is obtained.
For each of the N control trajectories, the action $a_t$ sampled in system state $s_t$ is obtained; in state $s_t$, the reward $r_t$ and the system state $s_{t+1}$ at the next moment of the control trajectory can be calculated with the dynamic characteristics model $f_m(s_t, a_t)$. Following this sampling method for a single control trajectory, and taking the current system state s as the starting point, the original trajectory sequences $T = \{T_1, \ldots, T_N\}$ of the N control trajectories are obtained by sampling. Each original trajectory sequence $T_n$ contains H state-action pairs, as shown in formula (13):

$T_n = \left\{ (s_0^n, a_0^n), (s_1^n, a_1^n), \ldots, (s_{H-1}^n, a_{H-1}^n) \right\}$  (13)

where $T_n$ is the n-th trajectory sequence; $(s_t^n, a_t^n)$ is the state-action pair at time t in the n-th trajectory sequence; H is the length of the control trajectory and is an integer greater than 1; t is the time, an integer from 0 to H-1; and N is the number of original trajectory sequences, an integer greater than or equal to 1.
A3: and selecting a target track sequence set meeting preset conditions from the original track sequences.
The trajectory pruning method is based on estimating dynamics uncertainty. For the original trajectory sequences $T = \{T_1, \ldots, T_N\}$ obtained by the trajectory sampling method above, undesirable trajectory sequences are deleted with a dynamics-uncertainty measurement (the trajectory pruning method), yielding a set of target trajectory sequences that meets the preset condition.

The dynamics uncertainty is measured as shown in formula (14):

$\mathrm{disc}(s, a) = \max_{i, j \in \{1, \ldots, K\}} \left\| f_m^i(s, a) - f_m^j(s, a) \right\|$  (14)

where disc(s, a) is the dynamics uncertainty; $f_m^l$, $l \in \{1, \ldots, K\}$, are the K dynamic characteristics models obtained by training; and i and j each range over the integers from 1 to K.

The dynamics uncertainty of every state-action pair in the original trajectory sequences $T = \{T_1, \ldots, T_N\}$ is calculated with this measurement. For any trajectory sequence, if the uncertainty value of any contained state-action pair is greater than a given uncertainty threshold L, the whole trajectory sequence is removed from $T = \{T_1, \ldots, T_N\}$, yielding the set of target trajectory sequences $T_f$.

The value of the uncertainty threshold L is set by a technician according to the actual situation and is not specifically limited in the present application.

If fewer than $N_m$ trajectory sequences remain after this operation ($N_m$ is a constant smaller than N), the uncertainty values within each trajectory sequence are summed, and the first $N_m$ trajectory sequences with the smallest uncertainty sums are taken as the set of target trajectory sequences $T_f$.
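A sketch of the disagreement measurement and pruning rule of formula (14), assuming the K-member ensemble from the earlier dynamics sketch; L and N_m are the threshold and fallback count named in the text:

```python
import itertools
import torch

def disagreement(s, a, ensemble):
    """disc(s, a): maximum pairwise gap between ensemble predictions (formula 14)."""
    preds = [torch.cat(m(s, a), dim=-1) for m in ensemble]  # concat (r, s') per member
    return max(torch.norm(p_i - p_j).item()
               for p_i, p_j in itertools.combinations(preds, 2))

def prune(trajectories, ensemble, L, N_m):
    """Drop any trajectory containing a pair with disc > L; if fewer than N_m
    survive, keep the N_m trajectories with the smallest summed disc."""
    scored = [(traj, [disagreement(s, a, ensemble) for s, a in traj])
              for traj in trajectories]
    kept = [traj for traj, d in scored if max(d) <= L]
    if len(kept) < N_m:
        kept = [traj for traj, _ in sorted(scored, key=lambda x: sum(x[1]))[:N_m]]
    return kept
```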
A4: and carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
Specifically, the process of performing trajectory optimization on each trajectory in the set of target trajectory sequences to obtain the optimized action sequence is as follows.

First, the reward values of each trajectory in the set of target trajectory sequences are summed to obtain the cumulative reward values.

With the cumulative-reward trajectory optimization method, the per-step rewards r in each trajectory of the set of target trajectory sequences $T_f$ are accumulated and summed to obtain the cumulative reward of each trajectory sequence, as shown in formula (15):

$R_f = \left\{ R_1, \ldots, R_{|T_f|} \right\}$  (15)

where $R_f$ contains the cumulative reward of each trajectory sequence, $R_1$ is the cumulative reward of the first trajectory sequence in the set of target trajectory sequences $T_f$, and $R_{|T_f|}$ is the cumulative reward of the last trajectory sequence in $T_f$.

The cumulative sum of the per-step rewards r in each trajectory of the set of target trajectory sequences $T_f$ is calculated as shown in formula (16):

$R_n = \sum_{t=0}^{H-1} r_t + V_b(s_H)$  (16)

where $R_n$ is the cumulative reward sum of trajectory $T_n$ in $T_f$; $r_t$ is the reward value at time t in the trajectory; H is the trajectory length; t ranges from 0 to H-1; and $V_b(s_H)$ is the value of the value function in state $s_H$.

Then, the actions of each trajectory in the set of target trajectory sequences are weighted by the cumulative reward values to obtain the optimized action sequence, as shown in formula (17):

$a_t^* = \frac{\sum_{T_n \in T_f} e^{\kappa R_n} \cdot a_t^n}{\sum_{T_n \in T_f} e^{\kappa R_n}}, \quad t = 0, \ldots, H-1$  (17)

where $(a_0^*, \ldots, a_{H-1}^*)$ is the optimized action sequence; κ is a weighting factor; $R_n$ is the cumulative reward sum of trajectory $T_n$; $a_t^n$ is the action at the t-th time step of trajectory sequence $T_n \in T_f$; $T_f$ is the set of target trajectory sequences; H is the length of the control trajectory; and n is an integer from 1 to the total number of target trajectory sequences.
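A sketch of the reward-weighted averaging of formula (17); the exponential (softmax) weights follow the reconstruction above, with kappa the weighting factor:

```python
import torch

def optimize_actions(action_seqs, returns, kappa=1.0):
    """action_seqs: [n_traj, H, action_dim] actions of the trajectories in T_f;
    returns: [n_traj] cumulative rewards R_n from formula (16).
    Returns the optimized action sequence a*_{0:H-1} of formula (17)."""
    w = torch.softmax(kappa * returns, dim=0)           # e^{kappa R_n} / sum e^{kappa R_n}
    return (w[:, None, None] * action_seqs).sum(dim=0)  # per-step weighted average
```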
A5: and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
The optimal control action sequence $(a_0^*, \ldots, a_{H-1}^*)$ of the system in the current state s is obtained through formula (17), and the action at the current moment in the optimal control action sequence (the first action, $a_0^*$) is returned as the optimized control action recommendation.
At the next moment of the system, the optimized control action recommendation can again be obtained according to the calculation flow A1-A5.
S103: and executing the corresponding optimized control operation based on the optimized control action recommended quantity.
Different optimized control action recommendations correspond to different optimized control operations being executed.
For example, if the control system is a wind power generation control system and the optimized control action recommendation is a fan rotation speed of 1300 r/min, the executed optimized control operation is to control the fan rotation speed to 1300 r/min.
Model training is carried out through the collected offline data set of the control system, an optimization model is built based on a finite time domain model prediction control framework, the optimization model is solved by using a track optimization method to obtain an optimized control action, and optimization control of the system is achieved.
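Assembling the sketches above, one receding-horizon step (S101 to S103 via A1 to A5) might look as follows; every helper refers to the illustrative sketches earlier in this description, and all horizon and threshold values are placeholder assumptions:

```python
import torch

def control_step(s0, mu, sigma, q_b, v_b, prev_plan, ensemble,
                 H=20, N=100, L=1.0, N_m=10):
    """One optimization-control step: sample N trajectories of length H (A1, A2),
    prune by dynamics disagreement (A3), reward-weight the survivors (A4),
    and return the first action of the plan (A5)."""
    trajs, rets = [], []
    for _ in range(N):
        s, traj, ret = s0, [], 0.0
        for t in range(H):
            a = sample_action(s, t, mu, sigma, q_b, prev_plan)
            r, s_next = ensemble_predict(s, a)
            traj.append((s, a)); ret = ret + r; s = s_next
        trajs.append(traj); rets.append(ret + v_b(s))   # formula (16), with V_b(s_H)
    discs = [[disagreement(s, a, ensemble) for s, a in tr] for tr in trajs]
    idx = [i for i in range(N) if max(discs[i]) <= L]   # threshold pruning
    if len(idx) < N_m:                                  # fallback: smallest summed disc
        idx = sorted(range(N), key=lambda i: sum(discs[i]))[:N_m]
    acts = torch.stack([torch.stack([a for _, a in trajs[i]]) for i in idx])
    R = torch.stack([rets[i].reshape(()) for i in idx])
    plan = optimize_actions(acts, R)                    # formula (17)
    return plan[0], plan                                # execute a*_0; keep the plan
```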
Optionally, if the control task of the monitored control system changes, the recommended amount of the optimal control action is adjusted based on the target adaptive control strategy and/or the constraint control strategy.
For a given control task objective, a new scenario or a new control task objective may arise; that is, the new objective differs from the objective optimized by the real behavior strategy in the stage that generated the offline data. Even in this case, the present technical solution retains strong adaptability to the new objective.
To facilitate understanding of the process of the new target changing compared to the target optimized by the real behavior strategy in the generation of the offline data phase, the following examples are given:
for example, the initial control task goal is to improve the power generation efficiency of the wind power generation control system, and when the new goal is changed from the goal (initial control task goal) optimized by the real behavior strategy in the stage of generating the offline data, the temperature of the fan of the wind power generation control system is controlled within a certain temperature range while the power generation efficiency of the control system is improved.
Control-optimization task objective changes fall mainly into two types: first, a target-adaptive control strategy, in which the original task objective reward is adjusted so that the control policy is optimized toward the new objective; second, a constrained control strategy, in which limiting conditions on the system state are added during control to constrain the control actions while the original task reward function remains unchanged.
Target adaptive control strategy:
for new control objectives, a new reward function r is definednew=fobj(s, a) calculating a jackpot for the sampled sequence of tracks using the new reward function. Specifically, R 'is modified through calculation of track cumulative rewards'n=∑tfobj(st,at) And changes the calculation of the optimal action sequence as shown in equation (18).
Figure BDA0003556154490000121
And (3) constraint control strategy:
for a control task added with system state constraint, not only the original optimization target needs to be ensured, but also a control strategy capable of meeting the constraint condition needs to be searched. The technical scheme has two ways to adapt to the control situation:
adding a penalty based on the constraint condition to the original reward function. The penalty function defining the constraint is fp(s), then the new track sequence jackpot calculation is R'n=∑t(α·rt+(1-α)·fp(st)),rtThe original goal is awarded and the calculation of the optimal sequence of actions is varied as shown in equation (19).
Figure BDA0003556154490000131
Solving equation (19) results in a control strategy that is adaptive to the constraints.
The second is to add a constraint penalty in the trajectory pruning stage. A constraint-based penalty function $f'_p(s)$ is defined, and the uncertainty estimate is adjusted as shown in formula (20):

$\mathrm{disc}'(s, a) = \mathrm{disc}(s, a) + f'_p(s)$  (20)
By taking constraint conditions into consideration in the trajectory pruning stage, adaptation to constrained control is achieved without changing the subsequent calculation process.
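Finally, a brief sketch of the two adaptation mechanisms of formulas (18) to (20); f_obj, f_p, alpha, and f_p2 (standing for $f'_p$) are the quantities named in the text with bodies to be supplied per task, and disagreement is the assumed helper from the pruning sketch:

```python
def adapted_return(traj, f_obj):
    """Target-adaptive strategy: re-score a sampled trajectory with the new
    reward r_new = f_obj(s, a), giving R'_n for formula (18)."""
    return sum(f_obj(s, a) for s, a in traj)

def penalized_return(traj, rewards, f_p, alpha=0.5):
    """Constraint strategy, first variant: blend the original rewards with a
    state-constraint penalty, giving R'_n for formula (19)."""
    return sum(alpha * r + (1.0 - alpha) * f_p(s)
               for (s, _), r in zip(traj, rewards))

def penalized_disagreement(s, a, ensemble, f_p2):
    """Constraint strategy, second variant: fold the penalty f'_p into the
    pruning statistic, per the reconstruction of formula (20)."""
    return disagreement(s, a, ensemble) + f_p2(s)
```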
In the embodiment of the application, the training and learning of the optimization model are carried out through the off-line data set, and when a complex objective function and a nonlinear dynamical model are faced, the optimization control is carried out through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. And an optimization model is constructed by adopting a model predictive control framework, so that even if a new control task target or a control task with additional constraint is faced, retraining and learning of the optimization model are not needed, and the adaptability and the control flexibility of optimization control are improved.
Referring to fig. 2, a process involved in S102 that the current state of the system is input to a pre-constructed optimization model based on a preset optimization strategy to perform optimization processing to obtain a recommended amount of an optimization control action mainly includes the following steps:
s201: carrying out track sampling based on a dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
S202: an original trajectory sequence of N control trajectories is obtained.
S203: and selecting a target track sequence set meeting preset conditions from the original track sequence.
S204: and carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
S205: and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
The execution principle of S201-S205 is consistent with the execution principle of S102, and may be referred to herein, and is not described herein again.
In the embodiment of the application, each track in the target track sequence set is optimized through a track optimization method, so that the purpose of obtaining an optimized action sequence is achieved.
Referring to fig. 3, a process of performing trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence in S204 mainly includes the following steps:
s301: and summing the reward values of all tracks in the target track sequence set to obtain a cumulative reward value.
S302: and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
The execution principle of S301-S302 is consistent with the execution principle of S204, and it can be referred to here, which is not described again.
In the embodiment of the application, the reward values of all tracks in the target track sequence set are summed to obtain the accumulated reward value, and the action of all tracks in the target track sequence set is weighted and calculated through the accumulated reward value to achieve the purpose of obtaining the optimized action sequence.
Based on the optimization control method disclosed in fig. 1 in the foregoing embodiment, an optimization control apparatus is correspondingly disclosed in the embodiment of the present application, and as shown in fig. 4, the optimization control apparatus includes an obtaining unit 401, a processing unit 402, and an executing unit 403.
An obtaining unit 401 is configured to obtain a system current state of the control system.
The processing unit 402 is configured to input a current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy, so as to obtain an optimization control action recommendation amount; and the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework.
An executing unit 403, configured to execute a corresponding optimal control operation based on the optimal control action recommendation amount.
Further, the processing unit 402 includes a sampling module, a first obtaining module, a first selecting module, an optimizing module, and a second selecting module.
The sampling module is used for carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer of 1 or more.
The first acquisition module is used for acquiring an original track sequence of the N control tracks.
The first selection module is used for selecting a target track sequence set which meets preset conditions from the original track sequence.
And the optimization module is used for carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence.
And the second selection module is used for selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
Further, the optimization module includes a summation sub-module and a calculation sub-module.
And the summation submodule is used for summing the reward values of all the tracks in the target track sequence set to obtain the accumulated reward value.
And the calculation submodule is used for performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward values to obtain an optimized action sequence.
Further, the processing unit 402 of the building process of the dynamic characteristics model includes a second obtaining module and a first building module.
The second acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system over a preset historical period.
The first construction module is used for constructing a dynamic characteristic model through the deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on an off-line data set.
Further, the processing unit 402 of the construction process of the behavior policy model includes a third obtaining module and a second constructing module.
The third acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system in a preset history.
And the second construction module is used for constructing the behavior strategy model through the deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
Further, the processing unit 402 of the construction process of the action value function model includes a fourth obtaining module and a third constructing module.
The fourth acquisition module is used for acquiring an offline data set of the control system; the offline data set is used to characterize a set of system characteristic data accumulated by the control system in a preset history.
A third building module for building the action value function model from the offline dataset and the fitted cost function assessment FQE.
Further, the optimization control device further comprises an adjusting unit.
And the adjusting unit is used for adjusting the recommended quantity of the optimized control action based on the target adaptive control strategy and/or the constraint control strategy if the control task of the monitored control system changes.
In the embodiment of the application, the training and learning of the optimization model are carried out through the off-line data set, and when a complex objective function and a nonlinear dynamical model are faced, the optimization control is carried out through the optimization model, so that the use efficiency of data and the universality of the optimization control are improved. And moreover, the model predictive control framework is adopted to construct the optimization model, so that even if a new control task target or a control task with additional constraint is faced, retraining and learning of the optimization model are not needed, and the adaptability and the control flexibility of the optimization control method are improved.
The embodiment of the application also provides a storage medium, wherein the storage medium comprises stored instructions, and when the instructions are executed, the equipment where the storage medium is located is controlled to execute the optimization control method.
The embodiment of the present application further provides an electronic device, whose schematic structural diagram is shown in fig. 5. It specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and configured to be executed by one or more processors 503 to perform the above-mentioned optimization control method.
The specific implementation procedures and derivatives thereof of the above embodiments are within the scope of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that, as will be apparent to those skilled in the art, numerous modifications and adaptations can be made without departing from the principles of the present application and such modifications and adaptations are intended to be considered within the scope of the present application.

Claims (10)

1. An optimization control method, characterized in that the method comprises:
acquiring the current system state of a control system;
inputting the current state of the system to a pre-constructed optimization model for optimization processing based on a preset optimization strategy to obtain an optimization control action recommendation quantity; the optimization model is obtained by jointly constructing a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action value function model through a model prediction control framework;
and executing corresponding optimized control operation based on the optimized control action recommended quantity.
2. The method according to claim 1, wherein the inputting the current state of the system into a pre-constructed optimization model for processing based on a preset optimization strategy to obtain a recommended amount of optimization control actions comprises:
carrying out track sampling based on the dynamic characteristic model and the behavior strategy model to obtain N control tracks; n is an integer greater than or equal to 1;
acquiring an original track sequence of the N control tracks;
selecting a target track sequence set which meets a preset condition from the original track sequence;
carrying out track optimization on each track in the target track sequence set to obtain an optimized action sequence;
and selecting the action at the current moment in the optimized action sequence as the recommended quantity of the optimized control action in the current state of the system.
3. The method according to claim 2, wherein the performing trajectory optimization on each trajectory in the target trajectory sequence set to obtain an optimized action sequence comprises:
summing the reward values of all tracks in the target track sequence set to obtain a cumulative reward value;
and performing weighted calculation on the actions of all the tracks in the target track sequence set through the accumulated reward value to obtain an optimized action sequence.
4. The method of claim 1, wherein the dynamic characteristics model is constructed by a process comprising:
acquiring an offline data set of a control system; the off-line data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the dynamic characteristic model through a deep neural network, the current state of the system, the execution action in the current state of the system, the current reward value and the state of the system at the next moment, and performing off-line training on the dynamic characteristic model based on the off-line data set.
5. The method of claim 1, wherein the behavior strategy model is constructed by a process comprising:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
and constructing the behavior strategy model through a deep neural network, the current state of the system and the execution action in the current state of the system, and performing off-line training on the behavior strategy model based on the off-line data set.
6. The method of claim 1, wherein the act-value function model is constructed by:
acquiring an offline data set of a control system; the offline data set is used for representing a set of system characteristic data accumulated by the control system in a preset historical time period;
the action value function model is constructed from the offline data set and fitted Q evaluation (FQE).
7. The method of claim 1, further comprising:
and if the monitored control task of the control system changes, adjusting the recommended quantity of the optimized control action based on a target adaptive control strategy and/or a constraint control strategy.
8. An optimization control apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire the current system state of a control system;
a processing unit configured to input the current system state into a pre-constructed optimization model for optimization based on a preset optimization strategy to obtain an optimized control action recommendation, wherein the optimization model is constructed jointly, within a model predictive control framework, from a pre-constructed dynamic characteristic model, a pre-constructed behavior strategy model and a pre-constructed action-value function model;
and an execution unit configured to execute the corresponding optimized control operation based on the optimized control action recommendation.
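Read as software, the three units map onto a simple acquisition, processing and execution pipeline; a structural sketch with all names illustrative:

```python
class OptimizationController:
    """Acquisition -> processing -> execution pipeline of the apparatus."""
    def __init__(self, optimizer, actuator):
        self.optimizer = optimizer   # wraps the pre-constructed optimization model
        self.actuator = actuator     # interface to the controlled system

    def step(self, system):
        state = system.read_state()                # acquisition unit
        action = self.optimizer.recommend(state)   # processing unit
        self.actuator.apply(action)                # execution unit
```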
9. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform the optimization control method of any one of claims 1 to 7.
10. An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the optimization control method of any of claims 1-7.
CN202210277509.5A 2022-03-21 2022-03-21 Optimization control method and device, storage medium and electronic equipment Pending CN114625091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210277509.5A CN114625091A (en) 2022-03-21 2022-03-21 Optimization control method and device, storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN114625091A 2022-06-14

Family

ID=81904254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210277509.5A Pending CN114625091A (en) 2022-03-21 2022-03-21 Optimization control method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114625091A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950690A (en) * 2019-05-15 2020-11-17 天津科技大学 Efficient reinforcement learning strategy model with self-adaptive capacity
CN112884130A (en) * 2021-03-16 2021-06-01 浙江工业大学 SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113363997A (en) * 2021-05-28 2021-09-07 浙江大学 Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113759708A (en) * 2021-02-09 2021-12-07 京东城市(北京)数字科技有限公司 System optimization control method and device and electronic equipment
CN113885607A (en) * 2021-10-20 2022-01-04 京东城市(北京)数字科技有限公司 Steam temperature control method and device, electronic equipment and computer storage medium
CN113935463A (en) * 2021-09-30 2022-01-14 南方电网数字电网研究院有限公司 Microgrid controller based on artificial intelligence control method
CN114011564A (en) * 2021-10-22 2022-02-08 内蒙古京能康巴什热电有限公司 Coal mill control optimization method based on model offline planning
CN114065452A (en) * 2021-11-17 2022-02-18 国家电网有限公司华东分部 Power grid topology optimization and power flow control method based on deep reinforcement learning


Similar Documents

Publication Publication Date Title
Kumar et al. A deep learning architecture for predictive control
US9298172B2 (en) Method and apparatus for improved reward-based learning using adaptive distance metrics
JP5448841B2 (en) Method for computer-aided closed-loop control and / or open-loop control of technical systems, in particular gas turbines
Arif et al. Incorporation of experience in iterative learning controllers using locally weighted learning
Goulart et al. Autonomous pH control by reinforcement learning for electroplating industry wastewater
CN111260124A (en) Chaos time sequence prediction method based on attention mechanism deep learning
EP3704550B1 (en) Generation of a control system for a target system
Reyes-Reyes et al. Bounded neuro-control position regulation for a geared DC motor
CN112445136B (en) Thickener prediction control method and system based on continuous time neural network
Marwala Finite-element-model Updating Using the Response-surface Method
CN111930010A (en) LSTM network-based general MFA controller design method
Marusak A numerically efficient fuzzy MPC algorithm with fast generation of the control signal
Agyeman et al. LSTM-based model predictive control with discrete actuators for irrigation scheduling
Serrano-Pérez et al. Offline robust tuning of the motion control for omnidirectional mobile robots
Saber et al. Real-time optimization for an AVR system using enhanced Harris Hawk and IIoT
CN110454322B (en) Water turbine speed regulation control method, device and system based on multivariable dynamic matrix
CN114625091A (en) Optimization control method and device, storage medium and electronic equipment
Kordon Hybrid intelligent systems for industrial data analysis
CN111356959B (en) Method for computer-aided control of a technical system
CN114839861A (en) Intelligent PID controller online optimization method and system
JP2023106043A (en) Driving assist system, driving assist method, and program
Yang et al. Intelligent Forecasting System Using Grey Model Combined with Neural Network.
Kocijan et al. Application of Gaussian processes to the modelling and control in process engineering
Kumar et al. Architecture, performance and stability analysis of a formula-based fuzzy I− fuzzy P− fuzzy D controller
Hachiya et al. Efficient sample reuse in EM-based policy search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination