CN111324167A - Photovoltaic power generation maximum power point tracking control method and device - Google Patents

Photovoltaic power generation maximum power point tracking control method and device

Info

Publication number
CN111324167A
CN111324167A CN202010123212.4A
Authority
CN
China
Prior art keywords
action
maximum power
power generation
photovoltaic power
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010123212.4A
Other languages
Chinese (zh)
Other versions
CN111324167B (en)
Inventor
崔承刚
钱申晟
官乐乐
杨宁
张传林
陈辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Electric Power University
Original Assignee
Shanghai Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Electric Power University filed Critical Shanghai Electric Power University
Priority to CN202010123212.4A priority Critical patent/CN111324167B/en
Publication of CN111324167A publication Critical patent/CN111324167A/en
Application granted granted Critical
Publication of CN111324167B publication Critical patent/CN111324167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05FSYSTEMS FOR REGULATING ELECTRIC OR MAGNETIC VARIABLES
    • G05F1/00Automatic systems in which deviations of an electric quantity from one or more predetermined values are detected at the output of the system and fed back to a device within the system to restore the detected quantity to its predetermined value or values, i.e. retroactive systems
    • G05F1/66Regulating electric power
    • G05F1/67Regulating electric power to the maximum power available from a generator, e.g. from solar cell
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E10/00Energy generation through renewable energy sources
    • Y02E10/50Photovoltaic [PV] energy
    • Y02E10/56Power conversion systems, e.g. maximum power point trackers

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Sustainable Development (AREA)
  • Sustainable Energy (AREA)
  • Power Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Automation & Control Theory (AREA)
  • Photovoltaic Devices (AREA)

Abstract

The invention discloses a photovoltaic power generation maximum power point tracking control method and device. The method comprises the steps of: intelligently tracking the photovoltaic power generation maximum power point in a photovoltaic model; continuously interacting with the environment by using the environment's feedback signal to the reinforcement learning agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal tracking strategy; and having the agent decide an optimal energy storage scheduling strategy and track the maximum power point of photovoltaic power generation in a continuously changing environment. The beneficial effects of the invention are that the algorithm is general under fixed environmental conditions and without prior knowledge, the system has a simple structure, is not prone to misjudgment and can accurately track the maximum power point; and when the environmental conditions change suddenly, the control strategy can still track the maximum power point quickly.

Description

Photovoltaic power generation maximum power point tracking control method and device
Technical Field
The invention relates to the technical field of photovoltaic power generation maximum power point tracking, in particular to a photovoltaic power generation maximum power point tracking control method and device based on reinforcement learning.
Background
In recent years, the industry has mainly searched for the maximum power point with traditional control-theory methods. The traditional constant-voltage tracking method is simple to control and fast to track, but in places where the environmental conditions change severely its control accuracy is poor. The perturb-and-observe method has a simple overall structure and few perturbation parameters, but it requires a fairly accurate step size and is prone to the "misjudgment" phenomenon. The traditional incremental conductance method has high tracking accuracy, but it depends on a microprocessor or digital signal processor, which makes the system structure complex and the cost high.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned conventional problems.
Therefore, the technical problem solved by the invention is to provide a photovoltaic power generation maximum power point tracking control method that overcomes the need of traditional methods for a large amount of accurate prior experience and their susceptibility to misjudgment.
In order to solve the technical problem, the invention provides the following technical scheme: a photovoltaic power generation maximum power point tracking control method comprises the steps of intelligently tracking the photovoltaic power generation maximum power point in a photovoltaic model; continuously interacting with the environment by using the environment's feedback signal to the reinforcement learning agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal tracking strategy; and having the agent decide an optimal energy storage scheduling strategy and track the maximum power point of photovoltaic power generation in a continuously changing environment.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the intelligent tracking comprises the steps of modeling and describing the photovoltaic power generation maximum power point tracking process as a Markov decision process; and constructing, based on the Markov decision process, a reinforcement learning algorithm for tracking the maximum power point of photovoltaic power generation, wherein the reinforcement learning algorithm comprises an environment model, an action space model, a reward function model and a Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the photovoltaic model comprises a photovoltaic power supply, a DC/DC direct-current buck converter and a resistive load; the decision module of the DC/DC direct-current buck converter is configured as the reinforcement learning agent structure and is used for tracking the maximum power point; according to the dynamic change of the power, the change of the power per unit time is evaluated, and the scheduling strategy adjusts and tracks the maximum power point of photovoltaic power generation through the controller.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the reinforcement learning comprises the steps of setting corresponding task targets; the intelligent agent interacts with the environment through actions; the reinforcement learning algorithm utilizes the interaction data of the agent and the environment to modify the action strategy of the agent; and finally obtaining the optimal action strategy of the corresponding task after iterative learning for a plurality of times.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the training model of the reinforcement learning algorithm takes the tuple (s_t, a_t, r_{t+1}, s_{t+1}) formed by the state, the action, the reward and the next state as a sample; s_t is the current state, a_t is the action performed in the current state, r_{t+1} is the immediate reward obtained after performing the action, and s_{t+1} is the next state. The learning target of the Q-value network is defined as r_{t+1} + γ·max_a Q(s_{t+1}, a); the objective function is the reward earned by the current action plus the maximum expected value obtainable in the next step, where r_{t+1} is the reward of the next stage, γ is the discount factor and Q(s_{t+1}, a) is the Q value. The maximum expected value of the next step is multiplied by the discount factor γ to evaluate the influence of future rewards on the current state; γ ∈ [0, 1] is set according to the importance of future rewards in learning. The iterative process of the Q-value network is expressed as:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α is the learning rate.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the agent is configured such that, at each time step, the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t; the agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A:

s_{t+1} = A(s_t, a_t),  A: S × A → S

The environment returns the reward according to the current state s_t, the performed action a_t and the next state s_{t+1} through the reward function R:

r_t = R(s_t, a_t, s_{t+1}),  R: S × A × S → ℝ

After the agent takes an action in a certain state s, the accumulated return

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

is defined to measure the value of acting in the state s. Q^h(s, a) is called the state-action value function,

Q^h(s, a) = E[ G_t | s_t = s, a_t = a ],

and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a. Q*(s, a) is defined as the maximum state-action value function over all policies,

Q*(s, a) = max_h Q^h(s, a).

If Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a):

G_t* = arg max_a Q*(s, a).
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the state space model uses the following three basic states s_t to define S ∈ (I_PV; V_PV; Deg); the first two state variables I_PV and V_PV represent the current and voltage of the normalized and discretized operating point of the photovoltaic array; the state parameters I_PV and V_PV are normalized to the short-circuit current I_SC and the open-circuit voltage U_OC of the photovoltaic power supply, respectively; meanwhile, a third state quantity Deg is set, whose definition satisfies the following:

when the parameter Deg becomes zero, the maximum power point is reached;

when the parameter Deg is negative, the operating point is to the left of the MPP; when the parameter Deg is positive, the operating point is to the right of the MPP.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the motion space model comprises a motion atIs set to the duty cycle of the DC/DC converter, which has different optimal values for different operating conditions and photovoltaic power sources, and varies from 0 to 1; to ensure the computational efficiency of the algorithm herein, a discrete finite motion space set a is defined, comprising positive and negative going and comprising zero changes, characterizing the value of the change in duty cycle: s ═ S1,s1,...,sn}; the reward function model includes the use of the following reward function rt=r(at,dt)=r+(at,dt)+r-(at,dt) In the formula: a istRepresents an action value dtRepresents the load demand, r+(at,dt) Representing rewards, r, to meet user load requirements-(at,dt) Is shown asThe penalty of the load demand can be met.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the Q-value algorithm model comprises that an agent cannot be always in a known environment, so an epsilon-greedy strategy is introduced:
Figure BDA0002393628310000032
wherein epsilon is a random value, | A(s) | is an action evaluation value; the epsilon-greedy strategy is the most basic and most common strategy in reinforcement learning, and the meaning of the formula is that the probability of selecting the action which maximizes the action value function is
Figure BDA0002393628310000033
The probabilities of other actions are equal probabilities
Figure BDA0002393628310000034
An epsilon-greedy strategy is introduced to balance the relation between the utilization and the exploration of the intelligent agent in the known environment and the unknown environment, wherein the part with the maximum action value function is selected as the utilization, and any probability of other non-optimal actions is the exploration; the updating rule of Q-learning is as follows:
Figure BDA0002393628310000035
Another technical problem solved by the invention is to provide a photovoltaic power generation maximum power point tracking control device that overcomes the need of traditional methods for a large amount of accurate prior experience and their susceptibility to misjudgment.
In order to solve the technical problem, the invention provides the following technical scheme: a photovoltaic power generation maximum power point tracking control device comprises a photovoltaic model module and a reinforcement learning module; the photovoltaic model module comprises a photovoltaic power supply, a DC/DC direct-current buck converter and a resistive load; the reinforcement learning module comprises an agent interacting with the environment, and the agent further comprises a state space model module, an action space model module, a reward function model module and a Q-value algorithm model module, which are respectively used for configuring the state space model, the action space model, the reward function model and the Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
The invention has the beneficial effects that: the algorithm is universal under the conditions of fixed environmental conditions and no prior knowledge, the system has a simple structure, is not easy to generate misjudgment, and can accurately track the maximum power point; and under the condition of sudden change of environmental conditions, the control strategy can also track the maximum power point more quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a schematic diagram of an equivalent circuit structure of a reinforcement learning photovoltaic cell according to the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning process according to the present invention;
FIG. 3 is a schematic diagram of the I-V curve of the photovoltaic power supply of the present invention;
FIG. 4 is a schematic diagram of a power control strategy trajectory in a photovoltaic power generation process according to the present invention;
FIG. 5 is a schematic diagram of a power control strategy trajectory in another photovoltaic power generation process according to the present invention;
FIG. 6 is a schematic diagram of a maximum power point tracking structure of photovoltaic power generation according to the present invention;
fig. 7 is a schematic structural diagram of a photovoltaic power generation maximum power point tracking control device according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
In this embodiment, a photovoltaic power generation maximum power point tracking control method is provided to solve the problems of existing photovoltaic power generation maximum power point tracking, specifically a photovoltaic power generation maximum power point tracking control method based on Q-value reinforcement learning.

A Q-value reinforcement learning method from the field of artificial intelligence is used; reinforcement learning is a model-free, self-learning control method. Based on the self-learning characteristic of reinforcement learning, the control strategy proposed in this embodiment can overcome the defects of traditional methods, such as the need for a large amount of accurate prior experience and the tendency to misjudge. The agent continuously interacts with the Q table to obtain the optimal photovoltaic power generation maximum power point tracking control strategy and tracks the maximum power point of photovoltaic power generation in a continuously changing environment.
And (3) continuously interacting with the environment by using a feedback signal of the environment to the intelligent agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal strategy. And the intelligent agent makes a decision on an optimal energy storage scheduling strategy through interaction with the environment, and tracks the maximum power point in a continuously changing environment.
In this embodiment, the photovoltaic power generation maximum power point tracking process is first modeled and described as a Markov decision process; then, based on the Markov decision process, the modules for tracking the maximum power point of photovoltaic power generation, such as the environment model, the action space model, the reward function model and the Q-value algorithm, are developed, and intelligent control of photovoltaic power generation maximum power point tracking is realized.

Compared with the prior art, this embodiment realizes the photovoltaic power generation maximum power point tracking control strategy by utilizing a Q-learning reinforcement learning algorithm based on the tracking characteristics of the photovoltaic power generation maximum power point.
The photovoltaic power generation maximum power point tracking control method specifically comprises the following steps:
s1: intelligently tracking a photovoltaic power generation maximum power point in a photovoltaic model;
the intelligent tracking in this step includes,
modeling and describing a photovoltaic power generation maximum power point tracking process as a Markov decision process;
and constructing, based on the Markov decision process, a reinforcement learning algorithm for tracking the maximum power point of photovoltaic power generation, wherein the reinforcement learning algorithm comprises an environment model, an action space model, a reward function model and a Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation. The environment model refers to the background of the process of tracking the maximum power point of photovoltaic power generation and can be understood as the modeling of devices such as the photovoltaic power supply, the DC/DC converter and the load. In other words, the environment model is the world in which the agent operates: it takes the agent's current state and action as input, and its output is the agent's reward and the next state. For this embodiment, the environment model is specifically the photovoltaic power generation maximum power point tracking setting, the state value is the position of the operating point, and the action is the increase or decrease of the duty cycle.
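To make this concrete, the sketch below shows a Gym-style environment skeleton for the tracking problem described above (the embodiment mentions an OpenAI-Gym-based Python implementation). The I-V characteristic, the mapping from duty cycle to operating voltage, and the power-increase reward used here are placeholders assumed for illustration; only the reset/step interface and the "state = operating point, action = duty-cycle change" structure follow the text.

```python
import numpy as np

class MPPTEnv:
    """State: normalised operating point of the PV array; action: a discrete change of the duty cycle."""

    def __init__(self, v_oc=37.0, i_sc=8.0, delta_d=(-0.01, 0.0, +0.01)):
        self.v_oc, self.i_sc = v_oc, i_sc       # open-circuit voltage and short-circuit current
        self.actions = delta_d                  # discrete duty-cycle changes (assumed values)
        self.duty = 0.5
        self.prev_power = 0.0

    def _pv_current(self, v):
        # Placeholder I-V characteristic: current falls from I_sc to 0 near V_oc
        return self.i_sc * (1.0 - (v / self.v_oc) ** 8)

    def _observe(self):
        v = self.duty * self.v_oc               # assumed mapping: duty cycle -> PV operating voltage
        i = max(0.0, self._pv_current(v))
        return v, i, v * i

    def reset(self):
        self.duty = 0.5
        v, i, p = self._observe()
        self.prev_power = p
        return (round(i / self.i_sc, 2), round(v / self.v_oc, 2))

    def step(self, action_index):
        self.duty = float(np.clip(self.duty + self.actions[action_index], 0.0, 1.0))
        v, i, p = self._observe()
        reward = p - self.prev_power            # placeholder reward: increase in output power
        self.prev_power = p
        state = (round(i / self.i_sc, 2), round(v / self.v_oc, 2))
        return state, reward, False, {}
```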
S2: continuously interacting with the environment by using the environment's feedback signal to the reinforcement learning agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal tracking strategy. The agent deciding an optimal tracking strategy means tracking the maximum power point of photovoltaic power generation in a constantly changing environment: when the temperature and the illumination intensity in the photovoltaic environment change, the output power at the original operating point drops; by adjusting the duty cycle, the position of the operating point is adjusted and the output power of the photovoltaic system is finally raised. When the output power of the photovoltaic system reaches its maximum, the position of the operating point is called the maximum power point and the corresponding duty cycle is the optimal duty cycle; the optimal tracking strategy is the process by which the agent can track the optimal duty cycle at the fastest speed. The agent is obtained after each round of training; each interaction between the agent and the environment yields a different strategy, and the optimal tracking strategy is finally trained out through the reward of each round of training.
S3: the agent, by interacting with the environment, decides the optimal energy storage scheduling strategy (i.e. the optimal tracking strategy) and tracks the maximum power point of photovoltaic power generation in a constantly changing environment. The scheduling strategy can be understood as follows: the system provides a set of current and voltage to the load, but at that current and voltage the output power of the system has not reached its maximum; the most suitable current and voltage must then be tracked so as to reach the maximum power, and this tracking process is called the scheduling strategy.
The photovoltaic model in this embodiment comprises a photovoltaic power supply, a DC/DC direct-current buck converter, a resistive load and the like. The structure of the equivalent circuit of the photovoltaic model is shown in FIG. 1. The decision module of the DC/DC buck converter is configured as the reinforcement learning agent structure and serves to track the maximum power point. According to the dynamic change of the power, the change of the power per unit time is evaluated, and the scheduling strategy adjusts and tracks the maximum power point of photovoltaic power generation through the controller.

The scheduling strategy is specifically as follows: when the temperature and the illumination intensity in the photovoltaic environment change, the output power at the original operating point drops; by adjusting the duty cycle the output power of the photovoltaic system is finally raised; when the output power of the photovoltaic system reaches its maximum, the position of the operating point is called the maximum power point and the corresponding duty cycle is the optimal duty cycle. This process is called the scheduling strategy, and the controller then tracks the maximum power point of photovoltaic power generation on the basis of the scheduling strategy.

When the temperature and the illumination intensity in the photovoltaic environment change, the controller acts on the basis of the scheduling strategy: the output power at the original operating point drops, the controller adjusts the position of the operating point by adjusting the duty cycle so as to raise the output power of the photovoltaic system, and the optimal scheduling strategy can track the maximum power point in the shortest time.
Reinforcement learning is a goal-oriented intelligent method: the learner learns from the consequences of its actions without being told which actions to take. A reinforcement learning method mainly consists of an agent and an environment; a task target is set, the agent interacts with the environment through actions, the reinforcement learning algorithm modifies the agent's action strategy using the data of the interaction between the agent and the environment, and the optimal action strategy for the task is finally obtained after repeated iterative learning.
Referring to FIG. 2, a flow chart of reinforcement learning is shown. The basic idea of the training model for reinforcement learning is quite simple: the tuple (s_t, a_t, r_{t+1}, s_{t+1}) constructed from (state, action, reward, next state) is a training sample, where s_t is the current state, a_t is the action performed in the current state, r_{t+1} is the immediate reward obtained after performing the action, and s_{t+1} is the next state. The learning target of the Q-value network is r_{t+1} + γ·max_a Q(s_{t+1}, a); the objective function is the reward earned by the current action plus the maximum expected value obtainable in the next step, where r_{t+1} is the reward of the next stage, γ is the discount factor and Q(s_{t+1}, a) is the Q value. The maximum expected value of the next step is multiplied by the discount factor γ to assess the impact of future rewards on the current state, and γ ∈ [0, 1] is set according to the importance of future rewards in learning.

The iterative process of the Q-value network is expressed as:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α is the learning rate.
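The update above can be read directly as a tabular operation. The short Python sketch below is an illustrative implementation of this iterative step, assuming a dictionary-of-dictionaries Q table; the learning rate value alpha is an assumption, since it is not stated in the text.

```python
from collections import defaultdict

def q_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One iteration: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next][a2] for a2 in actions)   # maximum expected value of the next step
    td_target = r_next + gamma * best_next             # reward of the current action + discounted maximum
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

# Q table with a default value of 0 for unseen state-action pairs
Q = defaultdict(lambda: defaultdict(float))
```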
specifically, the reinforcement learning model training proposed in this embodiment is described as follows:
S21: firstly, in the reinforcement learning process, the agent is in one state at every moment and acts according to the value of the current state and its historical strategy. The agent then obtains a new observation of the environment and a return from the environment, learns from the new observation and takes a new action. This cycle is repeated until the optimal strategy is finally obtained. At each time step the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t. The agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A:

s_{t+1} = A(s_t, a_t),  A: S × A → S

The environment then returns the reward according to the current state s_t, the performed action a_t and the next state s_{t+1} through the reward function R:

r_t = R(s_t, a_t, s_{t+1}),  R: S × A × S → ℝ

S22: secondly, after the agent takes an action in a certain state s, the accumulated return

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

is defined to measure the value of acting in the state s.

Q^h(s, a) is called the state-action value function,

Q^h(s, a) = E[ G_t | s_t = s, a_t = a ],

and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a.

Q*(s, a) is defined as the maximum state-action value function over all policies,

Q*(s, a) = max_h Q^h(s, a).

If Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a):

G_t* = arg max_a Q*(s, a).
the characteristics of the photovoltaic power generation are described as follows:
The maximum power point of photovoltaic power generation is the peak on the I-V characteristic curve of the photovoltaic array; when the operating point is located at this point, the photovoltaic array reaches its maximum energy conversion efficiency and the power generated by the photovoltaic power supply is the largest. The photovoltaic array I-V curve (current-voltage curve) under constant environmental conditions is formed by the current I_PV and the voltage V_PV output by the photovoltaic array at any moment; the resulting I-V curve of a conventional photovoltaic power supply is shown in FIG. 3.
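For readers who want to reproduce a curve of the kind shown in FIG. 3, the following Python sketch generates an illustrative I-V and P-V characteristic with a simple single-diode model; all parameter values here (diode ideality factor, number of cells, temperature) are assumptions for demonstration and are not taken from the patent.

```python
import numpy as np

def pv_current(v, i_sc=8.0, v_oc=37.0, n=1.3, n_cells=60, t_cell=298.15):
    """Current of an illustrative PV string at terminal voltage v (series/shunt resistances neglected)."""
    k, q = 1.380649e-23, 1.602176634e-19
    v_t = n_cells * n * k * t_cell / q                 # thermal voltage of the whole string
    i_0 = i_sc / (np.exp(v_oc / v_t) - 1.0)            # saturation current chosen so that I(V_oc) = 0
    return i_sc - i_0 * (np.exp(v / v_t) - 1.0)

v = np.linspace(0.0, 37.0, 500)
i = np.clip(pv_current(v), 0.0, None)
p = v * i
v_mpp = v[np.argmax(p)]                                # voltage of the maximum power point on the P-V curve
```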
Further, the state space in reinforcement learning is described as follows:
For the problem of tracking the maximum power point of photovoltaic power generation, the control strategy is applicable to any photovoltaic power supply. The control method here uses the following three basic states s_t to define S ∈ (I_PV; V_PV; Deg). The first two state variables I_PV and V_PV are the current and voltage of the operating point of the photovoltaic array after normalization and discretization. The state parameters I_PV and V_PV are therefore normalized to the short-circuit current I_SC and the open-circuit voltage U_OC of the photovoltaic power supply, respectively.

A third state quantity Deg is set at the same time, whose definition satisfies the following:

when the parameter Deg becomes zero, the maximum power point is reached. When the parameter Deg is negative, the operating point is to the left of the MPP; when the parameter Deg is positive, the operating point is to the right of the MPP. This variable therefore provides a clear separation between the operating point and the MPP for different operating conditions.
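A possible construction of this state tuple is sketched below in Python. The exact formula for Deg is only given as an equation image in the original, so here its sign is derived from the change of power with voltage between two successive operating points (an assumption that merely reproduces the sign behavior described above); the number of discretization levels is likewise an assumption.

```python
import numpy as np

def make_state(i_pv, v_pv, i_sc, u_oc, prev_v, prev_p, levels=20):
    """Build the state tuple (normalised I_PV, normalised V_PV, Deg) for the Q table."""
    i_norm = round(np.clip(i_pv / i_sc, 0.0, 1.0) * levels) / levels   # normalised, discretised current
    v_norm = round(np.clip(v_pv / u_oc, 0.0, 1.0) * levels) / levels   # normalised, discretised voltage
    p = i_pv * v_pv
    dp, dv = p - prev_p, v_pv - prev_v
    slope = dp / dv if abs(dv) > 1e-9 else 0.0
    deg = -float(np.sign(slope))     # 0 at the MPP, negative to its left, positive to its right
    return (i_norm, v_norm, deg)
```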
The following describes the motion space in reinforcement learning:
For the photovoltaic power generation maximum power point tracking control problem, the action a_t is set to the duty cycle of the DC/DC converter. The duty cycle has different optimal values for different operating conditions and photovoltaic power supplies and varies between 0 and 1. To ensure the computational efficiency of the algorithm, a discrete finite action space set A is defined here, comprising positive, negative and zero changes and characterizing the change of the duty cycle: A = {a_1, a_2, ..., a_n}.
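As an illustration, one possible discrete action set and its application to the converter duty cycle is sketched below; the number of actions and the step sizes are assumptions chosen for the example, not values from the patent.

```python
# Positive, negative and zero changes of the duty cycle
ACTIONS = (-0.05, -0.01, 0.0, +0.01, +0.05)

def apply_action(duty, action_index):
    """Return the new duty cycle of the DC/DC converter, kept inside [0, 1]."""
    return min(1.0, max(0.0, duty + ACTIONS[action_index]))
```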
The reward function in reinforcement learning is illustrated as follows:
For the photovoltaic power generation maximum power point tracking control problem, constraining the agent by setting a reward function and a penalty function can effectively speed up the convergence of the algorithm and guarantee the accuracy after convergence. The following reward function is therefore used: r_t = r(a_t, d_t) = r+(a_t, d_t) + r-(a_t, d_t), where a_t represents the action value, d_t represents the load demand, r+(a_t, d_t) represents the reward for meeting the user load demand, and r-(a_t, d_t) represents the penalty for failing to meet the load demand.
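The text does not spell out the exact forms of r+ and r-, so the sketch below only illustrates the structure of such a reward: a positive term for the part of the load demand that is covered and a penalty for the part that is not. The weights w_p and w_n and the use of delivered power as a proxy for the effect of the action a_t are assumptions made for the example.

```python
def reward(power_delivered, load_demand, w_p=0.9, w_n=0.9):
    """r_t = r_plus + r_minus: reward for covering the demand, penalty for the uncovered part."""
    r_plus = w_p * min(power_delivered, load_demand)           # reward for meeting the user load demand
    r_minus = -w_n * max(0.0, load_demand - power_delivered)   # penalty for failing to meet the demand
    return r_plus + r_minus
```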
The Q value algorithm in reinforcement learning is explained as follows:
The Q-learning algorithm is a classic strategy algorithm in reinforcement learning. Its flow is as follows: first, the agent builds a Q-value table by exploring the environment and obtains the rewards fed back by the environment through continuous interaction with it, so that Q values corresponding to state-action pairs are formed in the Q-value table; the values in the Q-value table are then iteratively modified by the Q-value update rule, the probability of selecting actions with positive rewards keeps increasing while the probability of the corresponding actions with negative rewards keeps decreasing, and, with the continuous interactive screening against the environment and the change of the action-strategy set, the agent's actions finally tend to the optimal action set. However, the agent cannot always be in a known environment, so an ε-greedy strategy is introduced:

π(a|s) = 1 − ε + ε/|A(s)|, if a = arg max_a Q(s, a)
π(a|s) = ε/|A(s)|, otherwise

where ε is the exploration probability and |A(s)| is the number of available actions in state s. The ε-greedy strategy is the most basic and most commonly used strategy in reinforcement learning; the formula means that the action that maximizes the action-value function is selected with probability 1 − ε + ε/|A(s)|, while every other action is selected with equal probability ε/|A(s)|.

The ε-greedy strategy is introduced to balance the exploitation and exploration of the agent in known and unknown environments: selecting the action with the largest action-value function is exploitation, and selecting any other non-optimal action with some probability is exploration. The update rule of Q-learning is:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
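A minimal Python sketch of this ε-greedy selection is given below (it pairs naturally with the q_update sketch shown earlier); drawing a uniformly random action with probability ε reproduces the probabilities 1 − ε + ε/|A(s)| for the greedy action and ε/|A(s)| for every other action. The default value of ε here is an arbitrary placeholder.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, otherwise exploit the Q table."""
    if random.random() < epsilon:
        return random.randrange(n_actions)              # exploration: any action with equal probability
    values = [Q[state][a] for a in range(n_actions)]
    return values.index(max(values))                    # exploitation: action that maximises Q(s, a)
```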
In this embodiment the state quantities are the current, the voltage and the angle, the control quantity is the duty cycle, and the reward function is set as a limit on the change of current and voltage. Because the control quantity is the duty cycle, the current and voltage are influenced by controlling how the duty cycle changes, so that the maximum power point can finally be tracked. In the conventional methods the control strategy is both deterministic and fixed, which gives the conventional control strategy poor versatility: when conditions change, the corresponding step size of the control has to be reset. The control strategy proposed in this embodiment is a Q-value learning method based on reinforcement learning and belongs to the artificial intelligence methods. Compared with the traditional methods, this embodiment provides a self-learning method: when the environment changes, the constructed agent explores the new environment under the constraint of the reward function, the Q value obtained in each exploration is recorded and placed in the Q-value table, and finally, after the agent has covered the whole environment, the optimal strategy, i.e. the control strategy of this embodiment, is obtained by selecting the action with the largest Q value from the complete Q table.
Example 2
Referring to the schematic diagrams of FIG. 6 and FIG. 7, the present embodiment provides a photovoltaic power generation maximum power point tracking control apparatus; the method of the above embodiment is implemented by means of this apparatus. The apparatus specifically comprises a photovoltaic model module 100, where the photovoltaic model module 100 comprises a photovoltaic power supply, a DC/DC buck converter and a resistive load; and a reinforcement learning module 200, which comprises an agent interacting with the environment and further comprises a state space model module 201, an action space model module 202, a reward function model module 203 and a Q-value algorithm model module 204, which are respectively used for configuring the state space model, the action space model, the reward function model and the Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
More specifically, the photovoltaic power generation maximum power point tracking structure based on reinforcement learning is shown in fig. 6, and includes a photovoltaic power generation maximum power point tracking control strategy method based on reinforcement learning and a photovoltaic cell model, which are used for tracking a maximum power point.
The method comprises finding the optimal tracking strategy for the maximum power point based on the Q-value network (i.e. the reinforcement learning network) of the reinforcement-learning-based photovoltaic power generation maximum power point tracking control strategy.
The Q value network comprises the following specific working steps:
Firstly, in the reinforcement learning process, the agent is in one state at every moment and acts according to the value of the current state and its historical strategy. The agent then obtains a new observation of the environment and a return from the environment, learns from the new observation and takes a new action. This cycle is repeated until the optimal strategy is finally obtained. At each time step the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t. The agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A: s_{t+1} = A(s_t, a_t).

The environment then returns the reward through the reward function R according to the current state s_t, the performed action a_t and the next state s_{t+1}.

Secondly, after the agent takes an action in a certain state s, the accumulated return Q^h(s, a) is defined to measure the value of acting in the state s; Q^h(s, a) is called the state-action value function and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a. Q*(s, a) is defined as the maximum state-action value function over all policies; if Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a).
Example 3
In order to evaluate the effectiveness and accuracy of the control strategy proposed in this embodiment, simulations are carried out respectively under fixed environmental conditions (illumination intensity and temperature), under changing environmental conditions, and so on, and the effectiveness and accuracy of the proposed control strategy are verified. In this embodiment, a control method for tracking the maximum power point based on improved Q-value reinforcement learning is implemented in Python on the basis of the OpenAI Gym design and aims to find the MPP quickly. In this set of simulations, a photovoltaic power supply with an open-circuit voltage V_oc of 37 V and a short-circuit current I_SC of 8 A is used, and simulations are performed under a NOT environment (temperature 25 °C, irradiance 1000 W/m²) and under an STC condition (temperature 47 °C, irradiance 800 W/m²), respectively, to verify the effectiveness of the proposed control strategy under constant temperature and irradiance.

In this simulation a hardware controller is considered (the system provides a set of current and voltage to the load, but at that current and voltage the output power of the system has not reached its maximum; the most suitable current and voltage must then be tracked to reach the maximum power, and this tracking process is called the scheduling strategy, so the system is adjusted by the controller and the adjustment tracks the maximum power point of photovoltaic power generation according to the scheduling strategy), and the control period is set to T = 0.01 s. In the reinforcement learning algorithm, W_p = 0.9, W_n = 0.9, K_D = 1000, K_C = 1000, γ = 0.9 and ε = 0.8 are set. Since the environment is unknown at first, the agent explores at the beginning. The random oscillation of the early power error reflects the exploration process of the algorithm: the agent randomly selects actions to explore the environment and builds the corresponding Q-value (action-value pair) table; once the agent has fully explored the environment space the algorithm converges, and the agent can effectively track the maximum power point according to the largest Q value. The two simulations verify the effectiveness of the proposed algorithm in tracking the MPP under constant temperature and illumination intensity, as follows.

The trajectory of the power control strategy during photovoltaic power generation (FIG. 4) shows that, in the NOT environment (temperature 25 °C, irradiance 1000 W/m²), the proposed agent converges after 198 iterations of training (1.98 s). As shown in FIG. 4, because the environment is initially unknown, the agent explores first; this stage appears as an oscillating curve. The random oscillation reflects the exploration process of the algorithm, during which the agent randomly selects actions to explore the environment and builds the corresponding Q-value (action-value pair) table, until the algorithm converges once the environment space has been fully explored and the agent can effectively track the maximum power point according to the largest Q value. The exploration curve therefore turns into small-amplitude oscillation after 1.98 s, indicating that the agent can track the maximum power point with the strategy it has obtained through self-learning, which proves the effectiveness of the proposed control strategy in the NOT environment.

The trajectory of the power control strategy during photovoltaic power generation (FIG. 5) shows that, under STC (temperature 47 °C, irradiance 800 W/m²), the agent of the proposed strategy converges after 280 iterations of training (2.8 s). As shown in FIG. 5, the agent first explores randomly, then builds the corresponding Q-value table, and finally obtains the optimal strategy from the Q-value table and successfully tracks the maximum power point under that optimal strategy; that the agent has successfully tracked the maximum power point is shown by the power difference (the difference between the current power and the maximum power) tending to zero, which proves the effectiveness of the proposed method in the STC environment.
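To make the simulation setup concrete, the sketch below shows what a training loop of this kind could look like in Python, reusing the MPPTEnv, epsilon_greedy and q_update sketches from the earlier sections. The episode count, the number of steps per episode and the ε-decay schedule are illustrative assumptions, not the exact configuration used in the patent.

```python
from collections import defaultdict

def train(env, n_episodes=300, steps_per_episode=50,
          alpha=0.1, gamma=0.9, epsilon=0.8, n_actions=3):
    """Tabular Q-learning loop: explore, fill the Q table, then exploit the largest Q values."""
    Q = defaultdict(lambda: defaultdict(float))
    actions = range(n_actions)
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            a = epsilon_greedy(Q, s, n_actions, epsilon)
            s_next, r, done, _ = env.step(a)
            q_update(Q, s, a, r, s_next, actions, alpha, gamma)
            s = s_next
            if done:
                break
        epsilon = max(0.05, epsilon * 0.99)   # gradually shift from exploration to exploitation
    return Q

# Example usage: Q = train(MPPTEnv())
```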
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein. A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. A photovoltaic power generation maximum power point tracking control method, characterized by comprising the following steps:
intelligently tracking a photovoltaic power generation maximum power point in a photovoltaic model;
continuously interacting with the environment by using a feedback signal of an agent in reinforcement learning, adjusting and improving intelligent decision-making behaviors, and obtaining an optimal tracking strategy;
and tracking the maximum power point of the photovoltaic power generation in a constantly changing environment according to the optimal tracking strategy decided by the intelligent agent.
2. The photovoltaic power generation maximum power point tracking control method according to claim 1, characterized in that: the intelligent tracking includes the steps of,
modeling and describing a photovoltaic power generation maximum power point tracking process as a Markov decision process;
and constructing, based on the Markov decision process, a reinforcement learning algorithm for tracking the maximum power point of photovoltaic power generation, wherein the reinforcement learning algorithm comprises an environment model, an action space model, a reward function model and a Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
3. The photovoltaic power generation maximum power point tracking control method according to claim 2, characterized in that: the photovoltaic model comprises a photovoltaic power supply, a DC/DC direct-current buck converter and a resistive load;

the decision module of the DC/DC direct-current buck converter is configured as the reinforcement learning agent structure and is used for tracking the maximum power point; according to the dynamic change of the power, the change of the power per unit time is evaluated, and the scheduling strategy adjusts and tracks the maximum power point of photovoltaic power generation through the controller.
4. The photovoltaic power generation maximum power point tracking control method according to claim 2 or 3, characterized in that: the reinforcement learning includes the steps of,
setting a corresponding task target;
the intelligent agent interacts with the environment through actions;
the reinforcement learning algorithm utilizes the interaction data of the agent and the environment to modify the action strategy of the agent;
and finally obtaining the optimal action strategy of the corresponding task after iterative learning for a plurality of times.
5. The photovoltaic power generation maximum power point tracking control method according to claim 4, characterized in that: the training model of the reinforcement learning algorithm includes,
the tuple (s_t, a_t, r_{t+1}, s_{t+1}) formed by the state, the action, the reward and the next state as a sample;

s_t is the current state, a_t is the action performed in the current state, r_{t+1} is the immediate reward obtained after performing the action, and s_{t+1} is the next state;

the learning target of the Q-value network is defined as r_{t+1} + γ·max_a Q(s_{t+1}, a); the objective function is the reward earned by the current action plus the maximum expected value obtainable in the next step, where r_{t+1} is the reward of the next stage, γ is the discount factor and Q(s_{t+1}, a) is the Q value;

the maximum expected value of the next step is multiplied by the discount factor γ to evaluate the influence of future rewards on the current state;

γ ∈ [0, 1] is set according to the importance of future rewards in learning;

the iterative process of the Q-value network is expressed as:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α is the learning rate.
6. The photovoltaic power generation maximum power point tracking control method according to claim 5, characterized in that: the agent is configured such that,

at each time step, the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t;

the agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A:

s_{t+1} = A(s_t, a_t),  A: S × A → S

the environment returns the reward according to the current state s_t, the performed action a_t and the next state s_{t+1} through the reward function R:

r_t = R(s_t, a_t, s_{t+1}),  R: S × A × S → ℝ

after the agent takes an action in a certain state s, the accumulated return

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

is defined to measure the value of acting in the state s;

Q^h(s, a) is called the state-action value function,

Q^h(s, a) = E[ G_t | s_t = s, a_t = a ],

and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a;

Q*(s, a) is defined as the maximum state-action value function over all policies,

Q*(s, a) = max_h Q^h(s, a);

if Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a):

G_t* = arg max_a Q*(s, a).
7. the photovoltaic power generation maximum power point tracking control method according to claim 5 or 6, characterized in that: the state space model comprises a model of the state space,
the state s_t is defined by the following three basic quantities, s_t ∈ S = (I_PV; V_PV; Deg); the first two state variables I_PV and V_PV represent the normalized and discretized current and voltage of the photovoltaic array operating point;
the state parameters I_PV and V_PV are normalized to the short-circuit current I_SC and the open-circuit voltage U_OC of the photovoltaic power supply, respectively; meanwhile, a third state quantity Deg is set, which is defined as follows:
[the defining formula of Deg appears as image FDA0002393628300000028 in the original and is not reproduced here]
when the parameter Deg becomes zero, the maximum power point is reached;
when the parameter Deg is negative, the operating point is characterized as lying to the left of the MPP;
when the parameter Deg is positive, the operating point is characterized as lying to the right of the MPP.
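A minimal sketch of building the discretized state of this claim; the bin count is an assumption, and the Deg value is taken from the patent's own (unreproduced) formula rather than computed here.

    import numpy as np

    def make_state(i_pv, v_pv, deg, i_sc, u_oc, bins=20):
        i_norm = np.clip(i_pv / i_sc, 0.0, 1.0)      # normalize current by the short-circuit current
        v_norm = np.clip(v_pv / u_oc, 0.0, 1.0)      # normalize voltage by the open-circuit voltage
        i_idx = min(int(i_norm * bins), bins - 1)     # discretize onto a finite grid
        v_idx = min(int(v_norm * bins), bins - 1)
        return (i_idx, v_idx, int(np.sign(deg)))      # sign of Deg: -1 left of MPP, 0 at MPP, +1 right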
8. The photovoltaic power generation maximum power point tracking control method according to claim 7, characterized in that:
the action space model comprises,
the action a_t is set to act on the duty cycle of the DC/DC converter, which has different optimal values for different operating conditions and photovoltaic power sources and varies between 0 and 1;
to ensure the computational efficiency of the algorithm, a discrete finite action space set A = {a_1, a_2, ..., a_n} is defined, comprising positive, negative and zero changes, each element characterizing a change in the duty cycle;
the reward function model comprises,
the following reward function is used: r_t = r(a_t, d_t) = r+(a_t, d_t) + r-(a_t, d_t), where a_t represents the action value, d_t represents the load demand, r+(a_t, d_t) represents the reward for meeting the user load demand, and r-(a_t, d_t) represents the penalty for failing to meet the load demand.
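A minimal sketch of the discrete duty-cycle action set and a reward combining a positive and a negative term; the concrete step sizes and the piecewise reward terms are illustrative assumptions, since the patent only states r_t = r+(a_t, d_t) + r-(a_t, d_t).

    ACTIONS = [-0.05, -0.01, 0.0, 0.01, 0.05]   # candidate duty-cycle changes: negative, zero and positive steps

    def reward(power_supplied, load_demand):
        if power_supplied >= load_demand:
            r_plus, r_minus = power_supplied, 0.0                    # reward term when the user load demand is met
        else:
            r_plus, r_minus = 0.0, power_supplied - load_demand      # negative penalty term for the shortfall
        return r_plus + r_minus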
9. The photovoltaic power generation maximum power point tracking control method according to claim 8, characterized in that: the Q-value algorithm model comprises the following components,
the agent cannot always be in a known environment, so an epsilon-greedy strategy is introduced:
π(a|s) = 1 − ε + ε/|A(s)|, if a = argmax_{a'} Q(s, a'); π(a|s) = ε/|A(s)|, otherwise;
where ε is the exploration probability and |A(s)| is the number of candidate actions in state s;
the ε-greedy strategy is the most basic and most common strategy in reinforcement learning; the meaning of the formula is that the probability of selecting the action which maximizes the action value function is 1 − ε + ε/|A(s)|, and each of the other actions is selected with the equal probability ε/|A(s)|;
the ε-greedy strategy is introduced to balance exploitation and exploration of the agent between the known and the unknown environment: selecting the action with the maximum action value function is exploitation, while selecting any other non-optimal action with small probability is exploration;
the updating rule of Q-learning is as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α·[ r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t) ], where α is the learning rate.
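A minimal sketch of the ε-greedy selection and the tabular Q-learning update; the learning rate value, the 2-D table layout and the integer state index are assumptions, and rng can be created with np.random.default_rng().

    import numpy as np

    def epsilon_greedy(q_table, state, epsilon, rng):
        n_actions = q_table.shape[1]
        if rng.random() < epsilon:                    # explore: pick any action with equal probability
            return int(rng.integers(n_actions))
        return int(np.argmax(q_table[state]))         # exploit: action maximizing Q(s, a)

    def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
        td_target = r + gamma * np.max(q_table[s_next])        # r_{t+1} + gamma * max_a Q(s_{t+1}, a)
        q_table[s, a] += alpha * (td_target - q_table[s, a])   # move Q(s_t, a_t) toward the target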
10. A photovoltaic power generation maximum power point tracking control device, characterized by comprising:
a photovoltaic model module (100), the photovoltaic model module (100) comprising a photovoltaic power source, a DC/DC direct current buck converter and a resistive load;
a reinforcement learning module (200), the reinforcement learning module (200) comprising an agent interacting with the environment, the agent further comprising a state space model module (201), an action space model module (202), a reward function model module (203) and a Q-value algorithm model module (204), which are respectively used to configure the state space model, the action space model, the reward function model and the Q-value algorithm model so as to realize intelligent tracking of the maximum power point of the photovoltaic power generation.
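A minimal structural sketch of the device modules listed in claim 10; the class and attribute names are illustrative, not the patented implementation.

    from dataclasses import dataclass

    @dataclass
    class PhotovoltaicModelModule:          # module (100)
        pv_source: object
        dc_dc_buck_converter: object
        resistive_load: object

    @dataclass
    class ReinforcementLearningModule:      # module (200)
        state_space_model: object           # module (201)
        action_space_model: object          # module (202)
        reward_function_model: object       # module (203)
        q_value_algorithm_model: object     # module (204)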
CN202010123212.4A 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method Active CN111324167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010123212.4A CN111324167B (en) 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010123212.4A CN111324167B (en) 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method

Publications (2)

Publication Number Publication Date
CN111324167A true CN111324167A (en) 2020-06-23
CN111324167B CN111324167B (en) 2022-07-01

Family

ID=71169004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010123212.4A Active CN111324167B (en) 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method

Country Status (1)

Country Link
CN (1) CN111324167B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001585A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-agent decision method and device, electronic equipment and storage medium
CN114967821A (en) * 2022-03-29 2022-08-30 武汉城市职业学院 Photovoltaic power generation system maximum power tracking control method based on reinforcement learning
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106487011A (en) * 2016-11-28 2017-03-08 东南大学 A kind of based on the family of Q study microgrid energy optimization method
CN109443366A (en) * 2018-12-20 2019-03-08 北京航空航天大学 A kind of unmanned aerial vehicle group paths planning method based on improvement Q learning algorithm
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN110399006A (en) * 2019-08-28 2019-11-01 江苏提米智能科技有限公司 Two-sided photovoltaic module maximum generating watt angle control method based on big data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106487011A (en) * 2016-11-28 2017-03-08 东南大学 A kind of based on the family of Q study microgrid energy optimization method
CN109443366A (en) * 2018-12-20 2019-03-08 北京航空航天大学 A kind of unmanned aerial vehicle group paths planning method based on improvement Q learning algorithm
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN110399006A (en) * 2019-08-28 2019-11-01 江苏提米智能科技有限公司 Two-sided photovoltaic module maximum generating watt angle control method based on big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUAN-YU CHOU等: "Maximum Power Point Tracking of Photovoltaic System Based on Reinforcement Learning", 《2019 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - TAIWAN (ICCE-TW)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001585A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-agent decision method and device, electronic equipment and storage medium
CN112001585B (en) * 2020-07-14 2023-09-22 北京百度网讯科技有限公司 Multi-agent decision method, device, electronic equipment and storage medium
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method
CN114967821A (en) * 2022-03-29 2022-08-30 武汉城市职业学院 Photovoltaic power generation system maximum power tracking control method based on reinforcement learning

Also Published As

Publication number Publication date
CN111324167B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN111324167B (en) Photovoltaic power generation maximum power point tracking control method
Qiang et al. Reinforcement learning model, algorithms and its application
Ekinci et al. An effective control design approach based on novel enhanced aquila optimizer for automatic voltage regulator
Ghamari et al. Fractional‐order fuzzy PID controller design on buck converter with antlion optimization algorithm
Daniel et al. Active reward learning with a novel acquisition function
Zhang et al. Reinforcement Learning in Robot Path Optimization.
CN115470704B (en) Dynamic multi-objective optimization method, device, equipment and computer readable medium
Yin et al. Multi-step depth model predictive control for photovoltaic power systems based on maximum power point tracking techniques
Kubalík et al. Optimal control via reinforcement learning with symbolic policy approximation
Luo et al. An adaptive adjustment strategy for bolt posture errors based on an improved reinforcement learning algorithm
Nagy et al. Reinforcement learning for intelligent environments: A Tutorial
CN109635915A (en) A kind of iterative learning control method based on balanced single evolution cuckoo algorithm
CN113885328A (en) Nuclear power tracking control method based on integral reinforcement learning
Yu et al. A robust method based on reinforcement learning and differential evolution for the optimal photovoltaic parameter extraction
CN114139778A (en) Wind turbine generator power prediction modeling method and device
Kumar et al. An adaptive particle swarm optimization algorithm for robust trajectory tracking of a class of under actuated system
Bolland et al. Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent
Arshad et al. Deep Deterministic Policy Gradient to Regulate Feedback Control Systems Using Reinforcement Learning.
CN111222718A (en) Maximum power point tracking method and device of wind energy conversion system
WO2022248720A1 (en) Multi-objective reinforcement learning using weighted policy projection
Yang et al. PMDRL: Pareto-front-based multi-objective deep reinforcement learning
Dey et al. Reinforcement Learning Building Control: An Online Approach with Guided Exploration using Surrogate Models
Zhang et al. Using partial-policy q-learning to plan path for robot navigation in unknown enviroment
Korich et al. Shifting of Nonlinear Phenomenon in the Boost converter Using Aquila Optimizer
Hu et al. Q-value Regularized Transformer for Offline Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant