CN111324167A - Photovoltaic power generation maximum power point tracking control method and device - Google Patents

Photovoltaic power generation maximum power point tracking control method and device

Info

Publication number
CN111324167A
CN111324167A CN202010123212.4A
Authority
CN
China
Prior art keywords
action
maximum power
power generation
photovoltaic power
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010123212.4A
Other languages
Chinese (zh)
Other versions
CN111324167B (en)
Inventor
崔承刚
钱申晟
官乐乐
杨宁
张传林
陈辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Electric Power University
Original Assignee
Shanghai Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Electric Power University filed Critical Shanghai Electric Power University
Priority to CN202010123212.4A priority Critical patent/CN111324167B/en
Publication of CN111324167A publication Critical patent/CN111324167A/en
Application granted granted Critical
Publication of CN111324167B publication Critical patent/CN111324167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05FSYSTEMS FOR REGULATING ELECTRIC OR MAGNETIC VARIABLES
    • G05F1/00Automatic systems in which deviations of an electric quantity from one or more predetermined values are detected at the output of the system and fed back to a device within the system to restore the detected quantity to its predetermined value or values, i.e. retroactive systems
    • G05F1/66Regulating electric power
    • G05F1/67Regulating electric power to the maximum power available from a generator, e.g. from solar cell
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E10/00Energy generation through renewable energy sources
    • Y02E10/50Photovoltaic [PV] energy
    • Y02E10/56Power conversion systems, e.g. maximum power point trackers

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Sustainable Development (AREA)
  • Sustainable Energy (AREA)
  • Power Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Automation & Control Theory (AREA)
  • Photovoltaic Devices (AREA)

Abstract

The invention discloses a photovoltaic power generation maximum power point tracking control method and device. The method comprises the steps of: intelligently tracking the photovoltaic power generation maximum power point in a photovoltaic model; continuously interacting with the environment by using the environment's feedback signal to the reinforcement learning agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal tracking strategy; and having the agent decide an optimal energy storage scheduling strategy and track the maximum power point of photovoltaic power generation in a continuously changing environment. The beneficial effects of the invention are that the algorithm is general under fixed environmental conditions and without prior knowledge, the system has a simple structure, is not prone to misjudgment and can accurately track the maximum power point; and when the environmental conditions change suddenly, the control strategy can still track the maximum power point quickly.

Description

Photovoltaic power generation maximum power point tracking control method and device
Technical Field
The invention relates to the technical field of photovoltaic power generation maximum power point tracking, in particular to a photovoltaic power generation maximum power point tracking control method and device based on reinforcement learning.
Background
In recent years, the industry has mainly searched for the maximum power point with traditional control-theory methods. The traditional constant-voltage tracking method is simple to control and fast to track, but in places where the environmental conditions change severely its control accuracy is poor. The perturb-and-observe method has a simple overall structure and few perturbation parameters, but it requires a fairly accurate step size and is prone to the "misjudgment" phenomenon. The traditional incremental conductance method has high tracking accuracy, but it depends on a microprocessor or digital signal processor, which makes the system structure complex and the cost high.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned conventional problems.
Therefore, the technical problem solved by the invention is to provide a photovoltaic power generation maximum power point tracking control method that overcomes the need of traditional methods for a large amount of accurate prior experience and their susceptibility to misjudgment.
In order to solve the technical problem, the invention provides the following technical scheme: a photovoltaic power generation maximum power point tracking control method comprises the steps of intelligently tracking the photovoltaic power generation maximum power point in a photovoltaic model; continuously interacting with the environment by using the environment's feedback signal to the reinforcement learning agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal tracking strategy; and having the agent decide an optimal energy storage scheduling strategy and track the maximum power point of photovoltaic power generation in a continuously changing environment.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the intelligent tracking comprises the steps of modeling and describing the photovoltaic power generation maximum power point tracking process as a Markov decision process; and constructing, based on the Markov decision process, a reinforcement learning algorithm for tracking the maximum power point of photovoltaic power generation, wherein the reinforcement learning algorithm comprises an environment model, an action space model, a reward function model and a Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the photovoltaic model comprises a photovoltaic power supply, a DC/DC direct-current buck converter and a resistive load; the decision module of the DC/DC direct-current buck converter is configured as the reinforcement learning agent structure and is used for tracking the maximum power point; according to the dynamic change of the power, the change of the power per unit time is evaluated, and the scheduling strategy adjusts and tracks the maximum power point of photovoltaic power generation through the controller.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the reinforcement learning comprises the steps of setting corresponding task targets; the intelligent agent interacts with the environment through actions; the reinforcement learning algorithm utilizes the interaction data of the agent and the environment to modify the action strategy of the agent; and finally obtaining the optimal action strategy of the corresponding task after iterative learning for a plurality of times.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the training model of the reinforcement learning algorithm takes the tuple (s_t, a_t, r_{t+1}, s_{t+1}) formed by the state, the action, the reward and the next state as a sample; s_t is the current state, a_t is the action performed in the current state, r_{t+1} is the immediate reward obtained after performing the action, and s_{t+1} is the next state. The learning target of the Q-value network is defined as r_{t+1} + γ·max_a Q(s_{t+1}, a); the objective function is the reward earned by the current action plus the maximum expected value obtainable in the next step, where r_{t+1} is the reward of the next stage, γ is the discount factor and Q(s_{t+1}, a) is the Q value. The maximum expected value of the next step is multiplied by the discount factor γ to evaluate the influence of future rewards on the current state; γ ∈ [0, 1] is set according to the importance of future rewards in learning. The iterative process of the Q-value network is expressed as:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α is the learning rate.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the agent is configured such that, at each time step, the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t; the agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A:

s_{t+1} = A(s_t, a_t),  A: S × A → S

The environment returns the reward according to the current state s_t, the performed action a_t and the next state s_{t+1} through the reward function R:

r_t = R(s_t, a_t, s_{t+1}),  R: S × A × S → ℝ

After the agent takes an action in a certain state s, the accumulated return

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

is defined to measure the value of acting in the state s. Q^h(s, a) is called the state-action value function,

Q^h(s, a) = E[ G_t | s_t = s, a_t = a ],

and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a. Q*(s, a) is defined as the maximum state-action value function over all policies,

Q*(s, a) = max_h Q^h(s, a).

If Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a):

G_t* = arg max_a Q*(s, a).
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the state space model uses the following three basic states s_t to define S ∈ (I_PV; V_PV; Deg); the first two state variables I_PV and V_PV represent the current and voltage of the normalized and discretized operating point of the photovoltaic array; the state parameters I_PV and V_PV are normalized to the short-circuit current I_SC and the open-circuit voltage U_OC of the photovoltaic power supply, respectively; meanwhile, a third state quantity Deg is set, whose definition satisfies the following:

when the parameter Deg becomes zero, the maximum power point is reached;

when the parameter Deg is negative, the operating point is to the left of the MPP; when the parameter Deg is positive, the operating point is to the right of the MPP.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the motion space model comprises a motion atIs set to the duty cycle of the DC/DC converter, which has different optimal values for different operating conditions and photovoltaic power sources, and varies from 0 to 1; to ensure the computational efficiency of the algorithm herein, a discrete finite motion space set a is defined, comprising positive and negative going and comprising zero changes, characterizing the value of the change in duty cycle: s ═ S1,s1,...,sn}; the reward function model includes the use of the following reward function rt=r(at,dt)=r+(at,dt)+r-(at,dt) In the formula: a istRepresents an action value dtRepresents the load demand, r+(at,dt) Representing rewards, r, to meet user load requirements-(at,dt) Is shown asThe penalty of the load demand can be met.
As a preferable scheme of the photovoltaic power generation maximum power point tracking control method of the present invention, wherein: the Q-value algorithm model comprises that an agent cannot be always in a known environment, so an epsilon-greedy strategy is introduced:
Figure BDA0002393628310000032
wherein epsilon is a random value, | A(s) | is an action evaluation value; the epsilon-greedy strategy is the most basic and most common strategy in reinforcement learning, and the meaning of the formula is that the probability of selecting the action which maximizes the action value function is
Figure BDA0002393628310000033
The probabilities of other actions are equal probabilities
Figure BDA0002393628310000034
An epsilon-greedy strategy is introduced to balance the relation between the utilization and the exploration of the intelligent agent in the known environment and the unknown environment, wherein the part with the maximum action value function is selected as the utilization, and any probability of other non-optimal actions is the exploration; the updating rule of Q-learning is as follows:
Figure BDA0002393628310000035
Another technical problem solved by the invention is to provide a photovoltaic power generation maximum power point tracking control device that overcomes the need of traditional methods for a large amount of accurate prior experience and their susceptibility to misjudgment.
In order to solve the technical problem, the invention provides the following technical scheme: a photovoltaic power generation maximum power point tracking control device comprises a photovoltaic model module and a reinforcement learning module; the photovoltaic model module comprises a photovoltaic power supply, a DC/DC direct-current buck converter and a resistive load; the reinforcement learning module comprises an agent interacting with the environment, and the agent further comprises a state space model module, an action space model module, a reward function model module and a Q-value algorithm model module, which are respectively used for configuring the state space model, the action space model, the reward function model and the Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
The invention has the beneficial effects that: the algorithm is universal under the conditions of fixed environmental conditions and no prior knowledge, the system has a simple structure, is not easy to generate misjudgment, and can accurately track the maximum power point; and under the condition of sudden change of environmental conditions, the control strategy can also track the maximum power point more quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a schematic diagram of an equivalent circuit structure of a reinforcement learning photovoltaic cell according to the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning process according to the present invention;
FIG. 3 is a schematic diagram of the I-V curve of the photovoltaic power supply of the present invention;
FIG. 4 is a schematic diagram of a power control strategy trajectory in a photovoltaic power generation process according to the present invention;
FIG. 5 is a schematic diagram of a power control strategy trajectory in another photovoltaic power generation process according to the present invention;
FIG. 6 is a schematic diagram of a maximum power point tracking structure of photovoltaic power generation according to the present invention;
fig. 7 is a schematic structural diagram of a photovoltaic power generation maximum power point tracking control device according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
In this embodiment, a photovoltaic power generation maximum power point tracking control method is provided to solve the problems of existing photovoltaic power generation maximum power point tracking, specifically a photovoltaic power generation maximum power point tracking control method based on Q-value reinforcement learning.

A Q-value reinforcement learning method from the field of artificial intelligence is used; reinforcement learning is a model-free, self-learning control method. Based on the self-learning characteristic of reinforcement learning, the control strategy proposed in this embodiment can overcome the defects of traditional methods, such as the need for a large amount of accurate prior experience and the tendency to misjudge. The agent continuously interacts with the Q table to obtain the optimal photovoltaic power generation maximum power point tracking control strategy and tracks the maximum power point of photovoltaic power generation in a continuously changing environment.
And (3) continuously interacting with the environment by using a feedback signal of the environment to the intelligent agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal strategy. And the intelligent agent makes a decision on an optimal energy storage scheduling strategy through interaction with the environment, and tracks the maximum power point in a continuously changing environment.
In this embodiment, the photovoltaic power generation maximum power point tracking process is first modeled and described as a Markov decision process; then, based on the Markov decision process, the modules for tracking the maximum power point of photovoltaic power generation, such as the environment model, the action space model, the reward function model and the Q-value algorithm, are developed, and intelligent control of photovoltaic power generation maximum power point tracking is realized.

Compared with the prior art, this embodiment realizes the photovoltaic power generation maximum power point tracking control strategy by utilizing a Q-learning reinforcement learning algorithm based on the tracking characteristics of the photovoltaic power generation maximum power point.
The photovoltaic power generation maximum power point tracking control method specifically comprises the following steps:
s1: intelligently tracking a photovoltaic power generation maximum power point in a photovoltaic model;
the intelligent tracking in this step includes,
modeling and describing a photovoltaic power generation maximum power point tracking process as a Markov decision process;
and constructing, based on the Markov decision process, a reinforcement learning algorithm for tracking the maximum power point of photovoltaic power generation, wherein the reinforcement learning algorithm comprises an environment model, an action space model, a reward function model and a Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation. The environment model refers to the background of the process of tracking the maximum power point of photovoltaic power generation and can be understood as the modeling of devices such as the photovoltaic power supply, the DC/DC converter and the load. In other words, the environment model is the world in which the agent operates: it takes the agent's current state and action as input, and its output is the agent's reward and the next state. For this embodiment, the environment model is specifically the photovoltaic power generation maximum power point tracking setting, the state value is the position of the operating point, and the action is the increase or decrease of the duty cycle.
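To make this concrete, the sketch below shows a Gym-style environment skeleton for the tracking problem described above (the embodiment mentions an OpenAI-Gym-based Python implementation). The I-V characteristic, the mapping from duty cycle to operating voltage, and the power-increase reward used here are placeholders assumed for illustration; only the reset/step interface and the "state = operating point, action = duty-cycle change" structure follow the text.

```python
import numpy as np

class MPPTEnv:
    """State: normalised operating point of the PV array; action: a discrete change of the duty cycle."""

    def __init__(self, v_oc=37.0, i_sc=8.0, delta_d=(-0.01, 0.0, +0.01)):
        self.v_oc, self.i_sc = v_oc, i_sc       # open-circuit voltage and short-circuit current
        self.actions = delta_d                  # discrete duty-cycle changes (assumed values)
        self.duty = 0.5
        self.prev_power = 0.0

    def _pv_current(self, v):
        # Placeholder I-V characteristic: current falls from I_sc to 0 near V_oc
        return self.i_sc * (1.0 - (v / self.v_oc) ** 8)

    def _observe(self):
        v = self.duty * self.v_oc               # assumed mapping: duty cycle -> PV operating voltage
        i = max(0.0, self._pv_current(v))
        return v, i, v * i

    def reset(self):
        self.duty = 0.5
        v, i, p = self._observe()
        self.prev_power = p
        return (round(i / self.i_sc, 2), round(v / self.v_oc, 2))

    def step(self, action_index):
        self.duty = float(np.clip(self.duty + self.actions[action_index], 0.0, 1.0))
        v, i, p = self._observe()
        reward = p - self.prev_power            # placeholder reward: increase in output power
        self.prev_power = p
        state = (round(i / self.i_sc, 2), round(v / self.v_oc, 2))
        return state, reward, False, {}
```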
S2: continuously interacting with the environment by using the environment's feedback signal to the reinforcement learning agent, adjusting and improving the intelligent decision-making behavior, and obtaining an optimal tracking strategy. The agent deciding an optimal tracking strategy means tracking the maximum power point of photovoltaic power generation in a constantly changing environment: when the temperature and the illumination intensity in the photovoltaic environment change, the output power at the original operating point drops; by adjusting the duty cycle, the position of the operating point is adjusted and the output power of the photovoltaic system is finally raised. When the output power of the photovoltaic system reaches its maximum, the position of the operating point is called the maximum power point and the corresponding duty cycle is the optimal duty cycle; the optimal tracking strategy is the process by which the agent can track the optimal duty cycle at the fastest speed. The agent is obtained after each round of training; each interaction between the agent and the environment yields a different strategy, and the optimal tracking strategy is finally trained out through the reward of each round of training.
S3: the agent, by interacting with the environment, decides the optimal energy storage scheduling strategy (i.e. the optimal tracking strategy) and tracks the maximum power point of photovoltaic power generation in a constantly changing environment. The scheduling strategy can be understood as follows: the system provides a set of current and voltage to the load, but at that current and voltage the output power of the system has not reached its maximum; the most suitable current and voltage must then be tracked so as to reach the maximum power, and this tracking process is called the scheduling strategy.
The photovoltaic model in this embodiment comprises a photovoltaic power supply, a DC/DC direct-current buck converter, a resistive load and the like. The structure of the equivalent circuit of the photovoltaic model is shown in FIG. 1. The decision module of the DC/DC buck converter is configured as the reinforcement learning agent structure and serves to track the maximum power point. According to the dynamic change of the power, the change of the power per unit time is evaluated, and the scheduling strategy adjusts and tracks the maximum power point of photovoltaic power generation through the controller.

The scheduling strategy is specifically as follows: when the temperature and the illumination intensity in the photovoltaic environment change, the output power at the original operating point drops; by adjusting the duty cycle the output power of the photovoltaic system is finally raised; when the output power of the photovoltaic system reaches its maximum, the position of the operating point is called the maximum power point and the corresponding duty cycle is the optimal duty cycle. This process is called the scheduling strategy, and the controller then tracks the maximum power point of photovoltaic power generation on the basis of the scheduling strategy.

When the temperature and the illumination intensity in the photovoltaic environment change, the controller acts on the basis of the scheduling strategy: the output power at the original operating point drops, the controller adjusts the position of the operating point by adjusting the duty cycle so as to raise the output power of the photovoltaic system, and the optimal scheduling strategy can track the maximum power point in the shortest time.
Reinforcement learning is a goal-oriented intelligent method: the learner learns from the consequences of its actions without being told which actions to take. A reinforcement learning method mainly consists of an agent and an environment; a task target is set, the agent interacts with the environment through actions, the reinforcement learning algorithm modifies the agent's action strategy using the data of the interaction between the agent and the environment, and the optimal action strategy for the task is finally obtained after repeated iterative learning.
Referring to FIG. 2, a flow chart of reinforcement learning is shown. The basic idea of the training model for reinforcement learning is quite simple: the tuple (s_t, a_t, r_{t+1}, s_{t+1}) constructed from (state, action, reward, next state) is a training sample, where s_t is the current state, a_t is the action performed in the current state, r_{t+1} is the immediate reward obtained after performing the action, and s_{t+1} is the next state. The learning target of the Q-value network is r_{t+1} + γ·max_a Q(s_{t+1}, a); the objective function is the reward earned by the current action plus the maximum expected value obtainable in the next step, where r_{t+1} is the reward of the next stage, γ is the discount factor and Q(s_{t+1}, a) is the Q value. The maximum expected value of the next step is multiplied by the discount factor γ to assess the impact of future rewards on the current state, and γ ∈ [0, 1] is set according to the importance of future rewards in learning.

The iterative process of the Q-value network is expressed as:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α is the learning rate.
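The update above can be read directly as a tabular operation. The short Python sketch below is an illustrative implementation of this iterative step, assuming a dictionary-of-dictionaries Q table; the learning rate value alpha is an assumption, since it is not stated in the text.

```python
from collections import defaultdict

def q_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One iteration: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next][a2] for a2 in actions)   # maximum expected value of the next step
    td_target = r_next + gamma * best_next             # reward of the current action + discounted maximum
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

# Q table with a default value of 0 for unseen state-action pairs
Q = defaultdict(lambda: defaultdict(float))
```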
specifically, the reinforcement learning model training proposed in this embodiment is described as follows:
S21: firstly, in the reinforcement learning process, the agent is in one state at every moment and acts according to the value of the current state and its historical strategy. The agent then obtains a new observation of the environment and a return from the environment, learns from the new observation and takes a new action. This cycle is repeated until the optimal strategy is finally obtained. At each time step the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t. The agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A:

s_{t+1} = A(s_t, a_t),  A: S × A → S

The environment then returns the reward according to the current state s_t, the performed action a_t and the next state s_{t+1} through the reward function R:

r_t = R(s_t, a_t, s_{t+1}),  R: S × A × S → ℝ

S22: secondly, after the agent takes an action in a certain state s, the accumulated return

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

is defined to measure the value of acting in the state s.

Q^h(s, a) is called the state-action value function,

Q^h(s, a) = E[ G_t | s_t = s, a_t = a ],

and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a.

Q*(s, a) is defined as the maximum state-action value function over all policies,

Q*(s, a) = max_h Q^h(s, a).

If Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a):

G_t* = arg max_a Q*(s, a).
the characteristics of the photovoltaic power generation are described as follows:
The maximum power point of photovoltaic power generation is the peak on the I-V characteristic curve of the photovoltaic array; when the operating point is located at this point, the photovoltaic array reaches its maximum energy conversion efficiency and the power generated by the photovoltaic power supply is the largest. The photovoltaic array I-V curve (current-voltage curve) under constant environmental conditions is formed by the current I_PV and the voltage V_PV output by the photovoltaic array at any moment; the resulting I-V curve of a conventional photovoltaic power supply is shown in FIG. 3.
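For readers who want to reproduce a curve of the kind shown in FIG. 3, the following Python sketch generates an illustrative I-V and P-V characteristic with a simple single-diode model; all parameter values here (diode ideality factor, number of cells, temperature) are assumptions for demonstration and are not taken from the patent.

```python
import numpy as np

def pv_current(v, i_sc=8.0, v_oc=37.0, n=1.3, n_cells=60, t_cell=298.15):
    """Current of an illustrative PV string at terminal voltage v (series/shunt resistances neglected)."""
    k, q = 1.380649e-23, 1.602176634e-19
    v_t = n_cells * n * k * t_cell / q                 # thermal voltage of the whole string
    i_0 = i_sc / (np.exp(v_oc / v_t) - 1.0)            # saturation current chosen so that I(V_oc) = 0
    return i_sc - i_0 * (np.exp(v / v_t) - 1.0)

v = np.linspace(0.0, 37.0, 500)
i = np.clip(pv_current(v), 0.0, None)
p = v * i
v_mpp = v[np.argmax(p)]                                # voltage of the maximum power point on the P-V curve
```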
Further, the state space in reinforcement learning is described as follows:
For the problem of tracking the maximum power point of photovoltaic power generation, the control strategy is applicable to any photovoltaic power supply. The control method here uses the following three basic states s_t to define S ∈ (I_PV; V_PV; Deg). The first two state variables I_PV and V_PV are the current and voltage of the operating point of the photovoltaic array after normalization and discretization. The state parameters I_PV and V_PV are therefore normalized to the short-circuit current I_SC and the open-circuit voltage U_OC of the photovoltaic power supply, respectively.

A third state quantity Deg is set at the same time, whose definition satisfies the following:

when the parameter Deg becomes zero, the maximum power point is reached. When the parameter Deg is negative, the operating point is to the left of the MPP; when the parameter Deg is positive, the operating point is to the right of the MPP. This variable therefore provides a clear separation between the operating point and the MPP for different operating conditions.
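A possible construction of this state tuple is sketched below in Python. The exact formula for Deg is only given as an equation image in the original, so here its sign is derived from the change of power with voltage between two successive operating points (an assumption that merely reproduces the sign behavior described above); the number of discretization levels is likewise an assumption.

```python
import numpy as np

def make_state(i_pv, v_pv, i_sc, u_oc, prev_v, prev_p, levels=20):
    """Build the state tuple (normalised I_PV, normalised V_PV, Deg) for the Q table."""
    i_norm = round(np.clip(i_pv / i_sc, 0.0, 1.0) * levels) / levels   # normalised, discretised current
    v_norm = round(np.clip(v_pv / u_oc, 0.0, 1.0) * levels) / levels   # normalised, discretised voltage
    p = i_pv * v_pv
    dp, dv = p - prev_p, v_pv - prev_v
    slope = dp / dv if abs(dv) > 1e-9 else 0.0
    deg = -float(np.sign(slope))     # 0 at the MPP, negative to its left, positive to its right
    return (i_norm, v_norm, deg)
```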
The following describes the motion space in reinforcement learning:
For the photovoltaic power generation maximum power point tracking control problem, the action a_t is set to the duty cycle of the DC/DC converter. The duty cycle has different optimal values for different operating conditions and photovoltaic power supplies and varies between 0 and 1. To ensure the computational efficiency of the algorithm, a discrete finite action space set A is defined here, comprising positive, negative and zero changes and characterizing the change of the duty cycle: A = {a_1, a_2, ..., a_n}.
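As an illustration, one possible discrete action set and its application to the converter duty cycle is sketched below; the number of actions and the step sizes are assumptions chosen for the example, not values from the patent.

```python
# Positive, negative and zero changes of the duty cycle
ACTIONS = (-0.05, -0.01, 0.0, +0.01, +0.05)

def apply_action(duty, action_index):
    """Return the new duty cycle of the DC/DC converter, kept inside [0, 1]."""
    return min(1.0, max(0.0, duty + ACTIONS[action_index]))
```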
The reward function in reinforcement learning is illustrated as follows:
For the photovoltaic power generation maximum power point tracking control problem, constraining the agent by setting a reward function and a penalty function can effectively speed up the convergence of the algorithm and guarantee the accuracy after convergence. The following reward function is therefore used: r_t = r(a_t, d_t) = r+(a_t, d_t) + r-(a_t, d_t), where a_t represents the action value, d_t represents the load demand, r+(a_t, d_t) represents the reward for meeting the user load demand, and r-(a_t, d_t) represents the penalty for failing to meet the load demand.
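The text does not spell out the exact forms of r+ and r-, so the sketch below only illustrates the structure of such a reward: a positive term for the part of the load demand that is covered and a penalty for the part that is not. The weights w_p and w_n and the use of delivered power as a proxy for the effect of the action a_t are assumptions made for the example.

```python
def reward(power_delivered, load_demand, w_p=0.9, w_n=0.9):
    """r_t = r_plus + r_minus: reward for covering the demand, penalty for the uncovered part."""
    r_plus = w_p * min(power_delivered, load_demand)           # reward for meeting the user load demand
    r_minus = -w_n * max(0.0, load_demand - power_delivered)   # penalty for failing to meet the demand
    return r_plus + r_minus
```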
The Q value algorithm in reinforcement learning is explained as follows:
The Q-learning algorithm is a classic strategy algorithm in reinforcement learning. Its flow is as follows: first, the agent builds a Q-value table by exploring the environment and obtains the rewards fed back by the environment through continuous interaction with it, so that Q values corresponding to state-action pairs are formed in the Q-value table; the values in the Q-value table are then iteratively modified by the Q-value update rule, the probability of selecting actions with positive rewards keeps increasing while the probability of the corresponding actions with negative rewards keeps decreasing, and, with the continuous interactive screening against the environment and the change of the action-strategy set, the agent's actions finally tend to the optimal action set. However, the agent cannot always be in a known environment, so an ε-greedy strategy is introduced:

π(a|s) = 1 − ε + ε/|A(s)|, if a = arg max_a Q(s, a)
π(a|s) = ε/|A(s)|, otherwise

where ε is the exploration probability and |A(s)| is the number of available actions in state s. The ε-greedy strategy is the most basic and most commonly used strategy in reinforcement learning; the formula means that the action that maximizes the action-value function is selected with probability 1 − ε + ε/|A(s)|, while every other action is selected with equal probability ε/|A(s)|.

The ε-greedy strategy is introduced to balance the exploitation and exploration of the agent in known and unknown environments: selecting the action with the largest action-value function is exploitation, and selecting any other non-optimal action with some probability is exploration. The update rule of Q-learning is:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
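A minimal Python sketch of this ε-greedy selection is given below (it pairs naturally with the q_update sketch shown earlier); drawing a uniformly random action with probability ε reproduces the probabilities 1 − ε + ε/|A(s)| for the greedy action and ε/|A(s)| for every other action. The default value of ε here is an arbitrary placeholder.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, otherwise exploit the Q table."""
    if random.random() < epsilon:
        return random.randrange(n_actions)              # exploration: any action with equal probability
    values = [Q[state][a] for a in range(n_actions)]
    return values.index(max(values))                    # exploitation: action that maximises Q(s, a)
```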
In this embodiment the state quantities are the current, the voltage and the angle, the control quantity is the duty cycle, and the reward function is set as a limit on the change of current and voltage. Because the control quantity is the duty cycle, the current and voltage are influenced by controlling how the duty cycle changes, so that the maximum power point can finally be tracked. In the conventional methods the control strategy is both deterministic and fixed, which gives the conventional control strategy poor versatility: when conditions change, the corresponding step size of the control has to be reset. The control strategy proposed in this embodiment is a Q-value learning method based on reinforcement learning and belongs to the artificial intelligence methods. Compared with the traditional methods, this embodiment provides a self-learning method: when the environment changes, the constructed agent explores the new environment under the constraint of the reward function, the Q value obtained in each exploration is recorded and placed in the Q-value table, and finally, after the agent has covered the whole environment, the optimal strategy, i.e. the control strategy of this embodiment, is obtained by selecting the action with the largest Q value from the complete Q table.
Example 2
Referring to the schematic diagrams of FIG. 6 and FIG. 7, the present embodiment provides a photovoltaic power generation maximum power point tracking control apparatus; the method of the above embodiment is implemented by means of this apparatus. The apparatus specifically comprises a photovoltaic model module 100, where the photovoltaic model module 100 comprises a photovoltaic power supply, a DC/DC buck converter and a resistive load; and a reinforcement learning module 200, which comprises an agent interacting with the environment and further comprises a state space model module 201, an action space model module 202, a reward function model module 203 and a Q-value algorithm model module 204, which are respectively used for configuring the state space model, the action space model, the reward function model and the Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
More specifically, the photovoltaic power generation maximum power point tracking structure based on reinforcement learning is shown in fig. 6, and includes a photovoltaic power generation maximum power point tracking control strategy method based on reinforcement learning and a photovoltaic cell model, which are used for tracking a maximum power point.
The method comprises finding the optimal tracking strategy for the maximum power point based on the Q-value network (i.e. the reinforcement learning network) of the reinforcement-learning-based photovoltaic power generation maximum power point tracking control strategy.
The Q value network comprises the following specific working steps:
Firstly, in the reinforcement learning process, the agent is in one state at every moment and acts according to the value of the current state and its historical strategy. The agent then obtains a new observation of the environment and a return from the environment, learns from the new observation and takes a new action. This cycle is repeated until the optimal strategy is finally obtained. At each time step the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t. The agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A: s_{t+1} = A(s_t, a_t).

The environment then returns the reward through the reward function R according to the current state s_t, the performed action a_t and the next state s_{t+1}.

Secondly, after the agent takes an action in a certain state s, the accumulated return Q^h(s, a) is defined to measure the value of acting in the state s; Q^h(s, a) is called the state-action value function and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a. Q*(s, a) is defined as the maximum state-action value function over all policies; if Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a).
Example 3
In order to evaluate the effectiveness and accuracy of the control strategy proposed in this embodiment, simulations are carried out respectively under fixed environmental conditions (illumination intensity and temperature), under changing environmental conditions, and so on, and the effectiveness and accuracy of the proposed control strategy are verified. In this embodiment, a control method for tracking the maximum power point based on improved Q-value reinforcement learning is implemented in Python on the basis of the OpenAI Gym design and aims to find the MPP quickly. In this set of simulations, a photovoltaic power supply with an open-circuit voltage V_oc of 37 V and a short-circuit current I_SC of 8 A is used, and simulations are performed under a NOT environment (temperature 25 °C, irradiance 1000 W/m²) and under an STC condition (temperature 47 °C, irradiance 800 W/m²), respectively, to verify the effectiveness of the proposed control strategy under constant temperature and irradiance.

In this simulation a hardware controller is considered (the system provides a set of current and voltage to the load, but at that current and voltage the output power of the system has not reached its maximum; the most suitable current and voltage must then be tracked to reach the maximum power, and this tracking process is called the scheduling strategy, so the system is adjusted by the controller and the adjustment tracks the maximum power point of photovoltaic power generation according to the scheduling strategy), and the control period is set to T = 0.01 s. In the reinforcement learning algorithm, W_p = 0.9, W_n = 0.9, K_D = 1000, K_C = 1000, γ = 0.9 and ε = 0.8 are set. Since the environment is unknown at first, the agent explores at the beginning. The random oscillation of the early power error reflects the exploration process of the algorithm: the agent randomly selects actions to explore the environment and builds the corresponding Q-value (action-value pair) table; once the agent has fully explored the environment space the algorithm converges, and the agent can effectively track the maximum power point according to the largest Q value. The two simulations verify the effectiveness of the proposed algorithm in tracking the MPP under constant temperature and illumination intensity, as follows.

The trajectory of the power control strategy during photovoltaic power generation (FIG. 4) shows that, in the NOT environment (temperature 25 °C, irradiance 1000 W/m²), the proposed agent converges after 198 iterations of training (1.98 s). As shown in FIG. 4, because the environment is initially unknown, the agent explores first; this stage appears as an oscillating curve. The random oscillation reflects the exploration process of the algorithm, during which the agent randomly selects actions to explore the environment and builds the corresponding Q-value (action-value pair) table, until the algorithm converges once the environment space has been fully explored and the agent can effectively track the maximum power point according to the largest Q value. The exploration curve therefore turns into small-amplitude oscillation after 1.98 s, indicating that the agent can track the maximum power point with the strategy it has obtained through self-learning, which proves the effectiveness of the proposed control strategy in the NOT environment.

The trajectory of the power control strategy during photovoltaic power generation (FIG. 5) shows that, under STC (temperature 47 °C, irradiance 800 W/m²), the agent of the proposed strategy converges after 280 iterations of training (2.8 s). As shown in FIG. 5, the agent first explores randomly, then builds the corresponding Q-value table, and finally obtains the optimal strategy from the Q-value table and successfully tracks the maximum power point under that optimal strategy; that the agent has successfully tracked the maximum power point is shown by the power difference (the difference between the current power and the maximum power) tending to zero, which proves the effectiveness of the proposed method in the STC environment.
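To make the simulation setup concrete, the sketch below shows what a training loop of this kind could look like in Python, reusing the MPPTEnv, epsilon_greedy and q_update sketches from the earlier sections. The episode count, the number of steps per episode and the ε-decay schedule are illustrative assumptions, not the exact configuration used in the patent.

```python
from collections import defaultdict

def train(env, n_episodes=300, steps_per_episode=50,
          alpha=0.1, gamma=0.9, epsilon=0.8, n_actions=3):
    """Tabular Q-learning loop: explore, fill the Q table, then exploit the largest Q values."""
    Q = defaultdict(lambda: defaultdict(float))
    actions = range(n_actions)
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            a = epsilon_greedy(Q, s, n_actions, epsilon)
            s_next, r, done, _ = env.step(a)
            q_update(Q, s, a, r, s_next, actions, alpha, gamma)
            s = s_next
            if done:
                break
        epsilon = max(0.05, epsilon * 0.99)   # gradually shift from exploration to exploitation
    return Q

# Example usage: Q = train(MPPTEnv())
```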
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein. A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. A photovoltaic power generation maximum power point tracking control method, characterized by comprising the following steps:
intelligently tracking a photovoltaic power generation maximum power point in a photovoltaic model;
continuously interacting with the environment by using a feedback signal of an agent in reinforcement learning, adjusting and improving intelligent decision-making behaviors, and obtaining an optimal tracking strategy;
and tracking the maximum power point of the photovoltaic power generation in a constantly changing environment according to the optimal tracking strategy decided by the intelligent agent.
2. The photovoltaic power generation maximum power point tracking control method according to claim 1, characterized in that: the intelligent tracking includes the steps of,
modeling and describing a photovoltaic power generation maximum power point tracking process as a Markov decision process;
and constructing, based on the Markov decision process, a reinforcement learning algorithm for tracking the maximum power point of photovoltaic power generation, wherein the reinforcement learning algorithm comprises an environment model, an action space model, a reward function model and a Q-value algorithm model, so as to realize intelligent tracking of the maximum power point of photovoltaic power generation.
3. The photovoltaic power generation maximum power point tracking control method according to claim 2, characterized in that: the photovoltaic model comprises a photovoltaic power supply, a DC/DC direct-current buck converter and a resistive load;

the decision module of the DC/DC direct-current buck converter is configured as the reinforcement learning agent structure and is used for tracking the maximum power point; according to the dynamic change of the power, the change of the power per unit time is evaluated, and the scheduling strategy adjusts and tracks the maximum power point of photovoltaic power generation through the controller.
4. The photovoltaic power generation maximum power point tracking control method according to claim 2 or 3, characterized in that: the reinforcement learning includes the steps of,
setting a corresponding task target;
the intelligent agent interacts with the environment through actions;
the reinforcement learning algorithm utilizes the interaction data of the agent and the environment to modify the action strategy of the agent;
and finally obtaining the optimal action strategy of the corresponding task after iterative learning for a plurality of times.
5. The photovoltaic power generation maximum power point tracking control method according to claim 4, characterized in that: the training model of the reinforcement learning algorithm includes,
the tuple (s_t, a_t, r_{t+1}, s_{t+1}) formed by the state, the action, the reward and the next state as a sample;

s_t is the current state, a_t is the action performed in the current state, r_{t+1} is the immediate reward obtained after performing the action, and s_{t+1} is the next state;

the learning target of the Q-value network is defined as r_{t+1} + γ·max_a Q(s_{t+1}, a); the objective function is the reward earned by the current action plus the maximum expected value obtainable in the next step, where r_{t+1} is the reward of the next stage, γ is the discount factor and Q(s_{t+1}, a) is the Q value;

the maximum expected value of the next step is multiplied by the discount factor γ to evaluate the influence of future rewards on the current state;

γ ∈ [0, 1] is set according to the importance of future rewards in learning;

the iterative process of the Q-value network is expressed as:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α is the learning rate.
6. The photovoltaic power generation maximum power point tracking control method according to claim 5, characterized in that: the agent is configured such that,

at each time step, the environment quantities observed by the agent comprise the state s_t, the action a_t and the reward function r_t;

the agent, in the current state s_t, takes the action a_t and transfers to the next state s_{t+1} through the action function A:

s_{t+1} = A(s_t, a_t),  A: S × A → S

the environment returns the reward according to the current state s_t, the performed action a_t and the next state s_{t+1} through the reward function R:

r_t = R(s_t, a_t, s_{t+1}),  R: S × A × S → ℝ

after the agent takes an action in a certain state s, the accumulated return

G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

is defined to measure the value of acting in the state s;

Q^h(s, a) is called the state-action value function,

Q^h(s, a) = E[ G_t | s_t = s, a_t = a ],

and represents the value of the agent making the corresponding strategy in a certain state s with a certain action a;

Q*(s, a) is defined as the maximum state-action value function over all policies,

Q*(s, a) = max_h Q^h(s, a);

if Q*(s, a) is known, the optimal strategy G_t* is determined by directly maximizing Q*(s, a):

G_t* = arg max_a Q*(s, a).
7. the photovoltaic power generation maximum power point tracking control method according to claim 5 or 6, characterized in that: the state space model comprises a model of the state space,
the state s_t is defined by the following three basic quantities, s_t ∈ S = (I_PV; V_PV; Deg); the first two state variables I_PV and V_PV represent the normalized and discretized current and voltage of the photovoltaic array operating point;
the state parameters I_PV and V_PV are normalized to the short-circuit current I_SC and the open-circuit voltage U_OC of the photovoltaic power supply, respectively; meanwhile, a third state quantity Deg is set, which is defined as follows:
[the defining formula of Deg appears as image FDA0002393628300000028 in the original and is not reproduced here]
when the parameter Deg becomes zero, the maximum power point is reached;
when the parameter Deg is negative, the operating point is characterized as lying to the left of the MPP;
when the parameter Deg is positive, the operating point is characterized as lying to the right of the MPP.
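A minimal sketch of building the discretized state of this claim; the bin count is an assumption, and the Deg value is taken from the patent's own (unreproduced) formula rather than computed here.

    import numpy as np

    def make_state(i_pv, v_pv, deg, i_sc, u_oc, bins=20):
        i_norm = np.clip(i_pv / i_sc, 0.0, 1.0)      # normalize current by the short-circuit current
        v_norm = np.clip(v_pv / u_oc, 0.0, 1.0)      # normalize voltage by the open-circuit voltage
        i_idx = min(int(i_norm * bins), bins - 1)     # discretize onto a finite grid
        v_idx = min(int(v_norm * bins), bins - 1)
        return (i_idx, v_idx, int(np.sign(deg)))      # sign of Deg: -1 left of MPP, 0 at MPP, +1 right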
8. The photovoltaic power generation maximum power point tracking control method according to claim 7, characterized in that:
the action space model comprises,
the action a_t is set to act on the duty cycle of the DC/DC converter, which has different optimal values for different operating conditions and photovoltaic power sources and varies between 0 and 1;
to ensure the computational efficiency of the algorithm, a discrete finite action space set A = {a_1, a_2, ..., a_n} is defined, comprising positive, negative and zero changes, each element characterizing a change in the duty cycle;
the reward function model comprises,
the following reward function is used: r_t = r(a_t, d_t) = r+(a_t, d_t) + r-(a_t, d_t), where a_t represents the action value, d_t represents the load demand, r+(a_t, d_t) represents the reward for meeting the user load demand, and r-(a_t, d_t) represents the penalty for failing to meet the load demand.
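A minimal sketch of the discrete duty-cycle action set and a reward combining a positive and a negative term; the concrete step sizes and the piecewise reward terms are illustrative assumptions, since the patent only states r_t = r+(a_t, d_t) + r-(a_t, d_t).

    ACTIONS = [-0.05, -0.01, 0.0, 0.01, 0.05]   # candidate duty-cycle changes: negative, zero and positive steps

    def reward(power_supplied, load_demand):
        if power_supplied >= load_demand:
            r_plus, r_minus = power_supplied, 0.0                    # reward term when the user load demand is met
        else:
            r_plus, r_minus = 0.0, power_supplied - load_demand      # negative penalty term for the shortfall
        return r_plus + r_minus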
9. The photovoltaic power generation maximum power point tracking control method according to claim 8, characterized in that: the Q-value algorithm model comprises the following components,
the agent cannot always be in a known environment, so an epsilon-greedy strategy is introduced:
π(a|s) = 1 − ε + ε/|A(s)|, if a = argmax_{a'} Q(s, a'); π(a|s) = ε/|A(s)|, otherwise;
where ε is the exploration probability and |A(s)| is the number of candidate actions in state s;
the ε-greedy strategy is the most basic and most common strategy in reinforcement learning; the meaning of the formula is that the probability of selecting the action which maximizes the action value function is 1 − ε + ε/|A(s)|, and each of the other actions is selected with the equal probability ε/|A(s)|;
the ε-greedy strategy is introduced to balance exploitation and exploration of the agent between the known and the unknown environment: selecting the action with the maximum action value function is exploitation, while selecting any other non-optimal action with small probability is exploration;
the updating rule of Q-learning is as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α·[ r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t) ], where α is the learning rate.
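A minimal sketch of the ε-greedy selection and the tabular Q-learning update; the learning rate value, the 2-D table layout and the integer state index are assumptions, and rng can be created with np.random.default_rng().

    import numpy as np

    def epsilon_greedy(q_table, state, epsilon, rng):
        n_actions = q_table.shape[1]
        if rng.random() < epsilon:                    # explore: pick any action with equal probability
            return int(rng.integers(n_actions))
        return int(np.argmax(q_table[state]))         # exploit: action maximizing Q(s, a)

    def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
        td_target = r + gamma * np.max(q_table[s_next])        # r_{t+1} + gamma * max_a Q(s_{t+1}, a)
        q_table[s, a] += alpha * (td_target - q_table[s, a])   # move Q(s_t, a_t) toward the target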
10. A photovoltaic power generation maximum power point tracking control device, characterized by comprising:
a photovoltaic model module (100), the photovoltaic model module (100) comprising a photovoltaic power source, a DC/DC direct current buck converter and a resistive load;
a reinforcement learning module (200), the reinforcement learning module (200) comprising an agent interacting with the environment, the agent further comprising a state space model module (201), an action space model module (202), a reward function model module (203) and a Q-value algorithm model module (204), which are respectively used to configure the state space model, the action space model, the reward function model and the Q-value algorithm model so as to realize intelligent tracking of the maximum power point of the photovoltaic power generation.
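A minimal structural sketch of the device modules listed in claim 10; the class and attribute names are illustrative, not the patented implementation.

    from dataclasses import dataclass

    @dataclass
    class PhotovoltaicModelModule:          # module (100)
        pv_source: object
        dc_dc_buck_converter: object
        resistive_load: object

    @dataclass
    class ReinforcementLearningModule:      # module (200)
        state_space_model: object           # module (201)
        action_space_model: object          # module (202)
        reward_function_model: object       # module (203)
        q_value_algorithm_model: object     # module (204)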
CN202010123212.4A 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method Active CN111324167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010123212.4A CN111324167B (en) 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010123212.4A CN111324167B (en) 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method

Publications (2)

Publication Number Publication Date
CN111324167A true CN111324167A (en) 2020-06-23
CN111324167B CN111324167B (en) 2022-07-01

Family

ID=71169004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010123212.4A Active CN111324167B (en) 2020-02-27 2020-02-27 Photovoltaic power generation maximum power point tracking control method

Country Status (1)

Country Link
CN (1) CN111324167B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001585A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-agent decision method and device, electronic equipment and storage medium
CN114967821A (en) * 2022-03-29 2022-08-30 武汉城市职业学院 Photovoltaic power generation system maximum power tracking control method based on reinforcement learning
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106487011A (en) * 2016-11-28 2017-03-08 东南大学 A kind of based on the family of Q study microgrid energy optimization method
CN109443366A (en) * 2018-12-20 2019-03-08 北京航空航天大学 A kind of unmanned aerial vehicle group paths planning method based on improvement Q learning algorithm
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN110399006A (en) * 2019-08-28 2019-11-01 江苏提米智能科技有限公司 Two-sided photovoltaic module maximum generating watt angle control method based on big data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106487011A (en) * 2016-11-28 2017-03-08 东南大学 A kind of based on the family of Q study microgrid energy optimization method
CN109443366A (en) * 2018-12-20 2019-03-08 北京航空航天大学 A kind of unmanned aerial vehicle group paths planning method based on improvement Q learning algorithm
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN110399006A (en) * 2019-08-28 2019-11-01 江苏提米智能科技有限公司 Two-sided photovoltaic module maximum generating watt angle control method based on big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUAN-YU CHOU等: "Maximum Power Point Tracking of Photovoltaic System Based on Reinforcement Learning", 《2019 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - TAIWAN (ICCE-TW)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001585A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-agent decision method and device, electronic equipment and storage medium
CN112001585B (en) * 2020-07-14 2023-09-22 北京百度网讯科技有限公司 Multi-agent decision method, device, electronic equipment and storage medium
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method
CN114967821A (en) * 2022-03-29 2022-08-30 武汉城市职业学院 Photovoltaic power generation system maximum power tracking control method based on reinforcement learning

Also Published As

Publication number Publication date
CN111324167B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN111324167B (en) Photovoltaic power generation maximum power point tracking control method
Qiang et al. Reinforcement learning model, algorithms and its application
Ekinci et al. An effective control design approach based on novel enhanced aquila optimizer for automatic voltage regulator
Ghamari et al. Fractional‐order fuzzy PID controller design on buck converter with antlion optimization algorithm
Daniel et al. Active reward learning with a novel acquisition function
Zhang et al. Reinforcement Learning in Robot Path Optimization.
CN115470704B (en) Dynamic multi-objective optimization method, device, equipment and computer readable medium
Yin et al. Multi-step depth model predictive control for photovoltaic power systems based on maximum power point tracking techniques
Kubalík et al. Optimal control via reinforcement learning with symbolic policy approximation
Luo et al. An adaptive adjustment strategy for bolt posture errors based on an improved reinforcement learning algorithm
Nagy et al. Reinforcement learning for intelligent environments: A Tutorial
CN109635915A (en) A kind of iterative learning control method based on balanced single evolution cuckoo algorithm
CN113885328A (en) Nuclear power tracking control method based on integral reinforcement learning
Yu et al. A robust method based on reinforcement learning and differential evolution for the optimal photovoltaic parameter extraction
CN114139778A (en) Wind turbine generator power prediction modeling method and device
Kumar et al. An adaptive particle swarm optimization algorithm for robust trajectory tracking of a class of under actuated system
Bolland et al. Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent
Arshad et al. Deep Deterministic Policy Gradient to Regulate Feedback Control Systems Using Reinforcement Learning.
CN111222718A (en) Maximum power point tracking method and device of wind energy conversion system
WO2022248720A1 (en) Multi-objective reinforcement learning using weighted policy projection
Yang et al. PMDRL: Pareto-front-based multi-objective deep reinforcement learning
Dey et al. Reinforcement Learning Building Control: An Online Approach with Guided Exploration using Surrogate Models
Zhang et al. Using partial-policy q-learning to plan path for robot navigation in unknown enviroment
Korich et al. Shifting of Nonlinear Phenomenon in the Boost converter Using Aquila Optimizer
Hu et al. Q-value Regularized Transformer for Offline Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant