CN117151928A

CN117151928A - Power saving calculation method and device combined with reinforcement learning

Info

Publication number: CN117151928A
Application number: CN202311143879.0A
Authority: CN
Inventors: 刘姚; 陈嘉诺; 孙启文
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2023-09-05
Filing date: 2023-09-05
Publication date: 2023-12-01

Abstract

The embodiment of the specification provides a power saving calculation method and device combined with reinforcement learning, wherein the method comprises the following steps: defining the definition, state, action, rewards and strategy of the reinforcement learning algorithm; optimizing an electric appliance control strategy through a reinforcement learning algorithm; and controlling the opening or closing of the electric appliance to save electricity through the optimized electric appliance control strategy.

Description

Power saving calculation method and device combined with reinforcement learning

Technical Field

The present document relates to the field of electrical technology, and in particular, to a power saving calculation method and apparatus combined with reinforcement learning.

Background

In the practical school electricity-saving application scenario, the activity time of students and teachers is not completely regular, and if we only set the on-off state of the electric appliance based on the curriculum schedule and the camera information, the practical requirement may not be met. For example, it is not enough for an et al to turn on the appliance. Therefore, how to reduce the power consumption as much as possible under the premise of ensuring the use requirement of the electric appliance is a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a power saving calculation method and device combined with reinforcement learning, and aims to solve the problems in the prior art.

The invention provides a power saving calculation method combined with reinforcement learning, which comprises the following steps:

defining the definition, state, action, rewards and strategy of the reinforcement learning algorithm;

optimizing an electric appliance control strategy through a reinforcement learning algorithm;

and controlling the opening or closing of the electric appliance to save electricity through the optimized electric appliance control strategy.

The invention provides a power-saving computing device combined with reinforcement learning, which comprises:

the definition module is used for defining the definition, state, action, rewards and strategies of the reinforcement learning algorithm;

the optimizing module is used for optimizing the electrical appliance control strategy through a reinforcement learning algorithm;

and the control module is used for controlling the on or off of the electric appliance to save electricity through the optimized electric appliance control strategy.

By adopting the embodiment of the invention, the control strategy of the electric appliance is optimized by adopting the reinforcement learning method, so that the electric appliance reduces the electricity consumption as much as possible on the premise of ensuring the use requirement.

Drawings

For a clearer description of one or more embodiments of the present description or of the solutions of the prior art, the drawings that are necessary for the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description that follow are only some of the embodiments described in the description, from which, for a person skilled in the art, other drawings can be obtained without inventive faculty.

FIG. 1 is a flow chart of a power saving computing method incorporating reinforcement learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a power-saving computing device incorporating reinforcement learning according to an embodiment of the present invention.

Detailed Description

In order to enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, the technical solutions in one or more embodiments of the present specification will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one or more embodiments of the present disclosure without inventive faculty, are intended to be within the scope of the present disclosure.

Method embodiment

According to an embodiment of the present invention, there is provided a power saving calculation method combined with reinforcement learning, and fig. 1 is a flowchart of the power saving calculation method combined with reinforcement learning according to an embodiment of the present invention, as shown in fig. 1, the power saving calculation method combined with reinforcement learning according to an embodiment of the present invention specifically includes:

step S101, defining the definition, state, action, rewards and strategy of the reinforcement learning algorithm; specifically, system power, the number of persons in the electricity-saving space, and activity in the electricity-saving space are defined as states, in which,

1. obtaining system power by monitoring the current and voltage of the system;

2. defining the system performance parameter adjustment as an action, wherein the system performance parameter specifically comprises: air conditioning temperature, CPU frequency and memory size;

3. defining the energy consumption reduction as positive rewards and the system performance reduction as negative rewards; the method specifically comprises the following steps:

the reward function is denoted as R (s, a), where s represents the state and a represents the system performance parameter, then the reward is defined according to equation 1:

wherein α and β are positive constants, respectively, for controlling the weights of positive and negative rewards, P _t Representing the system power at time t, gamma is a weight controlling the perceived impact of the consumer, F _t Is a teacher and student feeling at time t, and if the feeling is good, F _t ＝0Otherwise F _t ＝-1，a _t The system performance parameter at time t is indicated.

4. The method of selecting an action based on the current state is defined as a policy. The method specifically comprises the following steps:

the strategy is expressed as pi (a|s), wherein a represents a system performance parameter, s represents a state, then according to the common formula

Equation 2 defines the strategy:

where n represents the number of actions and A represents the set of actions.

Step S102, optimizing an electric appliance control strategy through a reinforcement learning algorithm; the method specifically comprises the following steps:

setting a reinforcement learning algorithm as a Q-learning algorithm, and initializing a cost function Q (s, a) of the Q-learning algorithm, wherein Q (s, a) represents the value of the selection action a in the current state;

action a is selected according to the e-greedy strategy pi, action a is performed, the prize r and the new state s' are obtained, and the cost function Q (s, a) is updated according to formula 3:

Q(s,a)←Q(s,a)+α·[r+γmax _a′ Q(s′,a′)—Q(s,a)]equation 3;

wherein, the E-greedy strategy refers to randomly selecting actions with a certain probability E, selecting the current optimal actions with a probability of 1-epsilon, wherein alpha is a learning rate, used for controlling the step length of each update, gamma is a discount factor, used for measuring the importance of future rewards, and r represents rewards;

the update state s is s'.

And step S103, controlling the on or off of the electric appliance to save electricity through the optimized electric appliance control strategy. The method specifically comprises the following steps:

initializing system power P, initializing consumer experience F, initializing system performance parameter a, initializing positive and negative reward weights α and β, initializing action set a=a ₁ ,a ₂ ,...,a _n Where n is the number of actions, initializing the cost function Q (s, a) to an arbitrary value, initializingA state of chemical s=p;

action a is selected according to the current state, action a is performed by using a strategy pi (a|s), a reward R and a new state s' are obtained, system power P is updated, user experience F is updated, a reward function R (s, a) is calculated, a cost function Q (s, a) is updated, and the state s|P is updated.

The following describes the above technical solution of the embodiment of the present invention in detail.

It is assumed that there is an electric appliance in the teaching room to be controlled, and the electricity consumption of the first electric appliance is set at the moment. Our goal is to minimize the power usage of each appliance during the day.

The day may be divided into T time segments, each time segment having a length Δt, and the total time of day is tΔt. We can divide the time of day into T times, with the jth time being tj=jΔt. The total amount of electricity used by the ith appliance in a day is:

our goal is to minimize the sum of the power usage of all appliances in a day:

obviously, this is a problem of linear programming, so we can model this problem as an optimization problem as follows:

namely, the electricity consumption of each electric appliance in one day is minimized by saving the electricity consumption of a single electric appliance as much as possible. To achieve the above function, we can embed school curriculum and camera information into our model. In the simplest terms, we can put the power state x of the appliance _i,j Two situations are distinguished: on and off, i.e. 1 and 0. If there is no course or current timeIf no person exists in the classroom, setting the corresponding state of the electrical appliance to be closed; otherwise, the device is set to be opened.

However, it is necessary to consider that in the actual application scenario, the activity time of students and teachers is not completely regular, and if the on-off state of the electric appliance is set based on only the curriculum schedule and the camera information, the actual requirement may not be satisfied. For example, it is not enough for an et al to turn on the appliance. Therefore, the control strategy of the electric appliance is optimized by adopting the reinforcement learning method, so that the electric appliance reduces the electricity consumption as much as possible on the premise of ensuring the use requirement.

1.1 reinforcement learning Algorithm

Reinforcement learning is a machine learning method that learns the best strategy by trial and error. In this problem, we can consider the control strategy of the appliance as an agent, and choose to turn on or off the appliance at each moment according to the current environment (e.g. whether someone is present, time, etc.), so as to obtain an instant prize (e.g. reduce the power consumption). By constantly interacting with the environment, the intelligent agent can learn the optimal appliance control strategy, thereby reducing the electricity consumption as much as possible on the premise of ensuring the use requirement. Under the reinforcement learning framework, we need to define concepts of states, actions, rewards, and strategies.

1.1.1 State definition

In the power saving algorithm, we take the power of the system and the number of classrooms/textbooks as states. Power may be obtained by monitoring the current and voltage of the system. Assuming that the system has power P, N people in the classroom, lessons in 10 minutes, the state s can be expressed as s= (P, N, 1).

1.1.2 behavior definition

In the power saving algorithm, the system performance parameter may be adjusted as an action, such as adjusting the air conditioning temperature, CPU frequency, memory size, etc. Assuming we express a system performance parameter as a, action set a may be expressed as a=a1, a 2.

1.1.3 prize definition

In the power saving algorithm, we can treat the decrease of energy consumption as positive rewards and the decrease of system performance as negative rewards. Assuming we denote the bonus function as R (s, a), we can define as follows:

where α and β are positive constants, respectively, that are used to control the weights of the positive and negative rewards. P (P) _t The system power at time t is indicated. Gamma is a weight for controlling influence of teachers and students, F _t Is a teacher and student feeling at time t, and if the feeling is good, F _t =0, otherwise F _t = -1. Thus, the intelligent agent can control the system power and consider the use feeling of teachers and students.

1.1.4 policy definition

In the power saving algorithm, we can refer to a method of selecting an action according to the current state as a policy. Assuming we express the policy as pi (a|s), we can define as follows:

that is, in each state, an even distribution is employed to select an action.

1.2 Algorithm

1.2.1 Q-learning algorithm

The learning algorithm is a reinforcement learning algorithm and can be used for optimizing strategies so as to achieve the aim of optimization. The basic idea of the algorithm is to constantly optimize the strategy by iteratively updating the cost function. The algorithm flow is shown in table 1:

TABLE 1

Wherein, the E-greedy strategy refers to randomly selecting actions with a certain probability E and selecting the current optimal actions with a probability of 1-E. Alpha is the learning rate used to control the step size of each update. Gamma is a discount factor used to measure the importance of future rewards.

In the power saving algorithm, the cost function of the Q-learning algorithm may be expressed as Q (s, a) =q (l, e, a), i.e., the value of the selection action a in the current state.

The strategy can be continuously optimized through continuous iteration of the Q-learning algorithm, so that the aim of minimizing energy consumption is fulfilled.

1.2.2 Power saving algorithm combining reinforcement learning

Aiming at the reinforcement learning algorithm, the learning period is longer, and the feedback period is long and the success is high in a real environment. In order to solve the problem, the embodiment of the invention adopts a simulation method, trains a model on a computer and applies the model to actual production. The specific algorithm is shown in table 2:

TABLE 2

Device embodiment

According to an embodiment of the present invention, there is provided a power saving computing device combined with reinforcement learning, and fig. 2 is a schematic diagram of the power saving computing device combined with reinforcement learning according to the embodiment of the present invention, as shown in fig. 2, the power saving computing device combined with reinforcement learning according to the embodiment of the present invention specifically includes:

a definition module 20, configured to define a definition, a state, an action, a reward, and a policy of the reinforcement learning algorithm; the definition module 20 is specifically configured to:

defining system power, the number of people in the electricity-saving space, and activity in the electricity-saving space as states, wherein the system power is obtained by monitoring the current and voltage of the system;

defining the system performance parameter adjustment as an action, wherein the system performance parameter specifically comprises: air conditioning temperature, CPU frequency and memory size;

the energy consumption reduction is defined as positive rewards and the system performance reduction is defined as negative rewards, in particular:

wherein α and β are positive constants, respectively, for controlling the weights of positive and negative rewards, P _t Representing the system power at time t, gamma is a weight controlling the perceived impact of the consumer, F _t Is a teacher and student feeling at time t, and if the feeling is good, F _t =0, otherwise F _t ＝-1，a _t Representing system performance parameters at time t;

the method of selecting an action according to the current state is defined as a policy, specifically:

the policy is expressed as pi (a|s), where a represents a system performance parameter and s represents a state, then the policy is defined according to equation 2:

where n represents the number of actions and A represents the set of actions.

An optimizing module 22, configured to optimize an appliance control strategy through a reinforcement learning algorithm; the optimization module 22 is specifically configured to:

Q(s,a)←Q(s,a)+α·[r+γmax _a′ Q(s′,a′)—Q(s,a)]equation 3;

wherein, epsilon-greedy strategy refers to randomly selecting actions with a certain probability epsilon, selecting the current optimal actions with a probability of 1 epsilon, alpha is learning rate, used for controlling step length of each update, gamma is discount factor, used for measuring importance of future rewards, and r represents rewards;

the update state s is s'.

And the control module 24 is used for controlling the on or off of the electric appliance to save electricity through the optimized electric appliance control strategy. The control module 24 is specifically configured to:

initializing system power P, initializing consumer experience F, initializing system performance parameter a, initializing positive and negative reward weights α and β, initializing action set a=a ₁ ,a ₂ ,...,a _n Where n is the number of actions, the initialization cost function Q (s, a) is an arbitrary value, the initialization state s=p;

The embodiment of the present invention is an embodiment of a device corresponding to the embodiment of the method, and specific operations of each module may be understood by referring to descriptions of the embodiment of the method, which are not repeated herein.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A power saving computing method in combination with reinforcement learning, comprising:

2. The method of claim 1, wherein defining the reinforcement learning algorithm definition, state, action, rewards, and policies specifically comprises:

defining the energy consumption reduction as positive rewards and the system performance reduction as negative rewards;

the method of selecting an action based on the current state is defined as a policy.

3. The method of claim 2, wherein defining the energy consumption reduction as a positive reward and the system performance reduction as a negative reward specifically comprises:

wherein α and β are positive constants, respectively, for controlling the weights of positive and negative rewards, P _t Representing the system power at time t, gamma is a weight controlling the perceived impact of the consumer, F _t Is at the time ofT is carved with the teacher and student feeling, if the feeling is good, F _t =0, otherwise F _t ＝-1，a _t The system performance parameter at time t is indicated.

4. The method according to claim 2, wherein defining the method of selecting an action based on the current state as a policy specifically comprises:

where n represents the number of actions and A represents the set of actions.

5. The method of claim 4, wherein optimizing the appliance control strategy by the reinforcement learning algorithm specifically comprises:

Q(s,a)←Q(s,a)+α·[r+γmax _a′ Q(s′,a′)—Q(s,a)]equation 3;

wherein, the E-greedy strategy refers to randomly selecting actions with a certain probability E, selecting the current optimal action with a probability of 1-E, wherein alpha is a learning rate, used for controlling the step length of each update, gamma is a discount factor, used for measuring the importance of future rewards, and r represents rewards;

the update state s is s'.

6. The method of claim 5, wherein controlling the appliance to turn on or off via the optimized appliance control strategy comprises:

7. A power saving computing device incorporating reinforcement learning, comprising:

8. The method according to claim 1, wherein the definition module is specifically configured to:

where n represents the number of actions and A represents the set of actions.

9. The apparatus of claim 8, wherein the optimization module is specifically configured to:

Q(s,a)←Q(s,a)+α·[r+γmax _a′ Q(s′,a′)—Q(s,a)]equation 3;

the update state s is s'.

10. The apparatus of claim 9, wherein the control module is specifically configured to: