CN110945542B - Multi-agent deep reinforcement learning agent method based on smart grid - Google Patents

Multi-agent deep reinforcement learning agent method based on smart grid

Info

Publication number
CN110945542B
CN110945542B (application CN201880000858.4A)
Authority
CN
China
Prior art keywords
agent
neural network
time
value
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201880000858.4A
Other languages
Chinese (zh)
Other versions
CN110945542A (en)
Inventor
侯韩旭
郝建业
杨耀东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan University of Technology filed Critical Dongguan University of Technology
Publication of CN110945542A publication Critical patent/CN110945542A/en
Application granted granted Critical
Publication of CN110945542B publication Critical patent/CN110945542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention is applicable to the technical field of electric power automation control and provides a multi-agent deep reinforcement learning agent method based on a smart grid, comprising the following steps: S1, calculating the corresponding action standard value in the current state according to the reward obtained by the selected action, and updating the parameters of the neural network; S2, establishing a multi-agent with 'external competition and internal cooperation' according to the types of consumers and producers; S3, setting the reward function of each internal agent according to both the profit maximization of the agent's own actions and the benefits of the other internal agents. The input layer of the neural network can accept the values of the features characterizing the state directly as input, whereas a Q-table requires the feature values to be discretized to reduce the state space.

Description

Multi-agent deep reinforcement learning agent method based on smart grid
Technical Field
The invention belongs to the technical field of electric power automation control, and particularly relates to a multi-agent deep reinforcement learning agent method based on a smart grid.
Background
Smart grid refers to the modernization of the power grid through a series of digital communication technologies [1][2]. A country's economy, national defense, and even the safety of its residents depend on the reliability of the power grid. In actual operation, a smart grid not only allows users to select appropriate power packages in real time, but also actively allocates power resources to achieve balanced power supply. The grid can adjust and give feedback to market fluctuations in real time, realizing bidirectional information exchange and comprehensive awareness of grid conditions, and is an important component of 21st-century modernization.
Previously, grid technology was designed primarily to deliver power unidirectionally from large centralized power plants to distributed consumers such as homes and industrial facilities. Recently, a popular research topic in smart grids is predicting users' power demand so that electricity prices and bidding strategies can be adjusted in advance to maximize agent income [3]. At the same time, the agent mechanism is another core of smart grid design: through agents, the smart grid coordinates local producers, local consumers, large power plants and other parties, and uses market regulation to achieve a multi-party win-win. One key problem is realizing two-way communication between consumers and small local producers of wind and solar power. Reddy et al. [4] were the first to propose using a reinforcement learning framework to design agents for local grids as a solution to this problem. A key element of a reinforcement learning framework is the state space; learning strategies from manually constructed features [4] limits the number of economic signals an agent can accommodate and also limits the agent's ability to absorb new signals when the environment changes. Reinforcement learning has been applied in e-commerce to solve many practical problems, mainly by learning optimal strategies through interaction between agents and the environment; for example, Pardoe et al. [5] proposed a data-driven approach to design electronic auctions based on reinforcement learning. In the power domain, reinforcement learning has been used to study wholesale market trading strategies [6] or to assist in building physical control systems. Examples of wholesale power applications include [7], which mainly studies bidding strategies for wholesale electricity auctions, and Ramavajjala et al. [8], who studied Next State Policy Iteration (NSPI) as an extension of Least Squares Policy Iteration (LSPI) [9] and demonstrated its benefits on the problem of day-ahead commitments for wind power generation. Physical control applications of reinforcement learning include load and frequency control of the power grid and autonomous monitoring applications, e.g. [10]. However, previous work on grid agents mostly idealizes the grid environment: on the one hand, a large number of simplifying assumptions are used to simulate the complex operating mechanism of the grid; on the other hand, the information provided by the environment is abstracted at a high level when designing algorithms, losing many important details and leading to inaccurate decisions.
On the other hand, customers in smart grids exhibit various power consumption or production patterns. This suggests that we need to formulate different pricing policies for different types of customers. Following this idea, a retail agent can be regarded as a multi-agent system in which each agent is responsible for pricing a particular class of power consumers or producers. For example, Wang et al. assign a separate pricing agent to each customer in their agent framework [23]. However, the authors use separate reinforcement learning processes for different customers and treat the profit of the whole agent as the immediate reward of each individual agent. This does not distinguish the individual contribution of each agent to the overall profit and therefore does not motivate each agent to learn its best strategy.
Reinforcement learning, unlike traditional machine learning, is a process of gradually learning a strategy that maximizes the cumulative reward through constant interaction with the environment [14]. Reinforcement learning simulates human cognitive processes, spans a wide range of disciplines, and is studied in many fields such as game theory and control theory. Reinforcement learning allows agents to learn strategies from the environment, which is typically modeled as a Markov Decision Process (MDP) [15], and many algorithms employ dynamic programming techniques in this setting [16][17][18].
The basic reinforcement learning model includes:
a set of environment and agent states S = {s_1, s_2, …, s_n};
a set of agent actions A = {a_1, a_2, …, a_n};
a transition function δ(s, a) → s′ describing transitions between states;
a reward function r(s, a).
In many works, an environment is said to be fully observable if the agent is assumed to be able to observe the environmental state at the current moment, and partially observable otherwise. A reinforcement learning agent communicates with the environment in discrete time steps. As shown in Fig. 1, at each time t the agent receives an observation, which typically includes the reward r_t of this time step, and then selects an action a_t from the available actions; the action acts on the environment, which under this action reaches a new state s_{t+1}; the agent then obtains the reward r_{t+1} of the new time step, and this process repeats. Through interaction with the environment, a reinforcement learning agent gradually learns a strategy π: S → A that maximizes the cumulative reward. To learn a near-optimal strategy, the agent must adjust its strategy over a long period. The basic setting and learning process of reinforcement learning are very well suited to the power grid domain.
Regarding how to find the optimal strategy, we introduce here the value-function method. The value-function approach attempts to find a strategy that maximizes return by maintaining estimates of the expected returns of certain strategies. To formally define optimality, we define the value of a policy:
V^π(s) = E[R | s, π]   (2-1)
where R denotes the random return obtained by following strategy π from the initial state s. Define V*(s) as the maximum possible value of V^π(s):
V*(s) = max_π V^π(s)   (2-2)
A strategy that achieves these optimal values in every state is called an optimal strategy. Although state values suffice to define optimality, it is also useful to define action values. Given a state s, an action a and a policy π, the action value of the pair (s, a) under policy π is defined as:
Q^π(s, a) = E[R | s, a, π]   (2-3)
where R denotes the cumulative reward obtained by taking action a in state s and thereafter following policy π. From MDP theory, given the Q-values of the optimal strategy, we can always determine the optimal action by simply selecting the action with the highest Q-value in each state. The action-value function of such an optimal strategy is denoted Q*; knowing the optimal action values is sufficient to know how to act optimally.
When both the transition function and the reward function of the environment are unknown, we can use Q-learning to update the action-value function:
Q_t(s, a) ← (1 − α_t) Q_{t−1}(s, a) + α_t [r_t + γ max_{a′} Q_{t−1}(s′, a′)]   (2-4)
where α_t is the learning rate, r_t is the reward at the current time, and γ is the discount factor. At each interaction with the environment, the current action value Q_t(s, a) is updated once: a portion of the previous action value is retained, Q(s, a) is recomputed from the reward obtained at the current time and the new state reached, and the two are combined into the new action value.
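As a minimal illustration, update (2-4) can be realized with a Q-table held in memory; the Python sketch below assumes a hashable state encoding and uses the six price actions introduced later purely as placeholders:

```python
from collections import defaultdict

ACTIONS = ["Maintain", "Lower", "Raise", "Revert", "Inline", "MinMax"]
Q = defaultdict(float)  # Q-table keyed by (state, action); unseen pairs default to 0

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One application of update (2-4): keep part of the old estimate and
    blend in the reward plus the discounted best value of the next state."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```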
An artificial neural network is a computational model used in machine learning, computer science and other research fields [19][20]. Artificial neural networks are based on a large number of interconnected elementary units called artificial neurons.
Typically, the artificial neurons of adjacent layers are interconnected, and signals propagate from the first (input) layer to the last (output) layer. Current deep learning projects typically have thousands to millions of neural nodes and millions of connections. The goal of an artificial neural network is to solve problems in a human-like manner, although some kinds of neural networks are more abstract. The 'network' refers to the connections between artificial neurons in the different layers of the system. A typical artificial neural network is defined by three types of parameters:
the pattern of connections between the neurons of different layers;
the weights on these connections, which can be updated during subsequent learning;
the activation function, which converts a neuron's weighted input into its output activation.
Mathematically, the function f(x) represented by a neural network is defined as a composition of other functions g_i(x), which can conveniently be represented as a network structure with arrows describing the dependencies between variables. One widely used form is the nonlinear weighted sum:
f(x) = K(∑_i w_i g_i(x))   (2-5)
where K denotes the activation function. The most important property of the activation function is that it changes smoothly as the input changes: a small change in the input causes only a small change in the output. Thus, depending on the weights on the connections, the inputs are continually transformed until the output is finally formed. Such an output is generally not yet what we want, so the neural network must also learn; this possibility of learning is what makes neural networks most attractive. Given a specific task to be learned and a class F of candidate functions, learning means finding, through a series of observations, a function f* in F that solves the task. To this end we define a loss function C:
C : F → ℝ
such that for the optimal function f*, no other solution has a smaller loss value than f*:
C(f*) ≤ C(f)   for all f ∈ F
The loss function is an important concept in learning: it measures how far a particular solution is from the optimal solution. The learning process searches the solution space of the problem for the function with the smallest loss value. For application problems where the solution must be found from data, the loss must be a function of the actually observed samples. The loss function is usually defined as a statistic, since in general only observed samples can be evaluated. Therefore, the problem of finding the model function f is to minimize the loss function C = E[(f(x) − y)²], where the data pairs (x, y) come from some distribution D. In practice we usually have only N finite samples, so we can only minimize
Ĉ = (1/N) ∑_{i=1}^{N} (f(x_i) − y_i)²
Thus, the loss function is minimized over a set of samples of the data rather than over the theoretical distribution of the entire data set. By minimizing the sample-based loss, we find the optimal parameters of the neural network on these samples.
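As a small illustration of minimizing the empirical loss above, the following sketch fits a linear model f(x) = w·x + b by gradient descent on the sample-based squared error; the data, learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

def fit_linear_mse(X, y, lr=0.01, epochs=500):
    """Minimize (1/N) * sum_i (f(x_i) - y_i)^2 for f(x) = X @ w + b by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        err = X @ w + b - y                 # residuals f(x_i) - y_i
        w -= lr * (2.0 / n) * (X.T @ err)   # gradient of the empirical loss w.r.t. w
        b -= lr * (2.0 / n) * err.sum()     # gradient w.r.t. b
    return w, b

# Illustrative data: y is roughly 3*x plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
w, b = fit_linear_mse(X, y)
```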
Q-network. Since a neural network can fit a function, the Q-value function in reinforcement learning can also be fitted with a neural network [21][22]. The great advantage is that the state space of a traditional Q-table must be finite and not too large so that the values of state-action pairs can be stored; with a Q-network we do not need to consider discretizing the state space, and only need to feed the feature values characterizing the state directly into the neural network and let the network parameters fit the Q-value function, which naturally solves the problem of an infinite state space. However, unlike conventional neural network applications, reinforcement learning does not have a large set of samples at the beginning; instead, new rewards and observations are obtained by constantly interacting with the environment. Moreover, reinforcement learning has no sample labels that could serve as a basis for judging whether the model's output is accurate. Nevertheless, if we step outside the traditional use of neural networks and, exploiting the network's own function-fitting ability, treat the neural network as a means of storing Q(s, a) just like a Q-table, then every time the agent interacts with the environment we can update the parameters of the network, as we would update a Q-table, so that the Q(s, a) it outputs approaches the currently estimated value.
We now consider how to design the input, output and loss function of the Q-network so that it is functionally identical to a Q-table. First, the input is still the state s, but instead of discretizing the infinite state space into a finite number of states as in traditional reinforcement learning, each feature describing the state can be used directly as an input to the neural network. Meanwhile, just as a Q-table stores for each state a row of values representing the estimated cumulative reward of each action in that state, each node of the output layer of the neural network represents an action, and the output value of each node is the estimated cumulative value Q(s, a_i). By designing the input and output layers of the neural network in this way, we let the neural network implement the function of storing Q(s, a). At the same time, we must consider how to update the parameters of the artificial neural network. According to the definition of the loss function, we have no ready-made label y for an input state; however, following the Q-learning update rule for action values, we can use the Q(s, a) already stored in the network and the reward r_t at the current time to update the network parameters. For example, at time t the agent is in state s_t; after it selects action a_t according to the policy, it enters the next state s_{t+1} and obtains reward r_t. Now, when we update the parameters of the neural network, we want Q(s_t, a_t) to move toward the update target in Q-learning, r_t + max_{a′} Q(s′, a′):
C = [Q_t(s_t, a_t) − (r_t + max_{a′} Q_{t−1}(s_{t+1}, a′))]²   (2-8)
that is, the action value at the current time is made to approach the update target. A learning rate is likewise set for the update. Thus, the process of using a Q-network to store and update action values is the same as using a Q-table directly; the only difference is that the input layer of the neural network can accept the values of the features characterizing the state directly, whereas a Q-table requires the feature values to be discretized to reduce the state space.
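A minimal sketch of such a Q-network follows (PyTorch is assumed as the implementation library): the input is the raw state-feature vector, the output layer has one unit per action, and one parameter update moves Q(s_t, a_t) toward the target in (2-8). The hidden-layer sizes are assumptions; the 24 input features and 6 actions match the experimental setup described later.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state-feature vector to one Q-value per action (output layer = actions)."""
    def __init__(self, n_features: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def q_update(qnet, optimizer, s, a, r, s_next, gamma=0.95):
    """One gradient step on loss (2-8): [Q(s,a) - (r + max_a' Q(s',a'))]^2."""
    q_sa = qnet(s)[a]
    with torch.no_grad():
        target = r + gamma * qnet(s_next).max()
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

qnet = QNetwork(n_features=24, n_actions=6)   # 24 state features, 6 price actions
optimizer = torch.optim.RMSprop(qnet.parameters(), lr=1e-3)
```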
Reference to the literature
[1]M.Amin and B.Wollenberg.Toward a smart grid:Power delivery for the 21st century.IEEE Power and Energy Magazine,3(5):34-41,2005.
[2]C.Gellings,M.Samotyj,and B.Howe.The future's power delivery system.IEEE Power Energy Magazine,2(5):40-48,2004.
[3]Wang X,Zhang M,Ren F.Load Forecasting in a Smart Grid through Customer Behaviour Learning Using L1-Regularized Continuous Conditional Random Fields[C].Proceedings of the 2016International Conference on Autonomous Agents&Multiagent Systems.International Foundation for Autonomous Agents and Multiagent Systems,2016:817-826.
[4]Reddy P P,Veloso M M.Strategy learning for autonomous agents in smart grid markets[J].2011.
[5]Pardoe D,Stone P,Saar-Tsechansky M,et al.Adaptive Auction Mechanism Design and the Incorporation of Prior Knowledge[J].INFORMS Journal on Computing,2010,22(3):353-370.
[6]Babic J,Podobnik V.An analysis of power trading agent competition 2014[M].Agent-Mediated Electronic Commerce.Designing Trading Strategies and Mechanisms for Electronic Markets.Springer International Publishing,2014:1-15.
[7]Petrik M,Taylor G,Parr R,et al.Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes[J].Computer Science,2010.
[8]Ramavajjala V,Elkan C.Policy iteration based on a learned transition model[C].European Conference on Machine Learning and Knowledge Discovery in Databases.Springer-Verlag,2012:211-226.
[9]Lagoudakis M G,Parr R.Least-squares policy iteration[M].JMLR.org,2003.
[10]Venayagamoorthy G K.Potentials and promises of computational intelligence for smart grids[C].Power&Energy Society General Meeting,2009.PES'09.IEEE.IEEE,2009:1-6.
[11]Wikipedia[EB/OL].https://en.wikipedia.org/wiki/Smart_grid
[12]EPRI[EB/OL].https://www.epri.com/#/about/epri
[13]Kintner-Meyer M C,Chassin D P,Kannberg L D,et al.GridWise:The benefits of a transformed energy system[J].Pacific Northwest National Laboratory under contract with the United States Department of Energy,2008:25.
[14]Sutton R S,Barto A G.Reinforcement learning:An introduction[M].Cambridge:MIT press,1998.
[15]Littman M L.Markov games as a framework for multi-agent reinforcement learning[C].Proceedings of the eleventh international conference on machine learning.1994,157:157-163.
[16]Lewis F L,Vrabie D.Reinforcement learning and adaptive dynamic programming for feedback control[J].IEEE circuits and systems magazine,2009,9(3).
[17]Busoniu L,Babuska R,De Schutter B,et al.Reinforcement learning and dynamic programming using function approximators[M].CRC press,2010.
[18]Szepesvári C,Kioloa M.Reinforcement learning:dynamic programming[J].University of Alberta,MLSS,2008,8.
[19]Wikipedia[EB/OL].https://en.wikipedia.org/wiki/Artificial_neural_network
[20]Wang S C.Artificial neural network[M].Interdisciplinary computing in java programming.Springer US,2003:81-100.
[21]Mnih V,Kavukcuoglu K,Silver D,et al.Playing Atari with Deep Reinforcement Learning[J].Computer Science,2013.
[22]Huang B Q,Cao G Y,Guo M.Reinforcement learning neural network to the problem of autonomous mobile robot obstacle avoidance[C].Machine Learning and Cybernetics,2005.Proceedings of 2005International Conference on.IEEE,2005,1:85-89.
[23]DoE[EB/OL].http://www.eia.doe.gov,2010.
[24]Olfati-Saber R,Fax J A,Murray R M.Consensus and cooperation in networked multi-agent systems[J].Proceedings of the IEEE,2007,95(1):215-233.
[25]Ferber J.Multi-agent systems:an introduction to distributed artificial intelligence[M].Reading:Addison-Wesley,1999.
[26]Littman M L.Markov games as a framework for multi-agent reinforcement learning[C].Proceedings of the eleventh international conference on machine learning.1994,157:157-163.
[27]Tan M.Multi-agent reinforcement learning:Independent vs.cooperative agents[C].Proceedings of the tenth international conference on machine learning.1993:330-337.
[28]Wiering M.Multi-agent reinforcement learning for traffic light control[C].ICML.2000:1151-1158.
[29]Hernández L,Baladron C,Aguiar J M,et al.A multi-agent system architecture for smart grid management and forecasting of energy demand in virtual power plants[J].IEEE Communications Magazine,2013,51(1):106-113.
[30]Niu D,Wang Y,Wu D D.Power load forecasting using support vector machine and ant colony optimization[J].Expert Systems with Applications,2010,37(3):2531-2539.
[31]Li H Z,Guo S,Li C J,et al.A hybrid annual power load forecasting model based on generalized regression neural network with fruit fly optimization algorithm[J].Knowledge-Based Systems,2013,37:378-387.
[32]Gong S,Li H.Dynamic spectrum allocation for power load prediction via wireless metering in smart grid[C].Information Sciences and Systems(CISS),2011 45th Annual Conference on.IEEE,2011:1-6
[33]Xishun Wang,Minjie Zhang,and Fenghui Ren.A hybrid-learning based broker model for strategic power trading in smart grid markets.Knowledge-Based Systems,119,2016.
[34]Electricity consumption in a sample of london households,2015.https://data.london.gov.uk/dataset/smartmeter-energyuse-data-in-london-households.
[35]S Hochreiter and J Schmidhuber.Long short-term memory.Neural Computation,9(8):1735–1780,1997.
Disclosure of Invention
The invention aims to provide a multi-agent deep reinforcement learning agent method based on a smart grid, which aims to solve the problem that the state space of an agent is infinite.
The invention is realized in such a way, a multi-agent deep reinforcement learning agent method based on a smart grid, the multi-agent deep reinforcement learning agent method comprises the following steps:
S1, calculating the corresponding action standard value in the current state according to the reward obtained by the selected action, and updating the parameters of the neural network;
S2, establishing a multi-agent with 'external competition and internal cooperation' according to the types of consumers and producers;
S3, setting the reward function of each internal agent according to both the profit maximization of the agent's own actions and the benefits of the other internal agents, wherein the reward function is:
r_t^{B_k^i} = [formula provided as an image in the original]
where C represents the category of the consumer, P represents the category of the producer, B_k^i represents an internal agent of agent B_k, i.e. i ∈ {C_1, C_2, P_1, P_2}, κ_{t,C} represents the amount of electricity a consumer of a certain class consumes at time t, κ_{t,P} represents the amount of electricity a producer of a certain class produces at time t, and Φ_t^{B_k^i} is the imbalance portion of the cost when calculating the profit of the individual internal agent.
The invention further adopts the technical scheme that: the step S1 further comprises the following steps:
S11, initializing parameters of the neural network;
S12, initializing the state value at the beginning of each operation period;
S13, selecting an action at random with a certain probability, and otherwise selecting the action with the maximum action value in the current state;
S14, executing the selected action, obtaining the reward, and entering the next state;
S15, calculating the standard value corresponding to the current state to update the neural network parameters, so that the stored Q(s_t, a_t) approaches y_t.
The invention further adopts the technical scheme that: in the step S15, the action values are stored in the parameters; each time a new state is entered, the feature values are sequentially input into the neural network, and the action with the maximum Q(s, a) value can be selected from the output layer of the neural network as the next action to execute.
The invention further adopts the technical scheme that: the step S2 comprises the following steps:
S21, classifying consumers according to differences in power consumption;
S22, classifying producers according to their actual power generation conditions.
The invention further adopts the technical scheme that: in the step S3, through the reward function, each agent considers both its own benefit and the overall benefit when selecting actions.
The invention further adopts the technical scheme that: the consumers are classified into daytime consuming users and full-day consuming users according to the condition of consuming power.
The invention further adopts the technical scheme that: the producers are classified into full-day generators and daytime generators according to the actual power generation conditions.
The beneficial effects of the invention are as follows: the input layer of the neural network may accept direct input of values of the feature characterizing the state, while the Q-table needs to discretize the feature values to reduce the state space.
Drawings
Fig. 1 is a classical scenario diagram of reinforcement learning.
FIG. 2 is a schematic diagram of a neural network including a hidden layer according to an embodiment of the present invention: the neurons of the first layer pass data through synapses to the neurons of the second layer, which in turn pass data through synapses to the neurons of the third layer; the synapses store parameters, called weights, that manipulate the data in the computation.
Fig. 3 is a schematic diagram of a proxy framework.
FIG. 4 is a schematic diagram of the recurrent (LSTM-based) DQN.
FIG. 5 is a schematic of the benefit distribution for each of the 20 runs.
FIG. 6 is a schematic diagram of the revenue distribution for each of 20 runs of experiments in a multi-class user environment.
Fig. 7 is a graph of power usage by different types of users.
FIG. 8 is a schematic representation of proxy revenue for an evaluation period.
Detailed Description
In this work, the negotiation algorithm of the agent is improved in two respects: first, the problem of the agent's infinite state space is solved; second, the local environment is slightly changed to make the setting more realistic, and a corresponding multi-agent design with 'external competition and internal cooperation' is proposed, making the agent more competitive. Finally, we introduce real electricity usage data and use some advanced time-series techniques to help our agent framework learn effective pricing strategies in a more complex environment.
Setting of the local smart grid market. The smart grid setting follows [4]; only in the second modification are the kinds of consumers and producers and the way electricity is produced and consumed redesigned. In the local market there are consumers who consume electricity and small producers who produce electricity, as well as several agents that buy and sell electricity between them. Agents are needed because direct coordination between small producers and consumers is inconvenient: buying and selling electricity through the intermediary of an agent is more convenient for power users and allows resources to be coordinated better, ensuring the balance of power supply and demand. Concretely, each agent issues one contract per hour to all producers and consumers; all users select contracts from the different agents, and each agent can learn the contract prices of the other agents at that moment and how many producers and consumers selected its own contract. In this way, each agent adjusts its contract price for the next moment according to its contract subscriptions and the other agents' contract prices, so as to maximize its own profit. Thus each hour is taken as the basic time unit: the agents and the environment interact once, and users subscribe to contracts.
When the power required by the consumers and the power supplied by the producers subscribing to an agent's contracts do not match, an imbalance between power supply and demand occurs. In that case, instead of balancing power through a wholesale market, we set a penalty cost that punishes the supply-demand imbalance. We now describe this local market more precisely through definitions. First, electricity prices are set in the range 0.01 to 0.20 [23], with a minimum price change of 0.01. Each agent B_k (k = 1, 2, …, K) posts two prices at time t: a bid to consumers, p_t^{B_k,C}, and a bid to producers, p_t^{B_k,P}. In addition, at each time t, agent B_k records the numbers of consumers and producers subscribed to it, N_t^{B_k,C} and N_t^{B_k,P}. For convenience, we assume that each consumer consumes κ_{t,C} units of power at each time t and that each producer produces κ_{t,P} units. Finally, we set the imbalance charge per unit of power at time t to φ_t. Agent B_k's reward at time t is then its revenue from consumers, minus its payment to producers, minus the imbalance penalty:
r_t^{B_k} = p_t^{B_k,C} · N_t^{B_k,C} · κ_{t,C} − p_t^{B_k,P} · N_t^{B_k,P} · κ_{t,P} − φ_t · |N_t^{B_k,C} · κ_{t,C} − N_t^{B_k,P} · κ_{t,P}|
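A sketch of this hourly profit computation, under the reconstruction above; the argument values are illustrative only and loosely follow the experimental settings described later:

```python
def broker_reward(p_consumer, p_producer, n_consumers, n_producers,
                  kappa_c, kappa_p, phi):
    """Hourly profit of one agent: consumer revenue - producer payment - imbalance penalty."""
    revenue = p_consumer * n_consumers * kappa_c
    payment = p_producer * n_producers * kappa_p
    imbalance = abs(n_consumers * kappa_c - n_producers * kappa_p)
    return revenue - payment - phi * imbalance

# Illustrative values: 200 subscribed consumers, 20 subscribed producers.
r = broker_reward(p_consumer=0.13, p_producer=0.10,
                  n_consumers=200, n_producers=20,
                  kappa_c=10, kappa_p=100, phi=0.1)
```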
This defines the basic mode of operation of the local market in general terms. Next, we define two state indicators. The first, PriceRangeStatus (PRS), determines whether the market is rational; for the market to be rational at an agent, the following condition must hold:
[condition provided as a formula image in the original]
where μ_L is a subjective value representing the agent's expectation of the marginal benefit of the market, and
[definitions provided as formula images in the original]
where B_L denotes this agent itself. The second indicator, PortfolioStatus (PS), indicates whether the agent itself has achieved supply-demand balance. Next, we define several price operations as the action set from which all agents choose.
A = {Maintain, Lower, Raise, Revert, Inline, MinMax}
Using these actions, each agent sets at time t its prices for the next time step, p_{t+1}^{B_k,C} and p_{t+1}^{B_k,P} (a sketch of these operations is given after the list):
● Maintain keeps the prices of the previous time step;
● Lower decreases both the producer and consumer prices at time t by 0.01;
● Raise increases both the producer and consumer prices at time t by 0.01;
● Revert moves the prices by 0.01 toward the midpoint price [formula provided as an image in the original];
● Inline sets the new producer and consumer prices as [formulas provided as images in the original];
● MinMax sets the new producer and consumer prices as [formulas provided as images in the original].
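A sketch of how these six actions might adjust an agent's two prices; the handling of Revert, Inline and MinMax here is an assumption (moving toward the price midpoint, anchoring on competitors' best prices, and taking the widest margin, respectively), since their exact formulas are given only as images in the original:

```python
PRICE_MIN, PRICE_MAX, STEP = 0.01, 0.20, 0.01

def clip(p):
    return min(max(p, PRICE_MIN), PRICE_MAX)

def apply_action(action, p_c, p_p, rival_consumer_prices, rival_producer_prices):
    """Return the next (consumer price, producer price) after one of the six actions.

    Revert/Inline/MinMax are assumptions, not the patent's exact formulas.
    """
    if action == "Maintain":
        return p_c, p_p
    if action == "Lower":
        return clip(p_c - STEP), clip(p_p - STEP)
    if action == "Raise":
        return clip(p_c + STEP), clip(p_p + STEP)
    if action == "Revert":
        mid = (p_c + p_p) / 2.0
        return (clip(p_c + STEP if p_c < mid else p_c - STEP),
                clip(p_p + STEP if p_p < mid else p_p - STEP))
    if action == "Inline":   # assumed: undercut cheapest rival consumer price, outbid highest rival producer price
        return (clip(min(rival_consumer_prices) - STEP),
                clip(max(rival_producer_prices) + STEP))
    if action == "MinMax":   # assumed: widest admissible margin
        return PRICE_MAX, PRICE_MIN
    raise ValueError(action)
```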
Setting of fixed-strategy competing agents. For comparison and validation we designed several fixed-strategy agents. The balancing strategy tries to reduce the supply imbalance by adjusting the producer and consumer contract prices: it raises prices when it sees excess demand and lowers them when it sees excess supply. The greedy strategy tries to maximize profit by increasing the profit margin, i.e., maximizing the difference between the consumer and producer contract prices while the market is rational. Both strategies can be characterized as adaptive in that they react to market and portfolio conditions, but they do not learn from the past. We also designed two non-adaptive agents: a fixed-strategy agent that always maintains a certain price, and a random agent that randomly selects one of the six actions each time it adjusts its prices.
Table 3-1 Balancing algorithm [table provided as an image in the original]
Table 3-2 Greedy algorithm [table provided as an image in the original]
The first modification changes the storage structure of Q-learning from a Q-table to a Q-network; the current method is otherwise completely consistent with Q-learning, i.e., the parameters inside the storage structure are updated after each interaction with the environment. Later work also considers an experience replay mechanism.
Table 3-3 Q-learning algorithm using a Q-network [algorithm listing provided as an image in the original]
The first line of the algorithm initializes the parameters of the neural network. The second line indicates that the experiment runs for M periods; the third line initializes the state at the beginning of each period; the fifth and sixth lines select an action at random with a certain probability and otherwise select the action with the largest action value in the current state. The seventh line executes the selected action, obtains the reward, and enters the next state. The eighth line computes the standard value (the update target) y_t of the corresponding action in the current state, and the ninth line updates the parameters of the neural network according to the value computed in the eighth line, so that the stored Q(s_t, a_t) approaches y_t. In this way, the action values are stored in the parameters; whenever a new state is entered, its feature values are simply input into the neural network in order, and the action with the largest Q(s, a) value can be selected from the output layer of the network as the next action to execute.
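A sketch of this training loop follows (the exact listing in Table 3-3 is given only as an image); `env` is a hypothetical environment object with reset()/step(), and the hyperparameter values are assumptions:

```python
import random
import torch

def train(qnet, optimizer, env, n_actions, episodes=300, steps=240,
          epsilon=0.1, gamma=0.95):
    """Q-learning with a Q-network: epsilon-greedy interaction plus target-regression updates."""
    for _ in range(episodes):                     # run M periods
        s = torch.as_tensor(env.reset(), dtype=torch.float32)   # initialize the state
        for _ in range(steps):
            if random.random() < epsilon:         # random action with probability epsilon
                a = random.randrange(n_actions)
            else:                                 # otherwise the greedy action
                a = int(qnet(s).argmax())
            s_next, r, done = env.step(a)         # act, observe reward and next state
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            with torch.no_grad():                 # target y_t = r + gamma * max_a' Q(s', a')
                y = r + gamma * qnet(s_next).max()
            loss = (qnet(s)[a] - y) ** 2          # move Q(s_t, a_t) toward y_t
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            s = s_next
            if done:
                break
```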
In addition to using a neural network to store the reinforcement learning action values, the agent-based intelligent negotiation algorithm considers the more realistic situation in which consumers are of multiple types and small producers are likewise divided into wind power and solar power. To investigate this common phenomenon, we made corresponding changes to the environment. First, consumers are divided into two types: ordinary users who do not consume power at night, and users who consume power all day. Then, according to the actual situation, producers are also divided into two types: wind power generators that generate power all day, and solar power generators that generate power only in the daytime. There are therefore four types of users in the current grid environment, and the original approach of a single agent uniformly adjusting one pair of prices is to a certain extent no longer applicable. We therefore propose a new multi-agent design with 'external competition and internal cooperation': externally it appears as a single agent in the competition, but internally one agent is actually assigned to each type of user. The internal agents can coordinate and cooperate with each other [24][25][26][27][28]; such a multi-agent framework adapts better to particularly complex external environments and, under the original grid rules, can adjust the contract prices of each user type in a more targeted way so as to maximize its own profit.
However, although this multi-agent design appears externally as a single agent, there are four different agents inside, and how to ensure that these agents work together internally must be considered. To make the internal agents cooperate as much as possible and form a real group that competes with the other agents, we need to redesign the reward function of each internal agent, so that each agent's actions take into account not only maximizing its own profit but also the benefits of the other internal agents. We redesign the reward function of each internal agent as follows:
r_t^{B_k^i} = [formula provided as an image in the original]
where C represents the category of the consumer, P represents the category of the producer, and B_k^i represents an internal agent of agent B_k, i.e. i ∈ {C_1, C_2, P_1, P_2}. κ_{t,C} represents the amount of electricity a consumer of a certain class consumes at time t, and κ_{t,P} represents the amount of electricity a producer of a certain class produces at time t. Φ_t^{B_k^i} is the imbalance portion of the cost when computing the profit of the individual internal agent:
Φ_t^{B_k^i} = [formula provided as an image in the original]
Furthermore:
[formula provided as an image in the original]
since it is only possible for a single agent to purchase power from the producer's hand or sell power to the consumer, it is not good to directly measure his own profit, but we can instead consider its contribution to the total profit, i.e. the loss to the total profit without this agent buying or selling power is the agent's own profit. By considering from the overall relationship, we have obtained the self-profit of a single agent. Thus, by means of the newly designed reward function, each agent can consider the own interests and the whole interests when selecting actions.
Multi-agent framework under real-data simulation. To verify the effectiveness of our agent framework in complex environments, we introduce the real 2013 electricity data of household users in London [34], from which we selected about 1000 users. First, issuing only one price to all consumers is not sufficient: although we consider only household users in the retail market, their electricity usage patterns differ because of different lifestyles and consumption habits. Using multiple agents to issue corresponding electricity prices to different consumer groups can therefore better promote supply-demand balance. Here we group consumers according to their electricity usage profiles. Since power consumption is time-series data, our agent clusters users with K-Means based on the Dynamic Time Warping (DTW) distance. After clustering, we obtain groups of users with similar electricity usage behavior. The agent structure in the real-data simulation environment is shown in Fig. 3.
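One way to realize this clustering step is sketched below, assuming the tslearn library (the description does not name a specific implementation); the input is assumed to be an array of per-user hourly consumption curves:

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

def cluster_users(usage: np.ndarray, n_clusters: int = 5, seed: int = 0) -> np.ndarray:
    """K-Means with a DTW distance over per-user consumption curves; returns one cluster id per user.

    usage: array of shape (n_users, n_timesteps) with hourly consumption (assumed input format).
    """
    series = TimeSeriesScalerMeanVariance().fit_transform(usage[:, :, np.newaxis])
    model = TimeSeriesKMeans(n_clusters=n_clusters, metric="dtw", random_state=seed)
    return model.fit_predict(series)
```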
Second, since users' power consumption behavior varies from moment to moment, we use Long Short-Term Memory (LSTM) [35], which performs excellently on time series, to enhance our network architecture and help the agent better extract temporal information from past market information so as to make effective decisions. The neural network architecture used by our agent is shown in Fig. 4.
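A sketch of such an LSTM-enhanced Q-network (a recurrent head over a short window of past market observations); the layer sizes are assumptions, and the window length of 3 is chosen to match the time-series state length used later in the evaluation:

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """LSTM over a window of past market states, followed by a linear Q-value head."""
    def __init__(self, n_features: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # Q-values from the last time step

qnet = RecurrentQNetwork(n_features=24, n_actions=6)
q_values = qnet(torch.zeros(1, 3, 24))    # window of 3 past observations
```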
Setting of experimental parameters. Besides the parameters in the definition of the method, many parameters arise in running the experiments; we describe them here one by one. Our experiments have five agents in total: our agent, a balancing-strategy agent, a greedy-strategy agent, a fixed-price agent and a random-action agent. The number of consumers in the local grid market is set to 1000 and the number of producers to 100; each consumer consumes 10 basic power units per hour and each producer produces 100 basic power units per hour. The imbalance charge per unit of power is 0.1; note that the imbalance charge cannot be set too small, to prevent agents from cheaply attracting consumer subscriptions at prices as low as possible without buying power from producers. Furthermore, considering that real users have some inertia in their subscriptions, we set the users' selection preferences to {35, 30, 20, 10, 5}, meaning that 35% of consumers select the subscription with the lowest contract price, 30% select the one with the second lowest contract price, and so on; producers choose analogously according to the selection preferences, starting from the highest price. In the experiments we set the initial prices per unit of electricity to 0.13 for electricity sold and 0.10 for electricity purchased, and the subjective marginal benefit of the market μ_L to 0.02. The number of operation periods is set to 300: the first 200 periods are the learning phase, in which the agent learns, and the last 100 periods are the statistics phase, in which the total profit of each agent is used as the final criterion for judging whether the agent algorithm is competitive. Each period has 10 days of 24 hours, i.e., 240 basic time units per period. For Q-learning we use the ε-greedy strategy.
Q-network experiments. For the design of the neural network we use a network with two hidden layers. The input layer receives the state; to make full use of the information given by the environment, the state features are designed as the contract prices of all agents and the number of this agent's subscribers at the previous moment, plus the contract prices of all agents and the number of this agent's subscribers at the current moment, for a total of 24 input units. The output layer has six output units, one for each of the six price actions; each output value represents the expected cumulative reward of selecting that action in the input state and then continuing according to the strategy. In addition, we use Xavier initialization for the parameters and train the parameters of the neural network with the RMSProp algorithm and gradient descent. Furthermore, we repeat the whole experiment for 20 runs and average the total reward to obtain the final performance of the agent using the Q-network-based Q-learning algorithm, while showing the advantages and disadvantages of storing the action values with a Q-network compared with the previous agent using the Q-table-based Q-learning algorithm. Note that, following the setting of previous work, the state of the agent using the Q-table is designed as the combination of the PRS and PS indicators at the previous and the current moment.
Table 4-1 Average rewards of each agent over 20 runs [table provided as an image in the original]
Table 4-2 Average rewards of each agent over 20 runs [table provided as an image in the original]
From the two tables above we can see that the agent using Q-learning is significantly more competitive than the agents with the other strategies; the greedy strategy is the only agent with a positive total profit apart from the reinforcement learning agents. The fixed-strategy agent has the smallest total profit, because its prices never change and it is therefore easily beaten, while the agents with the balancing and greedy strategies take the second and third places in total profit, illustrating the superiority of adaptive strategies; the agent with the reinforcement learning algorithm clearly leads the other agents, illustrating the superiority of learning from the past. The agent that uses a Q-network to store past experience performs better than the one that uses a Q-table, indicating the importance of a more accurate state representation. As can also be seen from the figure, the revenue of the agent using the Q-network is more stable, staying roughly around 1,500,000, while the revenue of the agent using the Q-table fluctuates more and is relatively less stable, as shown in Fig. 5.
Experiments with the multi-agent design. We also carried out experiments with the multi-agent design, but first we need to modify the configuration of the grid environment. Since there are two groups of producers and two groups of consumers, to stay as close as possible to the original experimental parameters we set the number of consumers who use electricity only in the daytime to 500 and the number of consumers who use electricity all day to 500. The daily consumption profile of a daytime-only consumer is {0,0,0,0,0,0,10,10,10,10,10,10,10,10,10,10,10,10,0,0,0,0,0,0}, i.e., no electricity is used in the first six and last six hours of the day, while an all-day consumer uses 10 basic power units every hour. Further, we set the number of wind power producers to 50 and the number of solar power producers to 50; wind power is generated all day at 100 basic power units per hour, while the daily generation profile of solar power is {0,0,0,0,0,0,100,100,100,100,100,100,100,100,100,100,100,100,0,0,0,0,0,0}. At the same time, we slightly adjust the selection preferences of the different kinds of producers and consumers. In this experiment, to compare better with previous work, we use a Q-table as the structure for storing action values, while the input state adds, compared with previous work, a feature indicating whether the current time is day or night.
Table 4-3 Selection preferences of different kinds of users [table provided as an image in the original]
In this experiment, since the external conditions have changed, in order to show that the multi-agent design with 'external competition and internal cooperation' performs better than the original single agent, the multi-agent and single-agent designs are put into the experiment to compete together. The experiment was run repeatedly and the results averaged below.
Table 4-4 Average profit over 20 rounds of experiments in the multi-class user environment [table provided as an image in the original]
As can be seen from the table above and the curves in FIG. 6, the multi-agent and single-agent designs hold an absolute advantage in the competition, and the multi-agent design beats the single agent in every round, which means that the multi-agent design is better suited to the market and more competitive than the single agent. Meanwhile, we see that the total profit of every agent is greatly improved compared with the experiment in Section 4.2. We conjecture that this is because we removed the fixed-price agent from the original experiment and added a multi-agent design that can adapt to the environment and adjust prices to balance supply and demand, so the whole market is dominated by agents capable of balancing supply and demand; compared with the previous experiment, the supply-demand balance of electricity in the whole smart grid market is better maintained, the imbalance cost of the agents is greatly reduced, and the overall profit level rises.
To verify that the design concept of 'internal cooperation' works, we set the reward function of each internal agent either to the overall reward function of the agent, r_t^{B_k}, or to our designed individual reward function, r_t^{B_k^i}, and ran 10 experiments with each. The average total profit obtained with our designed individual reward function is 23.37% higher than that obtained by directly using the overall reward function, showing that the individual reward function works better than the overall reward function: it lets each agent consider its own benefit and the overall benefit at the same time, maximizing its own benefit on the basis of the overall benefit, which is more flexible and more targeted than considering only the overall benefit.
Experiments with the multi-agent design under real-data simulation. First, the users remaining after data cleaning are clustered; based on other empirical data they are grouped into 5 classes, with population sizes {215; 97; 317; 274; 79}. The resulting electricity usage curves are shown in Fig. 7.
As can be seen from the figure, the electricity usage profiles of the different user classes differ greatly, which poses a great challenge to our agents. In addition, in order to model users more realistically, we model the users' selection behavior in the grid: in line with the general stickiness of grid users, each user is assigned a random psychological price within a certain range. When the current bid of the agent the user last signed with is better than the user's psychological expectation, the user renews the contract; otherwise, the user selects a power contract with a certain probability according to a reordering of the prices from best to worst.
Table 4-1 User price selection model [table provided as an image in the original]
The data of February 2013 from the London household electricity data are selected as the consumers' consumption data; to ensure the overall balance of electricity supply and demand, the two types of producers are each assigned half of the generation task. Although the power supply and demand of the system as a whole are balanced, because the consumption behavior of each consumer differs and the selection behavior of each user also differs, it is very difficult for the agent itself to balance supply and demand internally at every moment. The random range of the users' psychological price is [0.10, 0.15], the number of training periods is 50, the number of evaluation periods is 10, and the length of the time-series state is 3. The profits of the final evaluation period are shown in Fig. 8.
We have discussed the pricing problem of retail agents in smart grid retail markets. We first apply DRL in retail agent design to overcome the limitations of a discretized state space, and use LSTM and a DTW-based clustering mechanism to strengthen our agent for better application in realistic environments. By clustering customers, we design a cooperative multi-agent deep reinforcement learning agent framework with a distinctive reward function. Finally, by introducing London household electricity consumption data, we verify the adaptability and strong competitiveness of our agent framework in complex environments. As future work, we will explore applying more advanced DRL techniques (e.g., actor-critic algorithms) to our retail agent design to produce more efficient pricing strategies.
In addition, by considering actual small-scale power generation data and household power storage devices, the agent mechanism can be further generalized to a more realistic smart grid. Load forecasting for the power grid is also an actively researched topic [29][30][31][32]; as future work we will start from this point, carry out accurate classification and modeling analysis of users, let the agent automatically recognize a user's category, and then let the corresponding internal agent manage that user.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. The multi-agent deep reinforcement learning agent method based on the smart grid is characterized by comprising the following steps of:
S1, calculating the corresponding action standard value in the current state according to the reward obtained by the selected action, and updating the parameters of the neural network;
S2, establishing a multi-agent with 'external competition and internal cooperation' according to the types of consumers and producers;
S3, setting the reward function of each internal agent according to both the profit maximization of the agent's own actions and the benefits of the other internal agents, wherein the reward function is:
[reward-function formulas provided as images in the original]
wherein C represents the category of the consumer, P represents the category of the producer, B_k^i represents an internal agent of agent B_k, i.e. i ∈ {C_1, C_2, P_1, P_2}, κ_{t,C} represents the amount of electricity a consumer of a certain class consumes at time t, κ_{t,P} represents the amount of electricity a producer of a certain class produces at time t, Φ_t^{B_k^i} is the imbalance portion of the cost when calculating the profit of the individual internal agent, N_t^{B_k,C} is the number of consumers that agent B_k records as subscribed to it at each time t, p_t^{B_k,C} is the bid of each agent B_k (k = 1, 2, …, K) to the consumers at time t, N_t^{B_k,P} is the number of producers that agent B_k records as subscribed to it at each time t, and p_t^{B_k,P} is the bid of each agent B_k (k = 1, 2, …, K) to the producers at time t;
the step S1 further comprises the following steps:
S11, initializing parameters of the neural network;
S12, initializing the state value at the beginning of each operation period;
S13, selecting an action at random with a certain probability, and otherwise selecting the action with the maximum action value in the current state;
S14, executing the selected action, obtaining the reward, and entering the next state;
S15, calculating the standard value corresponding to the current state to update the neural network parameters, so that the stored Q(s_t, a_t) approaches y_t, wherein Q(s_t, a_t) is the value function fitted by the neural network and y_t is the true long-term cumulative return;
the step S2 comprises the following steps:
S21, classifying consumers according to differences in their power consumption;
S22, classifying producers according to their actual power generation conditions;
in the step S3, through the reward function, each agent considers the benefits of the other internal agents at the same time as its own benefit when selecting actions.
2. The multi-agent deep reinforcement learning agent method according to claim 1, wherein in the step S15 the action values are stored in the network parameters; each time a new state is entered, the feature values only need to be input into the neural network in order, and the action with the maximum Q(s, a) value can be selected from the output layer of the neural network as the next action to execute, wherein Q(s, a) is the value function fitted by the neural network.
3. The multi-agent deep reinforcement learning agent method according to claim 2, wherein the consumers are classified into daytime-consumption users and all-day-consumption users according to their power consumption patterns.
4. The multi-agent deep reinforcement learning agent method according to claim 3, wherein the producers are classified into all-day producers and daytime producers according to their actual power generation conditions.
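The following is a minimal, hypothetical sketch of the ε-greedy training loop described in steps S11–S15, together with a profit-style reward in the spirit of step S3. It is not the patented implementation: the network architecture, the environment interface (env.reset, env.step), and all names (QNet, broker_reward, epsilon, gamma) are illustrative assumptions, and broker_reward merely combines the quantities defined in claim 1 (offers to consumers, bids to producers, subscribed counts, and an imbalance cost) in an obvious revenue-minus-cost form; the actual reward formulas are the image equations in the original claims.

import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Value network Q(s, ·): maps a state feature vector to one Q-value per price action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def broker_reward(offer, bid, n_consumers, n_producers, kappa_c, kappa_p, imbalance_cost):
    """Hypothetical per-step profit of one internal agent: revenue from subscribed
    consumers minus payment to subscribed producers minus the imbalance cost."""
    return offer * n_consumers * kappa_c - bid * n_producers * kappa_p - imbalance_cost

def train(env, state_dim, n_actions, episodes=100, gamma=0.95, epsilon=0.1, lr=1e-3):
    q = QNet(state_dim, n_actions)                            # S11: initialize network parameters
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(episodes):
        s = torch.as_tensor(env.reset(), dtype=torch.float32)  # S12: initial state of the period
        done = False
        while not done:
            # S13: epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(q(s).argmax())
            # S14: execute the action, observe the reward and the next state
            s_next, r, done = env.step(a)
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            # S15: compute the target value y_t and update the parameters
            # so that Q(s_t, a_t) approaches y_t
            with torch.no_grad():
                y = r if done else r + gamma * q(s_next).max()
            loss = loss_fn(q(s)[a], torch.as_tensor(y, dtype=torch.float32))
            opt.zero_grad()
            loss.backward()
            opt.step()
            s = s_next
    return q

The probability epsilon in step S13 trades off exploration against exploitation and would typically be decayed over training.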
CN201880000858.4A 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning agent method based on smart grid Active CN110945542B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/093753 WO2020000399A1 (en) 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning proxy method based on intelligent grid

Publications (2)

Publication Number Publication Date
CN110945542A CN110945542A (en) 2020-03-31
CN110945542B true CN110945542B (en) 2023-05-05

Family

ID=68984589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880000858.4A Active CN110945542B (en) 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning agent method based on smart grid

Country Status (2)

Country Link
CN (1) CN110945542B (en)
WO (1) WO2020000399A1 (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369108A (en) * 2020-02-20 2020-07-03 华中科技大学鄂州工业技术研究院 Power grid real-time pricing method and device
CN111709706B (en) * 2020-06-09 2023-08-04 国网安徽省电力有限公司安庆供电公司 Automatic generation method of new equipment starting scheme based on self-adaptive pattern recognition
CN111639756B (en) * 2020-06-12 2023-05-12 南京大学 Multi-agent reinforcement learning method based on game reduction
CN111817349B (en) * 2020-07-31 2023-08-25 三峡大学 Multi-micro-grid passive off-grid switching control method based on deep Q learning
CN112215350B (en) * 2020-09-17 2023-11-03 天津(滨海)人工智能军民融合创新中心 Method and device for controlling agent based on reinforcement learning
CN111967199B (en) * 2020-09-23 2022-08-05 浙江大学 Agent contribution distribution method under reinforcement learning multi-agent cooperation task
CN112286203B (en) * 2020-11-11 2021-10-15 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112446470B (en) * 2020-11-12 2024-05-28 北京工业大学 Reinforced learning method for coherent synthesis
CN114619907B (en) * 2020-12-14 2023-10-20 中国科学技术大学 Coordinated charging method and coordinated charging system based on distributed deep reinforcement learning
CN112819144B (en) * 2021-02-20 2024-02-13 厦门吉比特网络技术股份有限公司 Method for improving convergence and training speed of neural network with multiple agents
US11817704B2 (en) * 2021-02-23 2023-11-14 Distro Energy B.V. Transparent customizable and transferrable intelligent trading agent
CN112884129B (en) * 2021-03-10 2023-07-18 中国人民解放军军事科学院国防科技创新研究院 Multi-step rule extraction method, device and storage medium based on teaching data
CN113469839A (en) * 2021-06-30 2021-10-01 国网上海市电力公司 Smart park optimization strategy based on deep reinforcement learning
CN113570039B (en) * 2021-07-22 2024-02-06 同济大学 Block chain system based on reinforcement learning optimization consensus
CN113555870B (en) * 2021-07-26 2023-10-13 国网江苏省电力有限公司南通供电分公司 Q-learning photovoltaic prediction-based power distribution network multi-time scale optimal scheduling method
CN113687960B (en) * 2021-08-12 2023-09-29 华东师范大学 Edge computing intelligent caching method based on deep reinforcement learning
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN114169216A (en) * 2021-10-22 2022-03-11 北京理工大学 Multi-agent heterogeneous target cooperative coverage method based on self-adaptive partitioning
CN114123178B (en) * 2021-11-17 2023-12-19 哈尔滨工程大学 Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method
US12039062B2 (en) * 2021-12-09 2024-07-16 Huawei Technologies Co., Ltd. Methods, systems and computer program products for protecting a deep reinforcement learning agent
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114329936B (en) * 2021-12-22 2024-03-29 太原理工大学 Virtual fully-mechanized production system deduction method based on multi-agent deep reinforcement learning
CN114362221B (en) * 2022-01-17 2023-10-13 河海大学 Regional intelligent power grid partition evaluation method based on deep reinforcement learning
CN114666840A (en) * 2022-03-28 2022-06-24 东南大学 Load balancing method based on multi-agent reinforcement learning
CN114881688B (en) * 2022-04-25 2023-09-22 四川大学 Intelligent pricing method for power distribution network considering distributed resource interaction response
CN115065728B (en) * 2022-06-13 2023-12-08 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115310999B (en) * 2022-06-27 2024-02-02 国网江苏省电力有限公司苏州供电分公司 Enterprise electricity behavior analysis method and system based on multi-layer perceptron and sequencing network
US11956138B1 (en) 2023-04-26 2024-04-09 International Business Machines Corporation Automated detection of network anomalies and generation of optimized anomaly-alleviating incentives
CN116599061B (en) * 2023-07-18 2023-10-24 国网浙江省电力有限公司宁波供电公司 Power grid operation control method based on reinforcement learning
CN116912356B (en) * 2023-09-13 2024-01-09 深圳大学 Hexagonal set visualization method and related device
CN117648123B (en) * 2024-01-30 2024-06-11 中国人民解放军国防科技大学 Micro-service rapid integration method, system, equipment and storage medium
CN117808174B (en) * 2024-03-01 2024-05-28 山东大学 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107067190A (en) * 2017-05-18 2017-08-18 厦门大学 The micro-capacitance sensor power trade method learnt based on deeply
CN107179077A (en) * 2017-05-15 2017-09-19 北京航空航天大学 A kind of self-adaptive visual air navigation aid based on ELM LRF
CN107623337A (en) * 2017-09-26 2018-01-23 武汉大学 A kind of energy management method for micro-grid

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332373A1 (en) * 2009-02-26 2010-12-30 Jason Crabtree System and method for participation in energy-related markets
CN102622269B (en) * 2012-03-15 2014-06-04 广西大学 Java agent development (JADE)-based intelligent power grid power generation dispatching multi-Agent system
CN105022021B (en) * 2015-07-08 2018-04-17 国家电网公司 A kind of state identification method of the Electric Energy Tariff Point Metering Device based on multiple agent
CN105550946A (en) * 2016-01-28 2016-05-04 东北电力大学 Multi-agent based electricity utilization strategy capable of enabling residential users to participate in automated demand response

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107179077A (en) * 2017-05-15 2017-09-19 北京航空航天大学 A kind of self-adaptive visual air navigation aid based on ELM LRF
CN107067190A (en) * 2017-05-18 2017-08-18 厦门大学 The micro-capacitance sensor power trade method learnt based on deeply
CN107623337A (en) * 2017-09-26 2018-01-23 武汉大学 A kind of energy management method for micro-grid

Also Published As

Publication number Publication date
WO2020000399A1 (en) 2020-01-02
CN110945542A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110945542B (en) Multi-agent deep reinforcement learning agent method based on smart grid
Antonopoulos et al. Artificial intelligence and machine learning approaches to energy demand-side response: A systematic review
Chen et al. Trading strategy optimization for a prosumer in continuous double auction-based peer-to-peer market: A prediction-integration model
Pinto et al. Multi-agent-based CBR recommender system for intelligent energy management in buildings
Yang et al. Recurrent deep multiagent q-learning for autonomous brokers in smart grid.
Kumar et al. Strategic bidding using fuzzy adaptive gravitational search algorithm in a pool based electricity market
Sensfuß et al. Agent-based simulation of electricity markets. A literature review
Rettieva Equilibria in dynamic multicriteria games
Chen et al. Customized rebate pricing mechanism for virtual power plants using a hierarchical game and reinforcement learning approach
Han et al. Evolutionary game based demand response bidding strategy for end-users using Q-learning and compound differential evolution
Azadi Hematabadi et al. Optimizing the multi-objective bidding strategy using min–max technique and modified water wave optimization method
Gao et al. Bounded rationality based multi-VPP trading in local energy markets: a dynamic game approach with different trading targets
Chuang et al. Deep reinforcement learning based pricing strategy of aggregators considering renewable energy
Lincoln et al. Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade
KR20230070779A (en) Demand response management method for discrete industrial manufacturing system based on constrained reinforcement learning
Tripura et al. Simultaneous streamflow forecasting based on hybridized neuro-fuzzy method for a river system
Wang et al. Multi-agent simulation for strategic bidding in electricity markets using reinforcement learning
Xu et al. Energy Procurement and Retail Pricing for Electricity Retailers via Deep Reinforcement Learning with Long Short-term Memory
Xu et al. Deep reinforcement learning for competitive DER pricing problem of virtual power plants
Schmid et al. Distributed emergent agreements with deep reinforcement learning
Wang Conjectural variation-based bidding strategies with Q-learning in electricity markets
Ji et al. Game-theoretic applications for decision-making behavior on the energy demand side: a systematic review
Wu et al. Intelligent strategic bidding in competitive electricity markets using multi-agent simulation and deep reinforcement learning
Yang et al. Reinforcement Learning-Based Market Game Model Considering Virtual Power Plants
Pradhan et al. Rise of Algorithmic Trading in Today’s Changing Electricity Market

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant