CN107948083B - SDN data center congestion control method based on reinforcement learning - Google Patents


Info

Publication number
CN107948083B
Authority
CN
China
Prior art keywords
flow
congestion control
data center
matrix
state
Prior art date
Legal status
Active
Application number
CN201711081371.7A
Other languages
Chinese (zh)
Other versions
CN107948083A (en)
Inventor
金蓉
王伟明
李姣姣
庹鑫
Current Assignee
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201711081371.7A
Publication of CN107948083A
Application granted
Publication of CN107948083B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2425 Traffic characterised by specific attributes for supporting services specification, e.g. SLA
    • H04L47/25 Flow control; Congestion control with rate being modified by the source upon detecting a change of network conditions

Abstract

The invention discloses an SDN data center congestion control method based on reinforcement learning. Against the network background of SDN, the method proposes a flow-based congestion control idea and introduces the Q-learning algorithm from reinforcement learning to allocate flow rates intelligently and globally, keeping the utilization of the network's data links as high as possible while the whole network avoids congestion, thereby achieving congestion control of the data center. First, a quintuple model is built to describe the problem; then an improved Q-learning algorithm is proposed and the Q matrix is trained; finally, congestion control is performed for each flow request using the trained Q matrix. The invention provides an adaptive SDN data center congestion control method with good control effect, good stability and high efficiency, and the control algorithm is easy to implement. It offers an intelligent, reinforcement-learning-based solution to the congestion control problem of SDN data centers.

Description

SDN data center congestion control method based on reinforcement learning
Technical Field
The invention relates to the technical field of network communication, and in particular to a reinforcement-learning-based congestion control method for an SDN (Software Defined Network) Data Center Network (DCN).
Background
In recent years, cloud computing has become a hotspot and a clear trend in information infrastructure, and the user base of many new Internet online services (such as search, social networking and instant messaging) is growing rapidly. Throughout this rapid development of cloud computing and online services, the data center, as the underlying information infrastructure, has remained central. With the growth of these services and the adoption of new technologies, data centers are undergoing significant changes, which bring new challenges and problems to data center networks. Emerging services require large amounts of one-to-many and many-to-many communication between servers, so traffic inside the data center is increasing dramatically and exhibits characteristics that differ from Internet traffic. Under current technical conditions, data center networks are frequently congested, which increases packet loss and delay, reduces throughput, and seriously affects service performance and quality of service. To guarantee service performance and quality, traffic management and optimization in data center networks has become an important and urgent problem.
Reinforcement learning has developed from theories such as animal learning, stochastic approximation and optimal control, and is an online learning technique that requires no teacher. By learning a mapping from environment states to actions, the agent selects actions that maximize the reward received from the environment, so that the environment's evaluation of the learning system (or of the running performance of the whole system) is, in some sense, as favorable as possible. The Q-learning algorithm is a model-free reinforcement learning algorithm: it uses the discounted reward, i.e. the Q value of each state-action pair, as the function being estimated, considers the available actions in every learning iteration, and its learning process is guaranteed to converge. Q-learning can learn without prior knowledge and therefore has broad application prospects for complex optimization and decision problems.
The invention provides an SDN data center congestion control method based on reinforcement learning. Against the network background of SDN, it proposes a flow-based congestion control idea and introduces the Q-learning algorithm from reinforcement learning to allocate flow rates intelligently and globally, keeping the utilization of the network's data links as high as possible while the whole network avoids congestion, thereby achieving congestion control of the data center. The invention provides an intelligent, reinforcement-learning-based solution to the congestion control problem of SDN data centers; it can optimize the use of data center network resources and improve network throughput, service performance and quality of service, thereby supporting the healthy development of emerging Internet and cloud computing services, promoting energy saving in data centers, and contributing to green communication.
Disclosure of Invention
The invention aims to solve the congestion control problem of a data center network based on an SDN architecture, and provides a reinforcement-learning-based congestion control method for the SDN data center network.
The purpose of the invention is achieved by the following technical scheme. An SDN data center congestion control method based on reinforcement learning comprises the following steps:
Step 1: Introduce reinforcement learning into the SDN-based data center to solve the congestion control problem. The SDN-based data center congestion control problem is first described as the quintuple <F, S, R, A, Q>.
Reinforcement learning is a teacher-free online learning technique: an agent senses the state information of the environment, selects an optimal action, which changes the state and yields a reward value, and updates its evaluation function; after completing one learning pass it enters the next round of training, iterating until the termination condition of the whole learning process is met. The SDN-based data center congestion control problem considered here is a flow-based congestion control problem, i.e. rates are allocated to all flows globally so that the rate requests of the flows are satisfied as far as possible while the whole data center network is guaranteed not to become congested.
The quintuple is <F, S, R, A, Q>. F (flow) represents the flows whose rates are to be allocated, a queue of length N; S (link state) represents the state of all links, a vector of length M; R (reward) represents the matrix of reward values obtained after an action is selected; A (action) represents the behavior of allocating rates to flows according to the link demand, a vector of length N; Q (Q-matrix) represents the Q matrix obtained through training, which encodes the knowledge the agent has learned from experience.
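For illustration only, the quintuple could be held in data structures like the following minimal Python sketch; the names, sizes and rate set are assumptions for a small example and are not prescribed by the method.

```python
import numpy as np

N = 10                    # number of flows to allocate (queue length of F, assumed)
M = 5                     # number of links (length of the state vector S, assumed)
LEVELS = 8                # quantization levels per link (assumed)
RATES = [1, 2, 3, 4, 5]   # candidate rates an action may assign to a flow (assumed)

# F: the flow queue; each flow records the links it traverses (filled in later).
F = [{"id": i, "links": ()} for i in range(N)]

# S: state of all links, one quantized occupancy level per link.
S = np.zeros(M, dtype=int)

# A: one rate choice per flow, filled in as flows are processed.
A = np.zeros(N, dtype=int)

# R and Q: reward and learned-value tables. A dense 8^5 x 8^5 matrix indexed by
# (current state, next state) is possible but large, so this sketch keeps them
# as dictionaries keyed by (state tuple, action).
R = {}
Q = {}
```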
Step 2: According to the problem requirements, an improved Q-learning algorithm is proposed and the Q matrix is trained.
The Q-learning algorithm is one of the classic reinforcement learning algorithms. Each state-action pair has an associated Q value; the algorithm selects the action to execute according to these Q values, and the optimal policy is obtained by estimating the value function of the state-action pairs.
Training the Q matrix with the improved Q-learning algorithm comprises the following steps:
2-1. Give the reward matrix R according to prior knowledge, and initialize the Q matrix.
2-2. Improve the action-selection method of the Q-learning algorithm: the classical Q-learning algorithm selects the action with the maximum reward in the R matrix according to the current state only, whereas the improved algorithm selects the action with the maximum reward in the R matrix by considering both the current state and the path traversed by the current flow.
2-3. Execute the action, observe the reward and the new link state, and iteratively update the Q value Q(S, a) according to Q(S, a) ← Q(S, a) + α[r + γ·max Q(S′, a′) − Q(S, a)].
This is the Q-learning iterative update formula. Here Q(S, a) is the Q value of executing action a in the current state S, Q(S′, a′) is the Q value of executing action a′ in the next state S′, r is the reward obtained after executing action a in state S, γ is the discount factor, α is the learning rate, γ·max Q(S′, a′) is the discounted value of the best subsequent state-action pair, and r + γ·max Q(S′, a′) − Q(S, a) is the temporal-difference error used to improve the current estimate.
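Written out in code, the update is only a few lines. The sketch below assumes states and actions are hashable keys of a Q dictionary; the values of α and γ are illustrative, since the text does not fix them here.

```python
def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(S,a) <- Q(S,a) + alpha * [r + gamma * max Q(S',a') - Q(S,a)]."""
    q_sa = Q.get((s, a), 0.0)
    # Best discounted value attainable from the next state S'.
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    td_error = r + gamma * best_next - q_sa          # temporal-difference error
    Q[(s, a)] = q_sa + alpha * td_error
    return Q[(s, a)]
```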
2-4. Repeat this training process until S reaches a final state, yielding the trained Q matrix.
Step 3: Perform congestion control for the specific flow requests using the Q matrix trained in step 2.
The congestion control procedure comprises the following steps:
3-1. Determine the N specific flow requests and the quantization standard for the occupied bandwidth of the links.
3-2. Input a flow request, obtain the current link state, consider the links traversed by the current flow, and select and execute the action with the maximum reward according to the Q matrix obtained by training, i.e. select a rate for the current flow. Then update the current link state and record the rate allocated to the current flow.
3-3. Judge whether all N flows have been allocated. If not, return to step 3-2 and continue the loop until every flow has been allocated a rate.
3-4. Output the mapping table of the N flows to their rates, thereby performing global congestion control on the data center. A minimal code sketch of this allocation loop is given below.
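This sketch assumes the trained Q dictionary from step 2; get_link_state, allowed_actions and apply_rate are hypothetical helpers standing in for the controller operations described above.

```python
def allocate_rates(flows, Q, get_link_state, allowed_actions, apply_rate):
    """Assign a rate to every flow using the trained Q matrix (steps 3-1 to 3-4)."""
    mapping = {}                                     # flow id -> allocated rate
    for flow in flows:                               # steps 3-2 / 3-3: loop over all N flows
        s = get_link_state()                         # current quantized link state
        candidates = allowed_actions(s, flow)        # actions restricted to the flow's links
        rate = max(candidates, key=lambda a: Q.get((s, a), 0.0))   # highest learned value
        apply_rate(flow, rate)                       # update the links traversed by the flow
        mapping[flow["id"]] = rate
    return mapping                                   # step 3-4: flow -> rate mapping table
```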
The invention has the following beneficial effects: it provides an intelligent, reinforcement-learning-based solution to the congestion control problem of SDN data centers, which can optimize the use of data center network resources and improve network throughput, service performance and quality of service, thereby supporting the healthy development of emerging Internet and cloud computing services, promoting energy saving in data centers, and contributing to green communication.
Drawings
FIG. 1 is a system architecture diagram.
Fig. 2 is a data center network topology diagram adopted by the embodiment.
FIG. 3 is a flow chart of a training algorithm.
Fig. 4 is a flow chart of a congestion control method.
Fig. 5 is a diagram showing the variation of the bandwidth of each link in the embodiment.
Fig. 6 is a rate allocation diagram of flows in an embodiment.
Detailed Description
The invention is further described below with reference to the figures and examples.
The invention provides an SDN data center congestion control method based on reinforcement learning, which comprises the following steps:
Step 1: Introduce reinforcement learning into the SDN-based data center to solve the congestion control problem. The data center congestion control problem is first described as the quintuple <F, S, R, A, Q>.
Reinforcement learning is a teacher-free online learning technique: an agent senses the state information of the environment, selects an optimal action, which changes the state and yields a reward value, and updates its evaluation function; after completing one learning pass it enters the next round of training, iterating until the termination condition of the whole learning process is met. The reinforcement-learning-based congestion control problem of the SDN data center is a flow-based congestion control problem, i.e. rates are allocated to all flows globally so that the rate requests of the flows are satisfied as far as possible while the whole data center network is guaranteed not to become congested.
The quintuple is <F, S, R, A, Q>. F (flow) represents the flows whose rates are to be allocated, a queue of length N; S (link state) represents the state of all links, a vector of length M; R (reward) represents the matrix of reward values obtained after an action is selected; A (action) represents the behavior of allocating rates to flows according to the link demand, a vector of length N; Q (Q-matrix) represents the Q matrix obtained through training, which encodes the knowledge the agent has learned from experience.
Step 2: According to the problem requirements, an improved Q-learning algorithm is proposed and the Q matrix is trained.
The Q-learning algorithm is one of the classic reinforcement learning algorithms. Each state-action pair has an associated Q value; the algorithm selects the action to execute according to these Q values, and the optimal policy is obtained by estimating the value function of the state-action pairs.
Step 2 specifically comprises the following steps:
2-1. Give the reward matrix R according to prior knowledge, and initialize the Q matrix.
2-2. Improve the action-selection method of the Q-learning algorithm: the classical Q-learning algorithm selects the action with the maximum reward in the R matrix according to the current state only, whereas the improved algorithm selects the action with the maximum reward in the R matrix by considering both the current state and the path traversed by the current flow.
2-3. Execute the action, observe the reward and the new link state, and iteratively update the Q value Q(S, a) according to Q(S, a) ← Q(S, a) + α[r + γ·max Q(S′, a′) − Q(S, a)].
This is the Q-learning iterative update formula. Here Q(S, a) is the Q value of executing action a in the current state S, Q(S′, a′) is the Q value of executing action a′ in the next state S′, r is the reward obtained after executing action a in state S, γ is the discount factor, α is the learning rate, γ·max Q(S′, a′) is the discounted value of the best subsequent state-action pair, and r + γ·max Q(S′, a′) − Q(S, a) is the temporal-difference error used to improve the current estimate.
2-4. Repeat these steps until S reaches a final state, obtaining the trained Q matrix.
Step 3: Perform congestion control for the specific flow requests using the Q matrix trained in step 2.
The congestion control procedure comprises the following steps:
3-1. Determine the N specific flow requests and the quantization standard for the occupied bandwidth of the links.
3-2. Input a flow request, obtain the current link state, consider the links traversed by the current flow, and select and execute the action with the maximum reward according to the Q matrix obtained by training, i.e. select a rate for the current flow. Then update the current link state and record the rate allocated to the current flow.
3-3. Judge whether all N flows have been allocated. If not, return to step 3-2 and continue the loop until every flow has been allocated a rate.
3-4. Output the mapping table of the N flows to their rates, thereby performing global congestion control on the data center.
In the quintuple description, the congestion control problem of a data center with a software-defined network architecture is expressed as the quintuple <F, S, R, A, Q>: the flows whose rates are to be allocated are described as F, the link state as S, the rewards as the R matrix, the rate allocation as action A, and the training result of the agent is recorded in the Q matrix.
Examples
To help those skilled in the art understand and implement the invention, the technical scheme is further described below with reference to the accompanying drawings, and a specific embodiment of the method is given.
The invention introduces the reinforcement learning method into the software-defined-network-based data center to solve the congestion control problem. Fig. 1 is the system architecture diagram; the basic functions of its modules are: (1) perception module: collects the current link state information of the data center network; (2) learning module: learns from the received link state information, or derives quantitative information from related empirical knowledge, to provide a decision basis for the decision module; (3) decision module: formulates the corresponding control strategy according to the information provided by the learning module; (4) execution module: executes the control strategy made by the decision module. The learning module of this embodiment adopts an improved Q-learning algorithm: the classical Q-learning algorithm selects the action with the maximum reward in the R matrix according to the current state only, and the improvement is that the action with the maximum reward is selected in the R matrix by considering both the current state and the path traversed by the current flow. The Q matrix obtained by training in the learning module is provided to the decision module, which allocates a rate to each flow according to the Q matrix to realize congestion control. A skeleton of these four modules is sketched below.
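The four modules could be organized as in the following skeleton; the class and method names are illustrative and not taken from the patent.

```python
class PerceptionModule:
    """Collects the current link state information of the data center network."""
    def collect_link_state(self, network):
        raise NotImplementedError

class LearningModule:
    """Improved Q-learning: selects actions from both the current state and the flow's path."""
    def train_q_matrix(self, link_states, reward_matrix):
        raise NotImplementedError

class DecisionModule:
    """Allocates a rate to each flow according to the trained Q matrix."""
    def decide_rates(self, q_matrix, flows):
        raise NotImplementedError

class ExecutionModule:
    """Pushes the decided rate allocation into the network (e.g. as flow rules)."""
    def enforce(self, rate_mapping):
        raise NotImplementedError
```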
Fig. 2 is a network topology diagram of an SDN data center employed in the embodiment. The whole network has 5 links, and the link bandwidth is 8G. The length of the flow queue adopted in the embodiment is 10.
The specific congestion control method comprises the following steps:
Step 1: The data center congestion control problem is described as the quintuple <F, S, R, A, Q>.
In this quintuple, the flows whose rates are to be allocated are described as F, the link state as S, the rewards as the R matrix, the rate allocation as action A, and the training result of the agent is recorded in the Q matrix.
F (flow) represents the flows to be allocated. In this embodiment the queue of flows to be allocated bandwidth has length 10; there are 5 links in total and each flow occupies two links. The flows can be expressed as:
F = (flow_1, flow_2, ..., flow_i, ..., flow_10)   (1)
flow_i in formula (1) takes values as follows:
flow_i ∈ {f_jk}, where j, k ∈ {1, 2, ..., 5}   (2)
f_jk in formula (2) indicates that flow_i occupies the two links j and k.
S (link state) represents the state of all links and is a vector of length 5. It can be expressed as:
S = (ls_1, ls_2, ..., ls_i, ..., ls_5)   (3)
ls_i in formula (3) takes values as follows:
ls_i ∈ {g_j}, where j ∈ {1, 2, ..., 8}   (4)
g_j in formula (4) represents the quantization level of the used bandwidth of the link.
In this embodiment we divide the occupied bandwidth of a link into 8 levels, where B is the link bandwidth, i.e. the maximum transmission rate; the link state is discretized into the 8 levels g1-g8 as shown in the quantization table of the original (reproduced there only as images).
Further, the link bandwidth B in this embodiment is 40G. A code sketch of this quantization is given below.
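Assuming the 8 levels split the bandwidth B into equal-width bins (the exact table survives only as images in the original), the quantization could be sketched as:

```python
import math

B = 40        # link bandwidth in G, as stated for this embodiment
LEVELS = 8    # quantization levels g1..g8

def quantize(used_bandwidth, bandwidth=B, levels=LEVELS):
    """Map the used bandwidth of a link to a level 1..8 (equal-width bins assumed)."""
    if used_bandwidth <= 0:
        return 1
    return min(levels, math.ceil(used_bandwidth / (bandwidth / levels)))

# Example: the initial loads [18, 20, 18, 14, 29] map to levels [4, 4, 4, 3, 6].
print([quantize(x) for x in [18, 20, 18, 14, 29]])
```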
A (action) represents the behavior of allocating a rate to a flow according to the link demand, and is a vector of length 10. It can be expressed as:
A = (a_1, a_2, ..., a_i, ..., a_10)   (5)
a_i in formula (5) takes values as follows:
a_i ∈ {1, 2, 3, 4, 5}
R (reward) is the matrix of reward values obtained after an action is selected. Its rows are indexed by the current state S and its columns by the next state S′; it is therefore a matrix with 8^5 rows and 8^5 columns.
R = (r_ij)   (6)
r_ij in formula (6) represents the reward value obtained when a certain action is executed in state S_i and the system transitions to state S_j.
Various schemes are possible for determining the reward. In this embodiment we adopt the following scheme: a unimodal function F = min(i/7, 100·(35 − i)) is used, where i denotes the occupied bandwidth of the link. The reward then distinguishes two cases, given in the original as formula (7) (reproduced there only as an image); an illustrative sketch follows below.
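Since the two cases of formula (7) survive only as an image, the following is a hedged illustration rather than the patent's definition: it assumes the reward is the unimodal F evaluated at the link's new occupancy when the allocation fits, and a fixed penalty otherwise; the reading of F itself from the garbled text is also an assumption.

```python
def f_unimodal(i):
    """Assumed reading of the unimodal function: F = min(i/7, 100*(35 - i)),
    where i is the occupied bandwidth of the link (in G)."""
    return min(i / 7, 100 * (35 - i))

def reward(new_occupancy, capacity=40):
    """Illustrative two-case reward; formula (7) itself is not reproduced in the text."""
    if new_occupancy <= capacity:      # the allocation still fits on the link
        return f_unimodal(new_occupancy)
    return -100.0                      # assumed penalty for overloading the link
```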
Q (Q-matrix) represents the Q matrix obtained through training, encoding the knowledge the agent has learned from experience. The Q matrix has the same order as the R matrix; its rows represent the current state S and its columns the next state S′ reached after the corresponding action is taken.
Q = (q_ij)   (8)
q_ij in formula (8) represents the knowledge learned by the agent for the transition from state S_i to state S_j.
Step 2: According to the problem requirements, an improved Q-learning algorithm is proposed and the Q matrix is trained, yielding the trained Q matrix.
In the architecture of the Q-learning-based congestion control system shown in Fig. 1, the overall process consists of the following parts: the detector collects flow information and feeds it into the state-detection processor for analysis; all link state information is input into the Q-learning optimization control decision maker; the Q values of the strategy are obtained in the Q-learning control decision maker; the strategy decision maker then derives a better flow allocation strategy; and through continuous iteration, allocation strategies are found for all flows on all links, thereby realizing congestion control of the whole data center.
According to the problem requirements, an improved Q-learning algorithm is used to train the Q matrix. The classical Q-learning algorithm selects the action with the maximum reward in the R matrix according to the current state only; the improved algorithm selects the action with the maximum reward in the R matrix by considering both the current state and the path traversed by the current flow. The improved algorithm is described as follows:
[The listing of the improved Q-learning training algorithm is given in the original only as an image; see also the training flow in Fig. 3.]
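Because the algorithm listing is only available as an image, the following sketch reconstructs the training loop from the textual description of steps 2-1 to 2-4; the step environment function and the per-flow action sets are assumptions.

```python
def train_q(R, flows, initial_state, step, alpha=0.1, gamma=0.9, episodes=1000):
    """Improved Q-learning training (steps 2-1 to 2-4), reconstructed from the description.

    R[(s, a)]       -- reward entries built from prior knowledge (step 2-1)
    flow["actions"] -- rate actions on the links this flow traverses (assumed field)
    step(s, a, f)   -- hypothetical environment step returning (next_state, reward, done)
    """
    Q = {}                                                   # step 2-1: initialize Q
    for _ in range(episodes):
        s = initial_state
        for flow in flows:
            # Step 2-2: combine the current state with the current flow's path,
            # then pick the action with the maximum reward in R.
            candidates = flow["actions"]
            a = max(candidates, key=lambda act: R.get((s, act), 0.0))
            s_next, r, done = step(s, a, flow)               # step 2-3: execute and observe
            best_next = max((Q.get((s_next, act), 0.0) for act in candidates), default=0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
            s = s_next
            if done:                                         # step 2-4: stop at a final state
                break
    return Q
```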
Fig. 3 is the Q-matrix training flow chart. Training comprises the following steps:
2-1. Give the reward matrix R according to step 1 and initialize the Q matrix. The initial load of the 5 links is [18, 20, 18, 14, 29].
2-2. Select an action according to the improved Q-learning algorithm.
2-3. Execute the action, observe the reward and the new link state, and iteratively update the Q value Q(S, a) according to Q(S, a) ← Q(S, a) + α[r + γ·max Q(S′, a′) − Q(S, a)].
This is the Q-learning iterative update formula. Here Q(S, a) is the Q value of executing action a in the current state S, Q(S′, a′) is the Q value of executing action a′ in the next state S′, r is the reward obtained after executing action a in state S, γ is the discount factor, α is the learning rate, γ·max Q(S′, a′) is the discounted value of the best subsequent state-action pair, and r + γ·max Q(S′, a′) − Q(S, a) is the temporal-difference error used to improve the current estimate.
2-4. Iterate this loop until S reaches a final state.
Step 3: Perform congestion control for the specific flow requests using the Q matrix trained in step 2.
The flow chart of the congestion control procedure is shown in Fig. 4; it comprises the following steps:
3-1. Given the 5 links of the data center network, determine the quantization standards g1-g8 for the occupied bandwidth of the links. There are 10 flow requests to be allocated; their occupied links and bandwidth requirements are as follows:
Flow                    flow1  flow2  flow3  flow4  flow5  flow6  flow7  flow8  flow9  flow10
Occupied links          l1,l2  l1,l3  l1,l4  l1,l5  l2,l3  l2,l4  l2,l5  l3,l4  l3,l5  l4,l5
Required bandwidth (G)  5      5      5      5      5      5      5      5      5      5
3-2. Input the 10 flow requests and set the initial load of the 5 links to [18, 20, 18, 14, 29]. Considering the links traversed by the current flow, select and execute the action with the maximum reward according to the Q matrix obtained by training, i.e. select a rate for the current flow. Then update the current link state and record the rate allocated to the current flow.
3-3. Judge whether all 10 flows have been allocated. If not, return to step 3-2 and continue the loop until every flow has been allocated a rate.
3-4. Output the mapping table of the 10 flows to their rates, as follows:
Flow                     flow1  flow2  flow3  flow4  flow5  flow6  flow7  flow8  flow9  flow10
Occupied links           l1,l2  l1,l3  l1,l4  l1,l5  l2,l3  l2,l4  l2,l5  l3,l4  l3,l5  l4,l5
Required bandwidth (G)   5      5      5      5      5      5      5      5      5      5
Allocated bandwidth (G)  4      4      4      1      5      1      1      2      1      3
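As a quick sanity check of the table above (an illustrative script, not part of the patent), adding the allocated bandwidths to the initial loads [18, 20, 18, 14, 29] shows every link staying within the 40G bandwidth:

```python
load = {1: 18, 2: 20, 3: 18, 4: 14, 5: 29}        # initial load per link (G)
allocations = [                                    # (occupied links, allocated bandwidth) per flow
    ((1, 2), 4), ((1, 3), 4), ((1, 4), 4), ((1, 5), 1), ((2, 3), 5),
    ((2, 4), 1), ((2, 5), 1), ((3, 4), 2), ((3, 5), 1), ((4, 5), 3),
]
for links, bw in allocations:
    for link in links:
        load[link] += bw
print(load)   # {1: 31, 2: 31, 3: 30, 4: 24, 5: 35} -- every link stays at or below 35G, within 40G
```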
Fig. 5 shows how the bandwidth of each link changes after each allocation. Abscissa 0 represents the initial bandwidth occupancy [18, 20, 18, 14, 29]; abscissa 1 represents the bandwidth occupancy of each link after the first flow has been allocated bandwidth (in this embodiment the first flow occupies link 1 and link 2, and the allocated rate is 3G). As can be seen from Fig. 5, after the rate allocation of all 10 flows is completed, no link is congested. The method therefore achieves effective congestion control.
Fig. 6 shows the rate allocation of the flows: 1 flow is allocated its full 5G link demand, 3 flows are allocated 4G, 1 flow 3G, 1 flow 2G, and 4 flows only 1G. The bandwidth requirement of each flow is satisfied as far as possible while the data center network remains free of congestion.
Without the congestion control method of the invention, allocating bandwidth purely on demand would give every flow 5G, but the actual link bandwidth might not satisfy every flow, resulting in congestion.
The congestion control method of the invention has been described above with reference to a specific embodiment, which shows that the proposed data center congestion control method is effective. The method performs flow-based congestion control for the SDN data center network, with the controller allocating rates to flows globally, so that congestion is avoided while the bandwidth utilization is kept as high as possible.

Claims (2)

1. An SDN data center congestion control method based on reinforcement learning is characterized by comprising the following steps:
step 1: introducing reinforcement learning into a data center based on a software-defined network, and describing the SDN-based data center congestion control problem as a quintuple <F, S, R, A, Q>; wherein F represents the flows to be allocated, with queue length N; S represents the state of all links and is a vector of length M; R represents the matrix of reward values obtained after selecting an action; A represents the behavior of allocating rates to flows according to the link demand, and is a vector of length N; Q represents the Q matrix obtained through training, representing the knowledge the agent has learned from experience;
step 2: training a Q matrix based on an improved Q-learning algorithm; the method specifically comprises the following steps:
2-1. according to prior knowledge, giving the reward matrix R and initializing the Q matrix;
2-2. improving the action-selection method of the Q-learning algorithm, so that the algorithm considers both the current state and the path traversed by the current flow and selects the action with the maximum reward in the R matrix;
2-3. executing the action, observing the reward and the new link state, and iteratively updating the Q value Q(S, a) according to Q(S, a) ← Q(S, a) + α[r + γ·max Q(S′, a′) − Q(S, a)]; wherein Q(S, a) is the Q value after executing action a in the current state S, Q(S′, a′) is the Q value after executing action a′ in the next state S′, r is the reward after executing action a in the current state S, γ is the discount factor, α is the learning rate, γ·max Q(S′, a′) is the discounted value of the best subsequent state-action pair, and r + γ·max Q(S′, a′) − Q(S, a) is the temporal-difference error used to improve the current estimate;
2-4. cyclically executing the Q matrix training process until S reaches a final state, obtaining the trained Q matrix;
step 3: performing congestion control for the specific flow requests using the Q matrix obtained by training in step 2;
the congestion control in step 3 comprises the following steps:
3-1. determining the number N of flow requests and the quantization standard of the link utilization;
3-2. inputting a flow request, obtaining the current link state, considering the links traversed by the current flow, and selecting and executing the action with the maximum reward according to the Q matrix obtained by training, i.e. selecting a rate for the current flow; then updating the current link state and recording the rate allocated to the current flow;
3-3. judging whether all N flows have been allocated; if not, returning to step 3-2 and continuing the loop until all flows have been allocated rates; if the allocation is finished, executing step 3-4;
3-4. outputting the mapping table of the N flows to their rates, thereby performing global congestion control on the data center.
2. The reinforcement-learning-based SDN data center congestion control method of claim 1, wherein the SDN-based data center congestion control problem is a flow-based congestion control problem, namely rates are allocated to all flows globally, so that the rate requests of the flows are satisfied as far as possible and the whole data center network is guaranteed not to become congested.
CN201711081371.7A 2017-11-07 2017-11-07 SDN data center congestion control method based on reinforcement learning Active CN107948083B (en)

Priority Applications (1)

Application Number  Priority Date  Filing Date  Title
CN201711081371.7A  2017-11-07  2017-11-07  SDN data center congestion control method based on reinforcement learning

Publications (2)

Publication Number  Publication Date
CN107948083A (en)  2018-04-20
CN107948083B (en)  2021-03-30






Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant