CN112968834B - SDN route convergence method under reinforcement learning based on network characteristics - Google Patents
- Publication number: CN112968834B (application CN202110145046.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- network
- theta
- reinforcement learning
- agent
- Prior art date
- Legal status (an assumption, not a legal conclusion; no legal analysis has been performed)
- Active
Classifications
- H—ELECTRICITY → H04—ELECTRIC COMMUNICATION TECHNIQUE → H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION → H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/122—Shortest path evaluation by minimising distances, e.g. by selecting a route with minimum number of hops
- H04L45/22—Alternate routing
Abstract
The invention discloses an SDN route convergence method under reinforcement learning based on network characteristics. Reinforcement learning is applied to SDN route convergence, with the Q-learning algorithm as the reinforcement learning model, and a direction factor θ is defined from the input network topology to describe the direction of each transfer in a path. The reinforcement learning agent's exploration is guided by the θ value during path transfer. In early episodes, the agent is allowed to select actions whose θ value is negative during the exploration phase; as episodes iterate, the probability of the agent exploring actions with a negative θ value decreases. The agent thus gains sufficient experience from the environment while exploration efficiency improves and loop formation during training is reduced. By exploiting reinforcement learning's continuous interaction with the network environment and its strategy adjustment, the method can find the optimal path during route convergence, which traditional route convergence algorithms cannot.
Description
Technical Field
The invention relates to the field of network communication technology and reinforcement learning, in particular to an SDN route convergence method under reinforcement learning based on network characteristics.
Background
The reinforcement learning process can be summarized as an agent learning a mapping from environmental states to actions such that the accumulated reward is maximized. In route planning, the agent receives the current state and reward information from the routing system; the action selected by the agent can be regarded as the input the routing system receives from the agent, and the action and reward in the current routing system influence the agent's later action selection over a long horizon. Across the whole route planning system, the agent must learn the optimal actions to maximize the accumulated reward, and the actions it selects constitute the optimal path for the traffic. Among reinforcement learning methods, Q-learning does not depend on an environment model, and in a finite Markov decision process it can find the optimal policy; what the agent needs to do is keep trying in the system in order to learn a policy. A policy is evaluated by the accumulated reward obtained after executing it, and the best policy selects the action with the maximum Q value in each state. Exploration means the agent selects actions it has not performed before, while exploitation means the agent takes the currently optimal action according to previously learned experience. In this invention, exploration selects links not chosen before, searching for more possibilities; exploitation refines the known planned route by selecting the currently chosen link.
If shortest path planning is realized with reinforcement learning, the agent's exploration tends toward nodes of smaller and smaller depth. In a network, besides link length, the link bandwidth, delay, hop count and so on can serve as dominant characteristics, and several characteristics can be combined into a new one by weighted addition. Guided by the dominant characteristics of the network topology, the agent can move from highly random behavior in the exploration phase to efficient exploration, so the learning network converges faster. To avoid convergence to a sub-optimal solution, the agent's exploration behavior is allowed to be highly random in the initial phase of training. As training steps increase, by increasing the gradient difference of each link's dominant characteristics, the agent transitions from highly random exploration to highly efficient exploration, which improves convergence speed while still guaranteeing convergence to the optimal solution.
Disclosure of Invention
The invention combines reinforcement learning to provide an SDN route convergence method under reinforcement learning based on network characteristics, solving the problems that current conventional route convergence algorithms easily form loops and cannot find the optimal path.
The technical scheme adopted by the invention for solving the technical problem is as follows: an SDN route convergence method under reinforcement learning based on network characteristics comprises the following steps:
step 1: establishing an SDN network area topological graph, and dividing network areas in a fine-grained manner;
step 2: defining a direction factor θ ∈ {-1, 0, 1} and setting a source node; when the transfer from one node to another in the topological graph has θ = -1, it moves closer to the source node; when θ = 1, it moves farther from the source node; when θ = 0, the two nodes have the same shortest distance to the source node; a network topology hierarchy graph is constructed according to the θ values between different nodes in the topological graph and their relationship to the source node, and all nodes within each layer have the same shortest distance to the source node;
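For illustration, the layering by shortest distance and the direction factor θ can be computed by breadth-first search from the source node. A minimal Python sketch (function names are ours, not from the patent):

```python
from collections import deque

def build_hierarchy(adj, source):
    """BFS from the source node: level[v] = shortest hop distance to source."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def theta(level, u, v):
    """Direction factor for the transfer u -> v."""
    if level[v] < level[u]:
        return -1   # moving closer to the source
    if level[v] > level[u]:
        return 1    # moving farther from the source
    return 0        # same layer: equal shortest distance to the source
```

All nodes with the same `level` value form one layer of the hierarchy graph.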
and step 3: using the Q-learning algorithm as the reinforcement learning model, inputting the network topology hierarchy graph obtained in step 2 into the model to guide the agent's exploration. When the height difference between layers tends to 0, the nodes are in the same layer; transfers between nodes within a layer are little affected by the layering, and the agent exhibits a random exploration state. As the height difference between layers keeps increasing, the agent is guided to explore toward lower layers, exhibiting an efficient exploration state.
The following formula is set:
h(θ) = f(θ, t)
wherein t represents the episode iteration count and step is a set threshold. The function f takes the absolute value of θ according to the episode iteration progress, and h(θ) determines the specific value of θ according to the state of the corresponding action. As episodes iterate, among the selectable actions the θ value corresponding to actions close to the source node becomes smaller, and the θ value corresponding to actions far from the source node becomes larger; an interval D is therefore defined:
The interval D ranges from 0 to the sum of the θ values corresponding to all currently selectable actions; it is divided into n subintervals, where n is the number of currently selectable actions, and the length of each subinterval is the θ value of the corresponding action.
η=random(D)
The random number η is obtained by sampling uniformly from the interval D.
Calculated by the following formula:
the function g (D, η) is the action corresponding to the interval D where the random number η is located, i.e. the agent searches for the action a to be selected.
The strategy formula of reinforcement learning based on the network area characteristics is as follows:
the epsilon-greedy policy balances exploration and utilization based on an exploration factor epsilon (epsilon 0, 1). Generating a random number sigma (sigma belongs to [0,1]), and when sigma is less than or equal to epsilon, using a random strategy by agent to explore the environment through randomly selecting actions to obtain experience; when σ > ε, agent uses a greedy strategy to leverage the experience that has been gained.
When agent generates state transition, inputting the current state s and the selected action a into a function R, generating reward to evaluate the state transition, and setting a reward function:
Rt(s,a)=αB-βt+γδ(s_-d)-δ
R is the reward obtained at node i for selecting a link to node j. α, β, γ and δ are four positive parameters weighing the four reward components. B is the residual bandwidth of the link corresponding to the selected action, and t is the delay of that link. δ(s_ − d) is the stimulus function, where s_ denotes the state reached after selecting action a in state s and d is the destination state.
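A hedged sketch of this reward function, interpreting the stimulus δ(s_ − d) as an indicator that fires only when the next state is the destination (this interpretation, and the function name, are assumptions; the default parameters follow the embodiment):

```python
def reward(bandwidth, delay, next_state, destination,
           alpha=0.4, beta=0.3, gamma=0.1, delta=0.2):
    """R_t(s, a) = alpha*B - beta*t + gamma*delta(s_ - d) - delta, with
    delta(s_ - d) read as 1 when next_state == destination, else 0."""
    stimulus = 1.0 if next_state == destination else 0.0
    return alpha * bandwidth - beta * delay + gamma * stimulus - delta
```

High residual bandwidth raises the reward, delay lowers it, each hop costs the constant δ, and reaching the destination adds the γ bonus.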
And training the Q value table according to the set reward function, and obtaining a path Routing through the trained Q value table, wherein the path is the converged optimal path after the link fails.
Further, the fine-grained network area division is specifically as follows: a network connection matrix is constructed from the SDN network topology, containing the adjacency relation among all network nodes. The network connection matrix and the number of nodes n in the network topology are input into a hub node election algorithm, which records each node's connection count, expressed as follows:
wherein node_link[i] is the link count of node i and T[i][j] indicates a link of node i; the node with the highest link count is elected as the hub node. The hub node and its adjacent nodes form one divided network area.
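One plausible reading of the hub election, with node_link[i] taken as the row sum of the connection matrix (an assumption, since the original formula image is not reproduced here):

```python
def elect_hubs(T, n):
    """Count each node's links from the connection matrix T and elect the
    node with the highest degree as the hub; the hub plus its neighbors
    form one divided network area."""
    node_link = [sum(T[i][j] for j in range(n)) for i in range(n)]
    hub = max(range(n), key=lambda i: node_link[i])
    region = {hub} | {j for j in range(n) if T[hub][j] == 1}
    return hub, region
```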
Further, the process of training the Q value table with the Q-learning algorithm is specifically as follows:
and setting the maximum step number of the single training.
(1) Initializing a Q value table and a reward function R;
(2) adopting a strategy based on network area characteristics, and selecting an action a;
(3) executing action a, transferring to a state s _, calculating a reward value by using a reward function R, and updating a Q value table;
(4) Judge whether s_ is the destination node. If not, let s = s_ and return to (2). If s_ is the destination node, the training ends.
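Steps (1)–(4) can be sketched as a tabular Q-learning loop. All names here are illustrative, and the per-step rewards (-1 per hop, +10 at the destination) are simple placeholders for the patent's R function:

```python
import random

def train(adj, source, dest, alpha=0.8, gamma=0.6, epsilon=0.3,
          episodes=300, max_steps=50, rng=None):
    """Tabular Q-learning over the topology's adjacency lists; max_steps
    is the single-training step limit mentioned in the method."""
    rng = rng or random.Random(0)
    Q = {s: {a: 0.0 for a in adj[s]} for s in adj}   # (1) initialize Q table
    for _ in range(episodes):
        s = source
        for _ in range(max_steps):
            acts = list(adj[s])
            if rng.random() <= epsilon:               # (2) select action a
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[s][x])
            r = 10.0 if a == dest else -1.0           # (3) placeholder reward
            best_next = max(Q[a].values()) if Q[a] else 0.0
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = a
            if s == dest:                             # (4) stop at destination
                break
    return Q

def greedy_path(Q, source, dest, limit=20):
    """Extract the route by following the max-Q action from each state."""
    path, s = [source], source
    while s != dest and len(path) <= limit:
        s = max(Q[s], key=Q[s].get)
        path.append(s)
    return path
```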
Further, when planning the backup path, the link bandwidth performance and the link delay are concerned, and therefore α is set to 0.4, β is set to 0.3, γ is set to 0.1, and δ is set to 0.2.
The invention has the beneficial effects that: the present invention defines a direction factor theta to describe the direction of each transition in the path. And guiding the reinforcement learning agent to explore according to the theta value in the path transfer process. Therefore, the exploration efficiency is improved while the agent obtains sufficient experience from the environment, and the generation of loops in the training phase is reduced. Compared with the traditional route convergence algorithm, the method can find the optimal path in the route convergence process by utilizing the characteristics of continuous interaction and strategy adjustment between reinforcement learning and the network environment.
Drawings
Figure 1 is an SDN network topology diagram;
fig. 2 is a network topology hierarchy diagram.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Aiming at the problem that existing SDN control adopts the Dijkstra algorithm as the shortest-path route convergence algorithm, the invention applies reinforcement learning to SDN route convergence. Exploiting the SDN separation of forwarding and control, the network topology environment is used directly for training the Q value table. Considering that the residual bandwidth of each link in the topology changes dynamically as different flows are forwarded, the invention introduces reinforcement learning and uses its self-exploring interaction with the environment to cope with the dynamics of the network, thereby finding an optimal route convergence path while guaranteeing route convergence speed.
The invention provides an SDN route convergence method under reinforcement learning based on network characteristics, which comprises the following steps:
step 1: establishing an SDN network area topological graph and dividing network areas in a fine-grained manner. Specifically: a network connection matrix is constructed from the SDN network topology, containing the adjacency relation among the network nodes. The network connection matrix and the number of nodes n in the network topology are input into a hub node election algorithm, which records each node's connection count, expressed as follows:
wherein node_link[i] is the link count of node i and T[i][j] indicates a link of node i; the node with the highest link count is elected as the hub node. The hub node and its adjacent nodes form one divided network area, as shown in fig. 1.
step 2: defining a direction factor θ ∈ {-1, 0, 1} and setting a source node; when the transfer from one node to another in the topological graph has θ = -1, it moves closer to the source node; when θ = 1, it moves farther from the source node; when θ = 0, the two nodes have the same shortest distance to the source node; a network topology hierarchy graph is constructed according to the θ values between different nodes in the topological graph and their relationship to the source node, as shown in fig. 2, where all nodes within each layer have the same shortest distance to the source node;
and step 3: using the Q-learning algorithm as the reinforcement learning model, inputting the network topology hierarchy graph obtained in step 2 into the model to guide the agent's exploration. When the height difference between layers tends to 0, the nodes are almost in the same layer; transfers between nodes within a layer are little affected by the layering, and the agent exhibits a random exploration state. As the height difference between layers keeps increasing, the agent is guided to explore toward lower layers, exhibiting an efficient exploration state.
The following formula is set:
h(θ) = f(θ, t)
wherein t represents the episode iteration count and step is a set threshold. The function f takes the absolute value of θ according to the episode iteration progress, and h(θ) determines the specific value of θ according to the state of the corresponding action. As episodes iterate, among the selectable actions the θ value corresponding to actions close to the source node becomes smaller, and the θ value corresponding to actions far from the source node becomes larger; an interval D is therefore defined:
The interval D ranges from 0 to the sum of the θ values corresponding to all currently selectable actions; it is divided into n subintervals, where n is the number of currently selectable actions, and the length of each subinterval is the θ value of the corresponding action.
η=random(D)
The random number η is obtained by sampling uniformly from the interval D.
Calculated by the following formula:
the function g (D, η) is the action corresponding to the interval D where the random number η is located, i.e. the agent searches for the action a to be selected.
The strategy formula of reinforcement learning based on the network area characteristics is as follows:
the epsilon-greedy policy balances exploration and utilization based on an exploration factor epsilon (epsilon 0, 1). Generating a random number sigma (sigma belongs to [0,1]), and when sigma is less than or equal to epsilon, using a random strategy by agent to explore the environment through randomly selecting actions to obtain experience; when σ > ε, agent uses a greedy strategy to leverage the experience that has been gained.
When agent generates state transition, inputting the current state s and the selected action a into a function R, generating reward to evaluate the state transition, and setting a reward function:
Rt(s,a)=αB-βt+γδ(s_-d)-δ
R is the reward obtained at node i for selecting a link to node j. α, β, γ and δ are four positive parameters weighing the four reward components. B is the residual bandwidth of the link corresponding to the selected action, and t is the delay of that link. δ(s_ − d) is the stimulus function, where s_ denotes the state reached after selecting action a in state s and d is the destination state.
Training the Q value table according to the set reward function, which comprises the following specific steps:
and setting the maximum step number of the single training.
(1) Initializing a Q value table and a reward function R;
(2) adopting a strategy based on network area characteristics, and selecting an action a;
(3) executing action a, transferring to a state s _, calculating a reward value by using a reward function R, and updating a Q value table;
(4) Judge whether s_ is the destination node. If not, let s = s_ and return to (2). If s_ is the destination node, the training ends.
And obtaining a path Routing by the trained Q value table, wherein the path is the converged optimal path after the link fails.
One specific application example of the present invention is as follows:
step 1: An SDN network area topological graph is constructed; Mininet is used to build the network topology shown in fig. 1, comprising 16 OpenFlow switches and 5 hosts.
Step 2: The Q-learning algorithm is used as the reinforcement learning model. The route convergence method provided by the invention is modeled as a Markov decision process, and the MDP quadruple of the model is defined as follows:
(1) state collection: in the network topology, each switch represents a state, and therefore, according to the network topology, the invention defines a network state set as follows:
S = [s1, s2, s3, …, s16]
where s1 to s16 represent the 16 OpenFlow switches in the network. The source node of a data packet indicates its initial state, and the destination node indicates its termination state. When a data packet reaches the destination node, it enters the termination state. Once the current data packet reaches the termination state, one round of training terminates, and the data packet returns to the initial state for the next round of training.
(2) An action space: in an SDN network, the transmission path of a data packet is determined by the network state, i.e. the data packet can only be transmitted at connected network nodes. According to the network topological graph, the invention defines the network connection state as shown in the following formula:
since packets can only be transmitted at connected network nodes, the present invention can define a set of actions for each state S [ i ] ∈ S according to the set of network states and the network connection state as follows:
A(si)={sj|T[si][sj]=1}
indicates that the current state is at siThe state-selectable action set appears as s on the network topologyiDirectly connected nodes sjI.e. the current state siWill only select the state s connected to itj. For example: state s1The action set of (1) is: a(s)1)={s2,s4}。
(3) State transition: in each round of training, when the data packet is in state si and executes the selected action, the data packet moves to the next state of the round.
(4) The reward function:
Rt(s,a)=αB-βt+γδ(s_-d)-δ
the present invention focuses on the link bandwidth performance and the link delay when planning the backup path, and therefore, α is set to 0.4, β is set to 0.3, γ is set to 0.1, and δ is set to 0.2.
In the system model, each time a data packet passes through a switch it receives a negative reward representing the forwarding cost; the more switches it traverses during forwarding, the more negative reward accumulates and the higher the cost. To increase link bandwidth utilization, the data packet is encouraged to select links with high bandwidth utilization: each time it passes through a switch, it receives a reward equal in size to the link utilization. To drive the data packet to reach the destination node as soon as possible, an extra reward of size 1 is obtained when the destination node is reached, expressed by the formula:
in the formula siIndicating the current state, i.e. the current packet is on switch number i, ajIndicating that the switch numbered j is selected.
In the invention, a network region characteristic strategy is adopted to carry out reinforcement learning model training.
After determining the MDP quadruple, when a certain link fails, a new path from the source node to the destination node is searched, and the Q value table is trained using the Q-learning algorithm:
and setting the maximum step number of the single training.
(1) Initializing a Q value table and a reward function R;
(2) adopting a strategy based on network area characteristics, and selecting an action a;
(3) executing action a, transferring to a state s _, calculating a reward value by using a reward function R, and updating a Q value table;
(4) Judge whether s_ is the destination node. If not, let s = s_ and return to (2).
In the reinforcement-learning-based route convergence planning process, the learning rate α is set to 0.8, the discount rate γ is set to 0.6, and the exploration factor of the ε-greedy action policy is set to ε = 0.3.
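A single update of step (3) with these hyperparameters follows the standard Q-learning rule (the helper name is ours):

```python
def q_update(Q, s, a, r, s_next, alpha=0.8, gamma=0.6):
    """One Q-learning update with the embodiment's learning rate and
    discount: Q(s,a) += alpha * (r + gamma * max_a' Q(s_,a') - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```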
And obtaining a path Routing according to the trained Q value table, wherein the path is the converged optimal path after the link fails.
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.
Claims (4)
1. An SDN route convergence method under reinforcement learning based on network characteristics is characterized by comprising the following steps:
step 1: establishing an SDN network area topological graph, and dividing network areas in a fine-grained manner;
step 2: defining a direction factor θ ∈ {-1, 0, 1} and setting a source node; when the transfer from one node to another in the topological graph has θ = -1, it moves closer to the source node; when θ = 1, it moves farther from the source node; when θ = 0, the two nodes have the same shortest distance to the source node; a network topology hierarchy graph is constructed according to the θ values between different nodes in the topological graph and their relationship to the source node, and all nodes within each layer have the same shortest distance to the source node;
and step 3: using the Q-learning algorithm as the reinforcement learning model, inputting the network topology hierarchy graph obtained in step 2 into the model to guide agent exploration; when the height difference between layers tends to 0, the nodes are in the same layer, transfers between nodes within a layer are little affected by the layering, and the agent exhibits a random exploration state; when the height difference between layers keeps increasing, the agent is guided to explore toward lower layers and exhibits an efficient exploration state;
the following formula is set:
h(θ) = f(θ, t)
wherein t represents the episode iteration count and step is a set threshold; the function f takes the absolute value of θ according to the episode iteration progress, and h(θ) determines the specific value of θ according to the state of the corresponding action; as episodes iterate, among the selectable actions the θ value corresponding to actions close to the source node becomes smaller, and the θ value corresponding to actions far from the source node becomes larger, so an interval D is defined:
the interval D ranges from 0 to the sum of the θ values corresponding to all currently selectable actions; it is divided into n subintervals, where n is the number of currently selectable actions, and the length of each subinterval is the θ value of the corresponding action;
η=random(D)
a random number η is obtained by sampling uniformly from the interval D;
calculated by the following formula:
the function g (D, eta) is the action corresponding to the interval D where the random number eta is located, namely, the agent searches the action a to be selected;
the strategy formula of reinforcement learning based on the network area characteristics is as follows:
the ε-greedy strategy balances exploration and exploitation based on an exploration factor ε (ε ∈ [0, 1]); a random number σ (σ ∈ [0, 1]) is generated; when σ ≤ ε, the agent uses a random strategy, exploring the environment by randomly selecting actions to gain experience; when σ > ε, the agent uses a greedy strategy to exploit the experience already gained;
when agent generates state transition, inputting the current state s and the selected action a into a function R, generating reward to evaluate the state transition, and setting a reward function:
Rt(s,a)=αB-βt+γδ(s_-d)-δ
R is the reward obtained at node i for selecting a link to node j; α, β, γ and δ are four positive parameters weighing the four reward components; B is the residual bandwidth of the link corresponding to the selected action, and t is the delay of that link; δ(s_ − d) is the stimulus function, where s_ denotes the state reached after selecting action a in state s and d is the destination state;
and training the Q value table according to the set reward function, and obtaining a path Routing through the trained Q value table, wherein the path is the converged optimal path after the link fails.
2. The SDN route convergence method under reinforcement learning based on network characteristics according to claim 1, wherein the fine-grained network area division specifically comprises: constructing a network connection matrix according to the SDN network topology, the network connection matrix comprising the adjacency relation among the network nodes; inputting the network connection matrix and the number of nodes n in the network topology into a hub node election algorithm, which records each node's connection count, expressed as follows:
wherein node_link[i] is the link count of node i and T[i][j] indicates a link of node i; the node with the highest link count is elected as the hub node; the hub node and its adjacent nodes form one divided network area.
3. The SDN route convergence method under reinforcement learning based on network features of claim 1, wherein the process of training the Q value table by the Q-learning algorithm is specifically as follows:
setting the maximum step number of single training;
(1) initializing a Q value table and a reward function R;
(2) adopting a strategy based on network area characteristics, and selecting an action a;
(3) executing action a, transferring to a state s _, calculating a reward value by using a reward function R, and updating a Q value table;
(4) judging whether s_ is the destination node; if not, letting s = s_ and returning to (2); if s_ is the destination node, the training ends.
4. The SDN route convergence method under reinforcement learning based on network characteristics of claim 1, wherein link bandwidth performance and link delay are considered when planning the backup path, so that α = 0.4, β = 0.3, γ = 0.1 and δ = 0.2 are set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110145046.2A CN112968834B (en) | 2021-02-02 | 2021-02-02 | SDN route convergence method under reinforcement learning based on network characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112968834A CN112968834A (en) | 2021-06-15 |
CN112968834B true CN112968834B (en) | 2022-05-24 |
Family
ID=76271994
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110145046.2A Active CN112968834B (en) | 2021-02-02 | 2021-02-02 | SDN route convergence method under reinforcement learning based on network characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112968834B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106411749A (en) * | 2016-10-12 | 2017-02-15 | State Grid Jiangsu Electric Power Co. Suzhou Power Supply Company | Path selection method for software defined network based on Q learning |
CN108667734A (en) * | 2018-05-18 | 2018-10-16 | Nanjing University of Posts and Telecommunications | A fast routing decision algorithm based on Q-learning and LSTM neural networks |
CN111770019A (en) * | 2020-05-13 | 2020-10-13 | Xidian University | Q-learning optical network-on-chip self-adaptive route planning method based on Dijkstra algorithm |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11126929B2 (en) * | 2017-11-09 | 2021-09-21 | Ciena Corporation | Reinforcement learning for autonomous telecommunications networks |
Non-Patent Citations (1)
Title |
---|
Q-DATA: Enhanced Traffic Flow Monitoring in Software-Defined Networks applying Q-learning; Trung V. Phan et al.; 2019 15th International Conference on Network and Service Management (CNSM); 2020-02-27; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112968834A (en) | 2021-06-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qi et al. | Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach | |
Liu et al. | DRL-R: Deep reinforcement learning approach for intelligent routing in software-defined data-center networks | |
CN110488861A (en) | Unmanned plane track optimizing method, device and unmanned plane based on deeply study | |
CN114697229B (en) | Construction method and application of distributed routing planning model | |
CN108521375A (en) | The transmission of the network multi-service flow QoS based on SDN a kind of and dispatching method | |
CN112437020A (en) | Data center network load balancing method based on deep reinforcement learning | |
Liu et al. | Drl-or: Deep reinforcement learning-based online routing for multi-type service requirements | |
CN108075975B (en) | Method and system for determining route transmission path in Internet of things environment | |
Singh et al. | OANTALG: an orientation based ant colony algorithm for mobile ad hoc networks | |
CN113194034A (en) | Route optimization method and system based on graph neural network and deep reinforcement learning | |
CN114500360A (en) | Network traffic scheduling method and system based on deep reinforcement learning | |
Zhang et al. | IFS-RL: An intelligent forwarding strategy based on reinforcement learning in named-data networking | |
CN114143264A (en) | Traffic scheduling method based on reinforcement learning in SRv6 network | |
Oužecki et al. | Reinforcement learning as adaptive network routing of mobile agents | |
Mani Kandan et al. | Fuzzy hierarchical ant colony optimization routing for weighted cluster in MANET | |
CN110225493A (en) | Based on D2D route selection method, system, equipment and the medium for improving ant colony | |
Zhou et al. | Multi-task deep learning based dynamic service function chains routing in SDN/NFV-enabled networks | |
Cárdenas et al. | A multimetric predictive ANN-based routing protocol for vehicular ad hoc networks | |
Zhang et al. | A service migration method based on dynamic awareness in mobile edge computing | |
Liu et al. | BULB: lightweight and automated load balancing for fast datacenter networks | |
Wei et al. | GRL-PS: Graph embedding-based DRL approach for adaptive path selection | |
CN112968834B (en) | SDN route convergence method under reinforcement learning based on network characteristics | |
Guo et al. | A deep reinforcement learning approach for deploying sdn switches in isp networks from the perspective of traffic engineering | |
Chen et al. | Traffic engineering based on deep reinforcement learning in hybrid IP/SR network | |
Cui et al. | Particle swarm optimization for multi-constrained routing in telecommunication networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||