CN113162800B - Network link performance index anomaly localization method based on reinforcement learning - Google Patents

Network link performance index anomaly localization method based on reinforcement learning

Info

Publication number
CN113162800B
CN113162800B
Authority
CN
China
Prior art keywords
switch
network
int
node
abnormal node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110270428.8A
Other languages
Chinese (zh)
Other versions
CN113162800A (en)
Inventor
王雄
顾静玲
任婧
徐世中
王晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110270428.8A priority Critical patent/CN113162800B/en
Publication of CN113162800A publication Critical patent/CN113162800A/en
Application granted granted Critical
Publication of CN113162800B publication Critical patent/CN113162800B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L67/025 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] for remote control or remote monitoring of applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services

Abstract

The invention discloses a reinforcement learning-based network link performance index anomaly localization method, which reflects the link performance state by collecting network node state information through the In-band Network Telemetry (INT) technique. When locating an abnormal node, the traffic flow routing information obtained from the environment is combined with the user feedback results to generate an environment state; the environment state is input to a reinforcement learning (RL) agent to obtain a Q value for each switch, the switch with the largest Q value is selected as the abnormal node and its INT function is opened; the environment state is then input again to locate and repair further abnormal nodes until all traffic flows fed back by users are normal. Meanwhile, whether an opened switch (a located abnormal node) is a true abnormal node is checked from the INT information, and a reward is computed and stored in an experience replay pool, so that the Q value output by the decision neural network is positively correlated with the reward value of the corresponding switch and the decision neural network locates abnormal nodes more accurately. In this way the INT technique monitors abnormal nodes in real time without being used on all nodes, which reduces network management overhead.

Description

Network link performance index anomaly localization method based on reinforcement learning
Technical Field
The invention belongs to the technical field of network management, and particularly relates to a reinforcement learning-based network link performance index anomaly localization method.
Background
In order to ensure the normal operation of the network and meet the transmission performance requirements of network applications, the network management system needs to obtain the internal operating state of the network in a timely, accurate and comprehensive manner and to troubleshoot network faults promptly. Fine-grained network link performance indicators truly reflect the internal operating state of the network, so that state can be obtained by measuring the network link performance indicators.
In traditional networks, network link performance indicators are mostly measured by statistical-characteristic estimation methods and algebraic inversion methods. Statistical-characteristic estimation can only estimate the distribution of the network link performance indicators and cannot obtain their real-time values, while algebraic inversion has a high implementation cost. The emergence of programmable network technology lays the foundation for timely and accurate measurement of network link performance indicators: its flexible control capability makes it easier to control how the network link performance indicators are probed.
In recent years, a network link performance index measurement method that has attracted much attention is In-band Network Telemetry (INT), a novel telemetry protocol jointly proposed by Barefoot Networks, Arista, Dell, Intel and VMware to realise fine-grained, real-time data plane monitoring. The INT technique monitors the network state by collecting and reporting the internal running state of the network on the data plane; the whole process requires no control plane involvement and therefore adds no burden to the CPU of the network devices. It allows data packets to query the internal state of the network devices, such as the link utilization and the internal queue length of the switch, from which the link performance can be intuitively observed.
Fig. 1 shows an INT example, where a flow starts from source H1 and reaches traffic destination H2 via switches SW1, SW2 and SW3. In INT there are three types of nodes: the INT source node (INT Src), the INT transit node (INT Transit) and the INT sink node (INT Sink). In the example of fig. 1, the INT source node of the flow is H1, the INT sink node is SW3, and the INT transit nodes are SW1 and SW2. For each incoming packet, the INT source node H1 inserts an INT header specifying the list of INT metadata to be monitored at each hop, i.e., the internal switch state; here the switch ID and the single-hop delay are selected for monitoring. The INT transit nodes SW1 and SW2 then check the INT header and insert the specified INT metadata (i.e., the switch ID and single-hop delay). The INT sink node SW3 inserts its own INT metadata and forwards the original data packet and the INT header/metadata to the traffic destination H2 and to the monitoring engine, respectively. Finally, the monitoring engine processes the INT header/metadata and obtains the network link performance indicators, completing the measurement.
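A minimal Python sketch of this collect-and-report flow is given below; the class and field names are illustrative assumptions, not the INT wire format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class IntMetadata:
    switch_id: int        # identity of the switch that inserted this entry
    hop_delay_us: float   # single-hop processing delay recorded by that switch

@dataclass
class Packet:
    payload: bytes
    int_stack: List[IntMetadata] = field(default_factory=list)  # INT metadata stack

def int_node_insert(pkt: Packet, switch_id: int, hop_delay_us: float) -> None:
    # Source/transit/sink behaviour reduced to its essence: append the metadata
    # requested by the INT header (here: switch ID and single-hop delay).
    pkt.int_stack.append(IntMetadata(switch_id, hop_delay_us))

def monitoring_engine(pkt: Packet) -> dict:
    # Sink-side processing: turn the metadata stack into per-hop indicators.
    return {m.switch_id: m.hop_delay_us for m in pkt.int_stack}

# A flow H1 -> SW1 -> SW2 -> SW3 -> H2, as in Fig. 1
pkt = Packet(payload=b"app data")
for sid, delay in [(1, 12.0), (2, 35.0), (3, 9.0)]:
    int_node_insert(pkt, sid, delay)
print(monitoring_engine(pkt))   # {1: 12.0, 2: 35.0, 3: 9.0}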
However, measuring network link performance indicators by appending INT headers and metadata stacks to all packets has drawbacks: 1) network overhead: INT consumes considerable network bandwidth, and for small packets this overhead problem is exacerbated; 2) monitoring-system overload: for traffic with a high packet arrival rate, the network monitoring system may be overloaded, resulting in increased processing delay.
There are two existing mitigations for these problems: 1) reducing the number of metadata bytes inserted into each data packet, which no longer helps as the network topology grows; 2) having only selected nodes in the network insert metadata, which collects limited link performance information and makes it difficult to monitor the state of the whole network.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a reinforcement learning-based network link performance index anomaly localization method.
In order to achieve the above object, the reinforcement learning-based network link performance index anomaly localization method of the present invention is characterized by comprising the following steps:
(1) constructing a reinforcement learning agent
A reinforcement learning (RL) agent, a controller and the underlying network devices form a network telemetry closed-loop control system based on user experience;
the underlying network devices are the switches that form the network and serve as the reinforcement learning environment, and the controller is responsible for converting the decisions of the reinforcement learning agent into network operations (actions) and sending the corresponding control instructions to the environment, thereby causing the environment state to transition; a decision is to open the INT function of one switch, the network operation is the control instruction that opens that switch's INT function, and the environment state change is the change caused by repairing an abnormal node after the INT function of the corresponding switch has been opened;
the reinforcement learning agent is an intelligent decision system, hosted on a terminal server, that adaptively manages which underlying network devices have their INT function opened, and comprises a decision neural network, an action filtering module and an experience replay pool;
the input of the decision neural network is the environment state, which is generated by combining the traffic flow routing information obtained from the environment with the user feedback results (which service flows have problems);
(2) abnormal node location
The reinforcement learning agent interacts with the environment in discrete time steps (positioning time slots), and in each positioning time slot, abnormal node positioning is carried out:
2.1) Initialize the network with no switch INT function opened. Obtain the environment state s_0 under the condition that no network operation has been selected (no switch INT function is opened), and input it into the decision neural network to obtain a Q value vector Q_0 containing n Q values, each corresponding to opening the INT function of a corresponding switch. Send the Q value vector Q_0 to the action filtering module, which selects the switch corresponding to the largest Q value in Q_0 as the abnormal node and issues the corresponding network operation a_1 (the control instruction that opens the INT function of that switch) to the environment, which in turn causes the environment state s_0 to transition to the environment state s_1; set the time t = 1;
2.2) At time t, input the environment state s_t reached after selecting network operation a_t into the decision neural network to obtain a Q value vector, denoted Q(s_t, a_t), containing n Q values, each corresponding to opening the INT function of a corresponding switch. Send the Q value vector Q(s_t, a_t) to the action filtering module, which selects the switch corresponding to the largest Q value in Q(s_t, a_t) (if that switch has already had its INT function opened, the switch with the next largest Q value is selected, and so on until a switch whose INT function has not been opened is found). Take this switch as the abnormal node, issue the corresponding network operation a_{t+1} (the control instruction that opens the INT function of that switch) to the environment, and monitor the abnormal node through the INT information. If it is a true abnormal node, repair it, which causes the environment state s_t to transition to the environment state s_{t+1}; if it is not a true abnormal node, further select the switch with the next largest Q value as the abnormal node, issue the corresponding network operation a_{t+1} (the control instruction that opens the INT function of that switch) to the environment, and then monitor and repair, or further select, monitor and repair, until a repair causes the environment state s_t to transition to the environment state s_{t+1};
2.3) At time t, after the INT function of the switch is opened, the agent calculates the reward r_t of the network operation (action) from the INT information collected in the data packets and constructs a piece of training data <s_t, a_t, r_t>;
The reward r_t is calculated according to the following formula:
r_t = v,  if the opened switch is monitored to be a true abnormal node
r_t = -m, otherwise
where v is the number of traffic flows that recover after the switch selected by network operation a_t is monitored, through the INT information carried in the data packets flowing through it, to be a true abnormal node and is repaired, and m denotes the number of consecutive times the switch is monitored not to be a true abnormal node;
The training data <s_t, a_t, r_t> is sent to the experience replay pool;
2.4) Judge whether all N switches in the network have had their INT function opened. If so, end the abnormal node localization; otherwise, further judge whether an abnormal network link performance index still exists in the network. If not, end the abnormal node localization; if so, the environment state has transitioned to the environment state s_{t+1}; set t = t + 1 and return to step 2.2) to continue locating abnormal nodes;
2.5) When the amount of training data <s_t, a_t, r_t> in the experience replay pool reaches a set quantity D, sample part of the data from the experience replay pool to update the parameters of the decision neural network, so that the Q value output by the decision neural network is positively correlated with the reward value of the corresponding switch; the sampled portion is determined according to the specific situation;
After the abnormal node localization is completed, wait for the next positioning time slot to arrive, then perform steps 2.1)-2.5) again to locate abnormal nodes and update the decision neural network parameters of the reinforcement learning agent.
The objects of the invention are achieved as follows:
The reinforcement learning-based network link performance index anomaly localization method of the invention constructs a reinforcement learning (RL) agent consisting of a decision neural network, an action filtering module and an experience replay pool, which forms, together with the controller and the underlying network devices (serving as the environment), a network telemetry closed-loop control system based on user experience. When locating an abnormal node, the traffic flow routing information obtained from the environment is combined with the user feedback results to generate an environment state; the environment state is input to the reinforcement learning (RL) agent to obtain a Q value for each switch, the switch with the largest Q value is selected as the abnormal node, its INT function is opened, and it is monitored in real time; the environment state is then input again, further abnormal nodes are located, and they are repaired until all traffic flows fed back by users are normal. Meanwhile, whether an opened switch (a located abnormal node) is a true abnormal node is checked from the INT information, and a reward is computed and stored in the experience replay pool, so that the Q value output by the decision neural network is positively correlated with the reward value of the corresponding switch and the decision neural network locates abnormal nodes more accurately; the INT technique thus monitors abnormal nodes in real time without being used on all nodes, which reduces network management overhead.
Drawings
FIG. 1 is a schematic diagram of an INT example of the present invention;
FIG. 2 is a schematic diagram of an application scenario of the reinforcement learning-based network link performance index anomaly localization method of the present invention;
FIG. 3 is a schematic diagram of the reinforcement learning working model of the DQN algorithm;
FIG. 4 is a flowchart of an embodiment of the reinforcement learning-based network link performance index anomaly localization method of the present invention;
FIG. 5 is a schematic diagram of an application scenario of the reinforcement learning-based network link performance index anomaly localization method of the present invention;
FIG. 6 is a schematic diagram of the localization process of the reinforcement learning-based network link performance index anomaly localization method of the present invention;
FIG. 7 is a schematic diagram of an embodiment of a decision neural network input;
fig. 8 is a diagram illustrating an exemplary change in environmental conditions.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. It should be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
Network link performance index measurement provides an important input for enhancing user experience, and a user's direct feedback while using an application can also serve as an important basis for network management. When using an application, the only thing the user cares about is its response time, i.e., the end-to-end delay of the service. When a user reports that the response time of the current application exceeds the allowed limit, the current network is experiencing excessive delay, and the network manager should measure the network link performance indicators. In-band network telemetry can monitor the fine-grained state of links in the network, but it is limited by bandwidth and per-packet overhead, so opening INT on every switch is unrealistic. Deciding (i.e., locating), according to user feedback, the nodes whose network link performance indicators are abnormal, and performing fine-grained state monitoring only on those abnormal nodes, is therefore of great value in network management.
For link performance in a network, delay is usually an important indicator of changes in the network link performance state. The propagation delay and transmission delay of a data packet can generally be regarded as constants, so the change of the single-hop processing time inside a network node (switch) directly reflects the network link performance. The scenario considered by the invention is shown in fig. 2: in a programmable network with N nodes, (1) the data plane is composed of a series of programmable switches, which receive control instructions from the controller via the southbound interface and perform monitoring, processing and logical forwarding of data packets; (2) the control plane is assumed to contain a single controller, which bears the task of connecting the application plane and the data plane, sends control instructions directly to the switches, and communicates upward with high-level management applications through the northbound interface; (3) there are M users and F application service flows in the programmable network, the routing path of each service is unique (the shortest path is selected), and each user periodically feeds back a result while using a network application. In this embodiment the feedback is the satisfaction r = {r_1, r_2, ..., r_F}: a satisfied user means the corresponding service flow has no problem, and an unsatisfied user means it does. After receiving the satisfaction values from the users, the reinforcement learning agent (terminal server) quickly and accurately selects a certain "key" node v_i in the network, i.e., locates an abnormal node, instructs the programmable switch through the controller to open its INT function, and, after the environment state changes, determines and opens the next abnormal node.
The invention aims to locate the switches whose network link performance indicators are abnormal (for example, congested), and to use the INT technique to insert the abnormal nodes' network state information into data packets and observe network state changes in real time, thereby reducing unnecessary network management overhead.
To solve the above problems, the invention constructs a network telemetry closed-loop control system based on reinforcement learning (RL): an intelligent decision system based on reinforcement learning that couples a programmable network (which executes control instructions) with the INT technique (which collects network state). An agent in reinforcement learning learns in an interactive environment by repeated trial and exploration, using the feedback to its own behaviour and its accumulated experience. Reinforcement learning is a third learning paradigm alongside supervised learning and unsupervised learning, characterised by generating training data in real time through interaction. Unlike other deep learning methods, reinforcement learning does not need training data prepared in advance; according to the environment's positive and negative feedback to its behaviour, it dynamically learns the regularities of the environment with the goal of maximizing the expected cumulative reward, and thereby finds a suitable behaviour model to guide its operation.
Reinforcement learning can generally be divided into three categories according to the design of the objective function: value-based, policy-based, and actor-critic. In the present invention, the reinforcement learning algorithm adopts the DQN algorithm; the reinforcement learning working model based on the DQN algorithm is shown in FIG. 3. DQN (Deep Q-Network), one of the basic value-based algorithms, is used in many reinforcement learning tasks. It uses a neural network to approximate the value function: the input of the neural network is the state, and the output is the Q value of each action. After the value function is computed by the neural network, the agent selects the optimal action according to the output values and acts. DQN is characterised by: (1) a deep neural network is used to approximate the action-value function; (2) a target Q network is used when updating the parameters of the deep neural network; (3) an experience replay technique is used, storing the agent's experience in each time slot and using it in subsequent parameter updates and training.
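As a concrete illustration of such a decision network, the following PyTorch sketch maps the environment state defined later (an n x (2n+2) matrix) to one Q value per switch; the hidden-layer sizes and the flattening of the state matrix are illustrative assumptions, not the architecture claimed by the invention.

import torch
import torch.nn as nn

class DecisionQNetwork(nn.Module):
    """Maps a flattened environment state to one Q value per switch."""
    def __init__(self, n_switches: int, hidden: int = 128):
        super().__init__()
        state_dim = n_switches * (2 * n_switches + 2)   # n x (2n+2) state matrix
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_switches),               # Q value of "open INT on switch i"
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state.flatten(start_dim=1))

q_net = DecisionQNetwork(n_switches=4)
state = torch.zeros(1, 4, 2 * 4 + 2)     # batch of one environment state
q_values = q_net(state)                  # shape (1, 4): one Q value per switch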
FIG. 4 is a flowchart of an embodiment of the reinforcement learning-based network link performance index anomaly localization method.
In this embodiment, as shown in fig. 4, the reinforcement learning-based network link performance index anomaly localization method of the present invention includes the following steps:
step S1: constructing a reinforcement learning agent
As shown in fig. 5, the Reinforcement Learning (RL) agent, controller, and underlying network devices form a network telemetry closed-loop control system based on user experience.
The underlying network devices are the switches that form the network and serve as the reinforcement learning environment. The controller is responsible for converting the decisions of the reinforcement learning agent into network operations (actions) and sending the corresponding control instructions to the environment, thereby causing the environment state to transition. A decision is to open the INT function of one switch; the corresponding network operation is the control instruction that opens that switch's INT function; and the environment state change is the change caused by repairing an abnormal node after the INT function of the corresponding switch has been opened.
The reinforcement learning agent is an intelligent decision system, hosted on a terminal server, that adaptively manages which underlying network devices have their INT function opened; as shown in fig. 6, it comprises a decision neural network, an action filtering module and an experience replay pool. The decision neural network is composed of multiple computation layers and can be regarded as a complex mathematical function that maps a given input to the required output; it is prior art in reinforcement learning and is not described further here.
In the invention, the input of the decision neural network is the environment state, which is generated by combining the traffic flow routing information obtained from the environment with the user feedback results (which service flows have problems). In this embodiment the feedback result is: if the user's maximum tolerable delay is exceeded (the user is unsatisfied), the service flow used by that user has a problem; otherwise it does not.
In the invention, the reinforcement learning agent receives the user feedback results and uses the service routing information to generate the environment state, locates the nodes whose network link performance indicators are abnormal, and performs INT state monitoring on those nodes. For the neural network in DQN (Deep Q-Network), the most intuitive input would be the users' feedback on all services together with the network traffic matrix derived from the traffic flow routing information. The biggest problem with this input is its scale: with N nodes (switches) in the network, the traffic matrix is N x N, and the dimension of the user feedback grows with the traffic volume; for a medium-scale network with 100 hosts, the number of flows can approach ten thousand, which makes such an input impractical. Conversely, when there is little traffic in the network this input becomes too sparse, which is unfavourable for training the neural network.
Based on these considerations, in this embodiment the input of the decision neural network is set to the number of traffic flows passing through each node and whether each node lies on the routing path of a problem flow fed back by a user. Under this strategy, the input is an n x 2 matrix, where n is the total number of network nodes (switches) traversed by the routing paths of all traffic flows in the network. Numbering the switches from 1 to n, the input of the decision neural network is organised as follows: the row number of the matrix is the switch number, column 1 of row i gives the number of service flows passing through switch i, and column 2 of row i indicates whether a problem flow passes through switch i, where 1 means yes and 0 means no.
To simplify the construction of the input, in this embodiment the user feedback result is determined from the user satisfaction: satisfaction is mapped to 0 and 1, where 0 (unsatisfied) means the current service flow has a delay problem and is a problem flow, and 1 (satisfied) means it has no delay problem and is not a problem flow. For each switch it is then judged whether it lies on a traffic flow path whose user satisfaction is 0; if so it is labelled 1, otherwise 0. This label, concatenated with the number of service flows passing through each node, could serve as the neural network input. Although such an input solves the scale problem, it carries too little information and ignores the correlation between traffic paths, which slows down the learning of the neural network. Therefore, on top of this input, the source and destination node information of the services is added; only the source and destination of each flow are added, to keep the input small and simple. The environment state is then a matrix of dimension n x (2n+2), where n is the total number of network nodes (switches) traversed by the routing paths of all traffic flows in the network. Columns 1-2 have the same meaning as above; columns 3 to (n+2) indicate the source nodes of the traffic flows passing through switch i: if the source node of one of those flows is switch k, then row i, column k+2 is 1; columns (n+3) to (2n+2) indicate the destination nodes of the traffic flows passing through switch i: if the destination node of one of those flows is switch k, then row i, column k+n+2 is 1. The size of this input depends only on the network topology, not on the number of traffic flows in the network.
An input example is shown on the left of fig. 7: in a network with 4 nodes numbered 1-4, there are two traffic flows f1 and f2 whose paths are 2->1->4 and 1->2->3 respectively, and the user feedback indicates that service f2 has a delay problem (whether a service has a problem is determined from the user satisfaction, and here it is a delay problem). The corresponding input is shown on the right of fig. 7, where the first row [2 1 1 1 0 0 0 0 1 1] is the information of switch 1: the 2 in column 1 indicates that two traffic flows pass through switch 1; the 1 in column 2 indicates that a problem flow passes through switch 1; the 1s in columns 3 and 4 indicate that the source nodes of the two flows through switch 1 are node 1 and node 2; and the last two 1s indicate that the destination nodes of the two flows are node 3 and node 4.
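The input of fig. 7 can be reproduced with a short routine such as the following Python sketch; the function and variable names are illustrative, not part of the claimed method.

import numpy as np

def build_state(n, flows, feedback):
    """flows: list of routing paths (1-based switch numbers), e.g. [[2, 1, 4], [1, 2, 3]];
    feedback: per-flow user result, 1 = problem flow, 0 = satisfied."""
    state = np.zeros((n, 2 * n + 2), dtype=int)
    for path, bad in zip(flows, feedback):
        src, dst = path[0], path[-1]
        for sw in path:
            i = sw - 1
            state[i, 0] += 1                 # column 1: number of flows through switch i
            state[i, 1] |= bad               # column 2: any problem flow through switch i
            state[i, 2 + src - 1] = 1        # columns 3..n+2: source nodes of those flows
            state[i, 2 + n + dst - 1] = 1    # columns n+3..2n+2: destination nodes
    return state

# Fig. 7: n = 4, f1: 2->1->4 (no problem), f2: 1->2->3 (delay problem)
print(build_state(4, [[2, 1, 4], [1, 2, 3]], [0, 1]))
# first row: [2 1 1 1 0 0 0 0 1 1], matching the example above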
In this embodiment, a client-server model is adopted to realise the interaction between the reinforcement learning agent (RL agent) and the environment, with communication between the two sides based on sockets. The switches perform forwarding of network traffic and network-wide telemetry, and the telemetry data are delivered to the upper-layer terminal server through the terminal (sink) switch. The INT function of a switch is turned on/off by modifying the flow table inside the switch (here using the P4 language of the programmable data plane). When the INT function of a switch is opened, an INT tag is inserted into every data packet passing through it; the tag contains internal metadata collected by the programmable switch, such as the internal queue length of the switch, the queuing time, and the egress/ingress timestamps. By observing the internal state information of the switches, a network manager can know whether a network node, i.e., a switch, is congested, and can thus judge whether a located abnormal node is a true abnormal node, compute the reward, and send it to the experience replay pool.
Step S2: abnormal node location
The whole anomaly localization process comprises the following steps:
1. The terminal server (reinforcement learning agent) periodically collects the delay thresholds T = {T_1, T_2, ..., T_M} that the users can tolerate for their applications (services), where M is the number of users, and the end-to-end delays t = {t_1, t_2, ..., t_F} of the traffic flows in the network, where F is the number of service flows. When the delay of a traffic flow exceeds the tolerable threshold of the corresponding user, an abnormal node currently exists and localization is carried out (this delay check is sketched in the code after this list);
2. The terminal server generates the environment state from the delays of the network services, their routing path information and the users' delay tolerances, uses it as the input of the neural network, outputs the node most likely to be abnormal, i.e., the located abnormal node, and transmits this decision to the controller;
3. After receiving the decision from the terminal server, the controller opens the INT function of the abnormal node; data packets passing through that switch are then tagged with an INT label (containing the packet ingress-port time, egress-port time, link utilization, internal queue length of the switch, and so on);
4. The terminal server receives the INT information in the data packets, judges whether the switch is abnormal, and calculates the reward of the localization. When no anomaly remains in the network, the terminal server stops and waits for the next round of abnormal node localization.
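As a simple illustration of the trigger in step 1, the following Python sketch checks the collected flow delays against the users' tolerances; the per-flow mapping to a single user threshold and all names are assumptions made for illustration.

def needs_localization(flow_delays, flow_user, user_thresholds):
    """flow_delays: end-to-end delay t_f of each flow; flow_user[f]: the user of flow f;
    user_thresholds: tolerable delay T_m of each user. Returns the problem flows."""
    return [f for f, d in flow_delays.items()
            if d > user_thresholds[flow_user[f]]]   # non-empty -> start localization

flows = {"f1": 12.0, "f2": 87.0}
owner = {"f1": "u1", "f2": "u2"}
limits = {"u1": 50.0, "u2": 50.0}
print(needs_localization(flows, owner, limits))     # ['f2']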
As shown in fig. 6, the main components of reinforcement learning are: agent, action, state, reward, and environment. In the invention, the decision to open/close the INT function of switches in the network is made by reinforcement learning. The environment is the whole programmable network consisting of the underlying network devices; the agent is hosted on a terminal server and can obtain from the environment the routing paths of all services in the network, i.e., the service routing information, the INT tags carried in data packets, and the users' satisfaction with the current network. The agent interacts with the environment at discrete time steps; in each time slot, the agent observes the environment state (including traffic delays) and the reward of the corresponding action. The agent then selects a network operation, which is sent to the environment and changes it from the current environment state s_t to the environment state s_{t+1}. The goal of the reinforcement learning agent (RL agent) is to obtain as much reward as possible by choosing appropriate network operations through an optimal policy.
Specifically, in this embodiment, with reference to fig. 5 and 6, as shown in fig. 4, the reinforcement learning agent interacts with the environment at discrete time steps (positioning time slots), and in each positioning time slot, abnormal node positioning is performed once:
step S2.1: locating the first abnormal node
Initialize the network with no switch INT function opened. The environment state s_0, obtained when no network operation has been selected (no switch INT function is opened), is input into the decision neural network to obtain a Q value vector Q_0 containing n Q values, each corresponding to opening the INT function of a corresponding switch. The Q value vector Q_0 is sent to the action filtering module, which selects the switch corresponding to the largest Q value in Q_0 as the abnormal node and issues the corresponding network operation a_1 (the control instruction that opens the INT function of that switch) to the environment, which in turn causes the environment state s_0 to transition to the environment state s_1; the number of transitions t is set to 1.
Step S2.2: locating the (t+1)-th abnormal node
At time t, the environment state s_t reached after selecting network operation a_t is input into the decision neural network to obtain a Q value vector, denoted Q(s_t, a_t), containing n Q values, each corresponding to opening the INT function of a corresponding switch. The Q value vector Q(s_t, a_t) is sent to the action filtering module, which selects the switch corresponding to the largest Q value in Q(s_t, a_t) (if that switch has already had its INT function opened, the switch with the next largest Q value is selected, and so on until a switch whose INT function has not been opened is found); this switch is the (t+1)-th abnormal node. It is taken as the abnormal node, and the corresponding network operation a_{t+1} (the control instruction that opens the INT function of that switch) is issued to the environment, which in turn causes the environment state s_t to transition to the environment state s_{t+1}.
In the present invention, the action filtering module needs to filter the network operations. If the INT function were turned on/off uniformly for all ports of a switch, then in a network with N switches and a single failure point, the action space of each network operation would have size N. After a problem occurs in the network, however, the number of problem nodes is unknown, so the action space becomes an arbitrary subset of the N switches, of size 2 to the power N. If the INT function were instead controlled per port of each switch, the action space would grow, even with a single failure point, to the total number of switch ports, i.e., 2 × e, where e is the number of links in the network. Such exponential growth inevitably leads to action space explosion and makes training take too long.
In the present invention, to reduce the complexity of localization, the reinforcement learning agent (RL agent) is restricted so that each network operation (action) selection can open the INT function of only one switch, and the operation applies to all of that switch's ports. If the reinforcement learning agent decides to open the INT function of a certain switch, an INT label is inserted into every data packet of the traffic flows passing through that switch. Even when there are multiple abnormal nodes, the action space thus shrinks from the 2^N arbitrary subsets of {s1_on, s2_on, ..., sN_on} to {s1_on, s2_on, ..., sN_on} itself, because the size of the selected subset is limited to 1, so the action space has size N.
In addition, the network monitoring here is concerned with the traffic flows that users report as problematic, and some nodes in the network carry no traffic flows at all: their load is empty. When a user reports a problematic application, congestion cannot occur on these empty-load switches, so monitoring them is meaningless for network management; the actions that open their INT function are therefore filtered out of the action space. An available operation set path_set = {s_1, s_2, ..., s_n} denotes the network nodes traversed by the routing paths of all traffic flows in the current network, where n is the total number of such nodes (switches); the filtered action space is thus reduced from the size-N set {s1_on, s2_on, ..., sN_on} to {s1_on, s2_on, ..., sn_on}.
To prevent the agent from repeating an action, i.e., opening the INT function of a switch it has already opened, an optional-action set mask = {s_1, s_2, ..., s_n} is maintained in the environment, where each s_i takes the value 0 or 1: 0 indicates the switch has not yet been monitored and 1 indicates it has. The environment passes this variable to the agent, which reduces the selection of useless actions and speeds up the learning process.
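A minimal Python sketch of this filtering, selecting the admissible switch with the largest Q value, is given below; names and the 0-based indexing are illustrative.

def filter_action(q_values, path_set, mask):
    """q_values: Q value per switch (index 0..N-1);
    path_set: switches traversed by at least one traffic flow;
    mask[i] = 1 if the INT function of switch i has already been opened."""
    candidates = [i for i in path_set if mask[i] == 0]
    if not candidates:
        return None                                    # every useful switch is already monitored
    return max(candidates, key=lambda i: q_values[i])  # largest admissible Q value

q = [0.2, 0.9, 0.5, 0.7]
print(filter_action(q, path_set={1, 2, 3}, mask={0: 0, 1: 1, 2: 0, 3: 0}))  # 3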
In the invention, every environment state transition is accompanied by the opening of the INT function of a switch, and whether congestion exists in that switch is determined from the INT tag. In general, when the internal queue length of the switch satisfies q > 80% of max_q, where max_q denotes the longest queue of the switch, the switch is congested; congestion can also be judged from the switch's single-hop processing delay and the link utilization. When a switch is determined to be a problem node, a node with an abnormal network link performance index has been located; the single-hop delay of that switch is then subtracted from the delay of the traffic flows passing through it, the abnormal node is eliminated, and if a user remains unsatisfied the network still contains an abnormal node. (The single-hop processing delay of a switch is the difference between the egress and ingress timestamps carried in the INT tag.)
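A sketch of this congestion test on the INT tag, under the 80% queue-length rule above, is shown next; the field names are illustrative.

def is_congested(int_tag, max_q, delay_limit_us=None):
    """int_tag: metadata carried by packets through the opened switch
    (queue length, ingress/egress timestamps); max_q: longest queue of that switch."""
    if int_tag["queue_len"] > 0.8 * max_q:
        return True
    if delay_limit_us is not None:
        hop_delay = int_tag["egress_ts_us"] - int_tag["ingress_ts_us"]  # single-hop delay
        return hop_delay > delay_limit_us
    return False

tag = {"queue_len": 42, "ingress_ts_us": 100.0, "egress_ts_us": 180.0}
print(is_congested(tag, max_q=50))    # True: 42 > 0.8 * 50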
The input information of the neural network is obtained from the routing information (routing paths) of the traffic flows in the network and the users' satisfaction with those flows (the user feedback results). After a network operation is performed and the true abnormal node is repaired, the environment state transitions from s_t to s_{t+1}: for each switch it must be determined again whether any problem flow still passes through it, so the input of the decision neural network changes. Taking the input of fig. 7 as an example, suppose the user feedback indicates that both services f1 and f2 have problems. The INT function of switch node 3 is opened, monitoring finds it to be an abnormal node, and after switch node 3 is excluded (repaired), traffic f2 returns to normal. As shown in fig. 8, input (a), i.e., the environment state s_t, then becomes input (b), i.e., the environment state s_{t+1}: the 1 in row 3, column 2 becomes 0.
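The transition of fig. 8 amounts to recomputing the problem-flow flag of each switch once the repaired flow is marked normal; a sketch reusing the conventions of the build_state routine above (state as an n x (2n+2) numpy array) could look like this.

def update_state_after_repair(state, flows, feedback):
    """Recompute column 2 (problem-flow flag) after repaired flows are marked normal."""
    state[:, 1] = 0
    for path, bad in zip(flows, feedback):
        if bad:
            for sw in path:
                state[sw - 1, 1] = 1
    return state

# Fig. 8: after switch 3 is repaired, f2 becomes normal (feedback [1, 0]),
# so row 3, column 2 drops from 1 to 0 while the switches on f1's path keep the flag.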
Step S2.3: calculating rewards and constructing training data
In order to locate the nodes whose network link performance indicators are abnormal, as few network nodes (switches) as possible should be selected for INT label insertion, while whether the network problem has been solved can still be analysed in real time from the INT labels. The fewer nodes whose INT function is opened, the less usable information is collected from the network; conversely, the more nodes whose INT function is opened, the more local state information of the network devices is collected and the more complete the whole-network view becomes, but the larger the network telemetry overhead, and when the monitored nodes are not the ones that need attention, the information is redundant. Considering the correlation of traffic flow paths, if several problem flows pass through a network node n at the same time, training leads to opening the INT function of that node, and the collected switch state information (such as the internal queue length and queuing delay) confirms that it is a true abnormal node, then repairing it improves the response speed of several traffic flows at once; such an action is valuable and its reward should be set higher. Moreover, the reward mechanism in reinforcement learning should not be too complex: simple and clear rewards help guide the reinforcement learning agent to select network operations (actions) that improve the current environment. Based on this principle, at time t, after the INT function of the switch is opened, the agent calculates the reward r_t of the network operation (action) from the INT information collected in the data packets and constructs a piece of training data <s_t, a_t, r_t>.
The reward r_t is calculated according to the following formula:
r_t = v,  if the opened switch is monitored to be a true abnormal node
r_t = -m, otherwise
where v is the number of traffic flows that recover after the switch selected by network operation a_t is confirmed, through the INT information carried in the data packets flowing through it, to be a true abnormal node and is repaired. Using v as a positive reward reflects the importance of the node: such nodes mostly lie at intersections of traffic flow routing paths, carry several flows, bear a heavier load, and are more likely than other nodes to develop problems. m denotes the number of consecutive times the switch is monitored not to be a true abnormal node, which causes unnecessary telemetry overhead; it is fed back to the reinforcement learning agent as a negative reward to prevent such "bad behaviour" from recurring.
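Assuming the piecewise form given above, the reward computation reduces to a few lines; the function name is illustrative.

def compute_reward(is_true_abnormal: bool, v: int, m: int) -> float:
    """v: number of traffic flows that recover after the repaired switch;
    m: consecutive times the opened switch was found not to be a true abnormal node."""
    return float(v) if is_true_abnormal else float(-m)

# e.g. repairing a true abnormal node that restores 3 flows yields reward 3.0;
# a second consecutive wrong pick yields reward -2.0.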
The training data <s_t, a_t, r_t> is sent to the experience replay pool.
Step S2.4: judging whether to finish
Judge whether all N switches in the network have had their INT function opened. If so, end the abnormal node localization; otherwise, further judge whether an abnormal network link performance index still exists in the network. If not, end the abnormal node localization; if so, the environment state has transitioned to the environment state s_{t+1}; set t = t + 1 and return to step S2.2 to continue locating abnormal nodes.
Step S2.5: updating parameters of a decision neural network based on training data
When the amount of training data <s_t, a_t, r_t> in the experience replay pool reaches a set quantity D, part of the data is sampled from the experience replay pool to update the parameters of the decision neural network, so that the Q value output by the decision neural network is positively correlated with the reward value of the corresponding switch; the sampled portion is determined according to the specific situation.
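A minimal PyTorch sketch of this update is shown below; the batch size is an illustrative assumption, and the full DQN update with a target network and a discounted next-state term is simplified here to a regression of Q(s_t, a_t) onto the single-step reward, which matches the positive correlation required above.

import random
import torch
import torch.nn as nn

def update_decision_network(q_net, optimizer, replay_pool, batch_size=32):
    """replay_pool: list of training tuples <s_t, a_t, r_t>;
    a_t is the index of the switch whose INT function was opened."""
    batch = random.sample(replay_pool, batch_size)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in batch])
    actions = torch.tensor([a for _, a, _ in batch])
    rewards = torch.tensor([r for _, _, r in batch], dtype=torch.float32)

    # Q(s_t, a_t) for the chosen switches; regress towards the observed reward so that
    # larger rewards push the corresponding Q value up.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_pred, rewards)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The optimizer (for example torch.optim.Adam over the decision network's parameters) would be created once outside this routine so its state persists across updates.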
After the abnormal node localization is completed, wait for the next positioning time slot to arrive, then perform steps S2.1-S2.5 again to locate abnormal nodes and update the decision neural network parameters of the reinforcement learning agent.
The reinforcement learning-based network link performance index anomaly localization method of the invention locates abnormal nodes accurately, so the INT technique is used for real-time monitoring only on the abnormal nodes rather than on all nodes, thereby reducing network management overhead.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be understood that the present invention is not limited to the scope of these embodiments. Various changes that remain within the spirit and scope of the present invention as defined by the appended claims will be apparent to those skilled in the art, and everything that makes use of the inventive concept falls within the protection of the invention.

Claims (2)

1. A network link performance index anomaly localization method based on reinforcement learning, characterized by comprising the following steps:
(1) constructing a reinforcement learning agent
The reinforcement learning agent, the controller and the underlying network devices form a network telemetry closed-loop control system based on user experience;
the underlying network devices are the switches that form the network and serve as the reinforcement learning environment, and the controller is responsible for converting the decisions of the reinforcement learning agent into network operations and sending the corresponding control instructions to the environment, thereby causing the environment state to transition; a decision is to open the INT function of one switch, the network operation is the control instruction that opens that switch's INT function, and the environment state change is the change caused by repairing an abnormal node after the INT function of the corresponding switch has been opened;
the reinforcement learning agent is an intelligent decision system, hosted on a terminal server, that adaptively manages which underlying network devices have their INT function opened, and comprises a decision neural network, an action filtering module and an experience replay pool;
the input of the decision neural network is the environment state, which is generated by combining the traffic flow routing information obtained from the environment with the user feedback results, namely which service flows have problems;
(2) abnormal node location
The reinforcement learning agent interacts with the environment in discrete time steps, namely positioning time slots, and in each positioning time slot, abnormal node positioning is carried out for one time:
2.1) Initialize the network with no switch INT function opened. Obtain the environment state s_0 under the condition that no network operation has been selected, i.e., no switch INT function is opened, and input it into the decision neural network to obtain a Q value vector Q_0 containing n Q values, each corresponding to opening the INT function of a corresponding switch. Send the Q value vector Q_0 to the action filtering module; the action filtering module selects the switch corresponding to the largest Q value in Q_0 as the abnormal node and issues the corresponding network operation a_1, i.e., the control instruction that opens the INT function of that switch, to the environment, which in turn causes the environment state s_0 to transition to the environment state s_1; set the time t = 1;
2.2) At time t, input the environment state s_t reached after selecting network operation a_t into the decision neural network to obtain a Q value vector, denoted Q(s_t, a_t), containing n Q values, each corresponding to opening the INT function of a corresponding switch. Send the Q value vector Q(s_t, a_t) to the action filtering module; the action filtering module selects the switch corresponding to the largest Q value in Q(s_t, a_t): if that switch has already had its INT function opened, the switch with the next largest Q value is selected, and so on until a switch whose INT function has not been opened is found. Take the selected switch as the abnormal node, issue the corresponding network operation a_{t+1}, i.e., the control instruction that opens the INT function of that switch, to the environment, and monitor the abnormal node through the INT information. If it is a true abnormal node, repair it, which causes the environment state s_t to transition to the environment state s_{t+1}; if it is not a true abnormal node, further select the switch with the next largest Q value as the abnormal node, issue the corresponding network operation a_{t+1}, i.e., the control instruction that opens the INT function of that switch, to the environment, and then monitor and repair, or further select, monitor and repair, until a repair causes the environment state s_t to transition to the environment state s_{t+1};
2.3) At time t, after the INT function of the switch is opened, the agent calculates the reward r_t of the network operation from the INT information collected in the data packets and constructs a piece of training data <s_t, a_t, r_t>;
The reward r_t is calculated according to the following formula:
r_t = v,  if the opened switch is monitored to be a true abnormal node
r_t = -m, otherwise
where v is the number of traffic flows that recover after the switch selected by network operation a_t is monitored, through the INT information carried in the data packets flowing through it, to be a true abnormal node and is repaired, and m denotes the number of consecutive times the switch is monitored not to be a true abnormal node;
The training data <s_t, a_t, r_t> is sent to the experience replay pool;
2.4) Judge whether all N switches in the network have had their INT function opened. If so, end the abnormal node localization; otherwise, further judge whether an abnormal network link performance index still exists in the network. If not, end the abnormal node localization; if so, the environment state has transitioned to the environment state s_{t+1}; set t = t + 1 and return to step 2.2) to continue locating abnormal nodes;
2.5) When the amount of training data <s_t, a_t, r_t> in the experience replay pool reaches a set quantity D, sample part of the data from the experience replay pool to update the parameters of the decision neural network, so that the Q value output by the decision neural network is positively correlated with the reward value of the corresponding switch; the sampled portion is determined according to the specific situation;
After the abnormal node localization is completed, wait for the next positioning time slot to arrive, then perform steps 2.1)-2.5) again to locate abnormal nodes and update the decision neural network parameters of the reinforcement learning agent.
2. The reinforcement learning-based network link performance index anomaly localization method according to claim 1, characterized in that the user feedback result is determined from the user satisfaction: the user satisfaction is mapped to 0 and 1, where a satisfaction of 0, i.e., unsatisfied, means the current service flow has a delay problem and is a problem service, and a satisfaction of 1, i.e., satisfied, means the current service flow has no delay problem and is not a problem service;
the environment state is a matrix of dimension n x (2n+2), where n is the total number of network nodes, i.e., switches, traversed by the routing paths of all service flows in the network, and the switches are numbered from 1 to n; the row number of the matrix is the switch number, column 1 of row i gives the number of services flowing through switch i, column 2 of row i indicates whether a problem service flows through switch i, where 1 means yes and 0 means no, columns 3 to (n+2) indicate the source nodes of the service flows passing through switch i, and if the source node of one of those services is switch k, then row i, column k+2 is 1; columns (n+3) to (2n+2) indicate the destination nodes of the service flows passing through switch i, and if the destination node of one of those services is switch k, then row i, column k+n+2 is 1.
CN202110270428.8A 2021-03-12 2021-03-12 Network link performance index anomaly localization method based on reinforcement learning Active CN113162800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270428.8A CN113162800B (en) 2021-03-12 2021-03-12 Network link performance index anomaly localization method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270428.8A CN113162800B (en) 2021-03-12 2021-03-12 Network link performance index anomaly localization method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN113162800A CN113162800A (en) 2021-07-23
CN113162800B true CN113162800B (en) 2022-06-14

Family

ID=76886873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270428.8A Active CN113162800B (en) 2021-03-12 2021-03-12 Network link performance index anomaly localization method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113162800B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113747254B (en) * 2021-09-08 2022-08-05 浙江大学 Video stream scheduling method and system based on in-band network telemetry
CN114172796B (en) * 2021-12-24 2024-01-30 中国工商银行股份有限公司 Fault positioning method and related device for communication network
CN115051959B (en) * 2022-08-16 2022-10-21 广东省新一代通信与网络创新研究院 Remote measuring method and system based on user content
CN115473797B (en) * 2022-08-29 2023-10-24 电子科技大学 Intelligent network fault positioning method based on end-to-end service performance index
CN116566859B (en) * 2023-06-28 2023-10-31 广州极电通信技术有限公司 Network anomaly detection method and system for switch

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109672591A (en) * 2019-01-21 2019-04-23 中国科学技术大学 The method of the sampling band network telemetering of real-time programmable
CN109787833A (en) * 2019-01-23 2019-05-21 清华大学 Network exception event cognitive method and system
CN112422498A (en) * 2020-09-04 2021-02-26 网络通信与安全紫金山实验室 In-band network remote measuring method, system and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10924352B2 (en) * 2018-01-17 2021-02-16 Nicira, Inc. Data center network topology discovery

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109672591A (en) * 2019-01-21 2019-04-23 中国科学技术大学 The method of the sampling band network telemetering of real-time programmable
CN109787833A (en) * 2019-01-23 2019-05-21 清华大学 Network exception event cognitive method and system
CN112422498A (en) * 2020-09-04 2021-02-26 网络通信与安全紫金山实验室 In-band network remote measuring method, system and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Orchestrating In-Band Data Plane Telemetry With Machine Learning; Rumenigue Hohemberger et al.; IEEE Communications Letters; 2019-12-12; full text *
Deep-learning-based programmable cross-layer network service performance awareness *** for IP-over-EON; Zhu Zuqing et al.; Journal on Communications (通信学报); 2019-11-30; full text *

Also Published As

Publication number Publication date
CN113162800A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113162800B (en) Network link performance index anomaly localization method based on reinforcement learning
CN112491714B (en) Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
CN111526096A (en) Intelligent identification network state prediction and congestion control system
CN109413707B (en) Intelligent routing method based on deep reinforcement learning technology in wireless network environment
CN112532409B (en) Network parameter configuration method, device, computer equipment and storage medium
US20200177495A1 (en) Route control method and route setting device
CN108696453B (en) Lightweight SDN service flow notification method and system
CN114500360A (en) Network traffic scheduling method and system based on deep reinforcement learning
CN112422443A (en) Adaptive control method, storage medium, equipment and system of congestion algorithm
CN111245718A (en) Routing optimization method based on SDN context awareness
CN114827021B (en) Multimedia service flow acceleration system based on SDN and machine learning
CN111404815B (en) Constrained routing method based on deep learning
CN116743635A (en) Network prediction and regulation method and network regulation system
CN116527565A (en) Internet route optimization method and device based on graph convolution neural network
Huang et al. Machine learning for broad-sensed internet congestion control and avoidance: A comprehensive survey
CN112350948B (en) Distributed network tracing method of SDN-based distributed network tracing system
CN114205300B (en) Flow scheduling method capable of guaranteeing coflow transmission deadline under condition of incomplete flow information
CN113518039A (en) Deep reinforcement learning-based resource optimization method and system under SDN architecture
CN114598614A (en) Path prediction method and related device
CN110138674B (en) Programmable data plane flow scheduling method, system, medium and network equipment
CN116455820A (en) Multi-transmission path adjustment system and method based on congestion avoidance
CN116055324A (en) Digital twin method for self-optimization of data center network
Ferriol-Galmés et al. FlowDT: a flow-aware digital twin for computer networks
CN115604724A (en) SRv6 method for evaluating risk of congestion of backbone network flow
CN115473797B (en) Intelligent network fault positioning method based on end-to-end service performance index

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant