CN115150335A - Optimal flow segmentation method and system based on deep reinforcement learning - Google Patents

Optimal flow segmentation method and system based on deep reinforcement learning

Info

Publication number
CN115150335A
Authority
CN
China
Prior art keywords
router
state
reinforcement learning
controller
routers
Prior art date
Legal status
Granted
Application number
CN202210757920.2A
Other languages
Chinese (zh)
Other versions
CN115150335B (en)
Inventor
黄东东
毛斐
强小应
王建
Current Assignee
Fiberhome Telecommunication Technologies Co Ltd
Wuhan Fiberhome Technical Services Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Wuhan Fiberhome Technical Services Co Ltd
Priority date
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd and Wuhan Fiberhome Technical Services Co Ltd
Priority to CN202210757920.2A
Publication of CN115150335A
Application granted
Publication of CN115150335B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/12 Avoiding congestion; Recovering from congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/08 Configuration management of networks or network elements
    • H04L 41/0893 Assignment of logical groups to network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks, using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/02 Topology update or discovery
    • H04L 45/08 Learning-based routing, e.g. using neural networks or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/38 Flow based routing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention relates to the field of communications, and in particular to a method and system for optimal traffic segmentation based on deep reinforcement learning. The method mainly comprises the following steps: acquiring the current states of all routers that require traffic segmentation and establishing a router state matrix; iteratively establishing a training data set from all the router state matrices obtained after each traffic segmentation, wherein the output of the training data set is a state quadruple of the whole-network links; training a reinforcement learning model with the sum of the average throughput of each router over a specified number of cycles after traffic planning as the Reward, and performing traffic segmentation for each router with the trained model; and, after data packets have been sent to the next router node according to the segmentation, acquiring the router state matrix at the next moment, iteratively updating the training data set, and retraining the model with the updated training data set. The invention can achieve dynamic planning that follows the changes of all links in the network and obtain the traffic segmentation scheme with the highest long-term benefit.

Description

Optimal flow segmentation method and system based on deep reinforcement learning
Technical Field
The invention relates to the field of communication, in particular to an optimal flow segmentation method and system based on deep reinforcement learning.
Background
With the continuous expansion of network scale, the routing of data packets transmitted in the network becomes complicated, and when a node or link is congested by burst traffic, the traffic must be split across different nodes. How to set the traffic split ratios according to the weights of the routes is crucial both for the individual router and for the entire network.
The network topology shown in fig. 1 is a weighted graph: each circle represents a router node, a connecting line represents a reachable link between two router nodes, and the weight of the line is the proportion of traffic allocated to that link. The traditional approach is mainly to configure the split ratio of each route manually. When the network environment changes in complex ways, a fixed split ratio can no longer meet the network's requirements, the interaction between routers in the network cannot be assessed, and the performance of the whole network cannot be evaluated. Therefore, as the number of network users grows, the network environment will become ever more complex, and how to dynamically generate the optimal traffic split ratios from the real network state becomes a challenge for future technology.
In view of this, how to overcome the defects of the prior art and handle the fact that the optimal traffic split ratios change as the network environment changes dynamically is a problem to be solved in this technical field.
Disclosure of Invention
In response to the above deficiencies in the art or needs in the art, the present invention addresses the problem of optimal traffic segmentation in a dynamic network environment.
The embodiment of the invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for optimal traffic segmentation based on deep reinforcement learning, specifically comprising: acquiring the current states of all routers needing flow segmentation, and establishing a router state matrix; iteratively establishing a training data set by using all router state matrixes obtained after each flow segmentation, wherein the output of the training data set is a state quadruplet of a whole network link, and the state quadruplet is specifically [ a router state matrix, a weight vector, average throughput and a next-time router state matrix ], wherein the weight vector is a proportion for each next-hop router flow segmentation; training a reinforcement learning model by taking the sum of the average throughput of each router in a specified number of cycles after flow planning as Reward, and performing flow segmentation on each router by using the trained model; and after the data packet is sent to the next router node according to the flow segmentation, acquiring a router state matrix at the next moment, carrying out iterative updating on the training data set, and retraining the model by using the updated training data set.
Preferably, the reinforcement learning model specifically includes: the strategies of the reinforcement learning model comprise learning based on a strategy Actor and learning based on a value function Q, the input of the reinforcement learning model is a router state matrix of the whole network, the output of the Actor is a set of all possible weight vectors from a current router to adjacent routers, and Q is a score obtained according to the sum of average throughputs of all routers of the whole network after flow segmentation.
Preferably, the weight vector specifically includes: for each router, the value range of the weight vector of the flow division is [0,1], and the sum of all the weight vectors is 1.
Preferably, the iteratively establishing the training data set by using all the router state matrices after each traffic segmentation specifically includes: randomly selecting a group of weight vectors as initial weight vectors, and taking state quadruples corresponding to the initial weight vectors as initial data of a training data set; and after the router transmits the data packet to the next node each time, adding the state quadruple of the whole network topology into the training data set, and performing iterative updating on the training data set.
Preferably, the iteratively establishing the training data set by using all the router state matrices after each traffic segmentation further includes: generating random numbers in at least one [0,1] interval for each router, the sum of all the generated random numbers being 1, and using the generated random numbers as initial values of the weight vector of the router.
Preferably, the method, with the sum of average throughputs of each router in a specified number of cycles after traffic planning being Reward, specifically includes: and taking the average throughput of all routers as the Reward of the current period, and taking the average throughput of the whole network of each period in a specified number of periods after each traffic division as the Reward of the traffic division.
Preferably, the establishing a router state matrix specifically includes: the method comprises the steps of obtaining a state vector of each router, and forming a state matrix by the state vectors of all the routers, wherein each row in the state matrix is the state vector of one router, each column is a field in the state vector, and the state vector of each router comprises at least two items of time period, bandwidth, current load, time delay, speed and configuration index of the current time point of each router.
Preferably, the training of the reinforcement learning model specifically includes: pre-training the reinforcement learning model by using offline data; and/or, performing online iteration on the reinforcement learning model by using real-time data.
On the other hand, the invention provides a system for optimal flow segmentation based on deep reinforcement learning, which specifically comprises the following steps: the controller acquires the states of all routers, inputs the state matrixes of all the routers into the trained reinforcement learning model, acquires the output weight vector according to the method provided by the first aspect, and sends the output weight vector to the routers; the router performs flow segmentation according to the received weight vector, sends the data packet to the next node and sends the self state to the controller; and the controller generates a state matrix of the router at the next moment by using the latest state of the router, adds a state quadruple containing the state matrix at the next moment into a training data set, and trains the reinforcement learning model by using the updated training data set.
Preferably, the acquiring, by the controller, the states of all routers specifically includes: the controller sends router state vector acquisition instructions to all routers, each router sends the current state vector of the router to the controller, and the controller generates a state matrix of the whole network router according to the state vectors of all the routers; the method comprises the following steps that a controller sends link state acquisition instructions to all routers, each router sends the current link state of the router to the controller, and the controller determines an Action Set of a reinforcement learning model according to the link state of the whole network router; when the router performs flow segmentation according to the weight vector, the data packet is transmitted to the next node and then is reported to the controller, the controller sends a throughput submitting instruction and a router state vector collecting instruction to the reported router, the reported router sends the current throughput and the state matrix of the reported router to the controller, and the controller generates the router state matrix of the next moment.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: traffic segmentation for all routers is performed by a deep reinforcement learning model according to the state matrix of the whole-network routers, the traffic segmentation strategy with the maximum average throughput over a specified number of cycles is obtained, and the training data set is iterated with the new state matrix obtained after each segmentation. In this way, optimal dynamic planning of traffic segmentation can follow the current state changes of all links in the network and the iteration of the training data set, and by taking the sum of the average throughputs over multiple cycles as the planning target, the traffic segmentation scheme with the highest long-term benefit is obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic diagram of a network structure topology used in a specific scenario according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for optimal traffic segmentation based on deep reinforcement learning according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a state matrix of a router at a certain time in a specific scenario according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a set of actions at a particular time in a particular scenario according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a multi-step stored data model used in an embodiment of the present invention;
fig. 6 is a flowchart of another method for optimal traffic segmentation based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 7 is a diagram of a reinforcement learning model used in an embodiment of the present invention;
FIG. 8 is a diagram illustrating an Actor-Critic algorithm model used in an embodiment of the invention;
FIG. 9 is a schematic diagram of a neural network model used in an embodiment of the present invention;
fig. 10 is a schematic diagram of a system architecture of optimal traffic segmentation based on deep reinforcement learning according to an embodiment of the present invention;
fig. 11 is a system operation timing diagram of optimal traffic segmentation based on deep reinforcement learning according to an embodiment of the present invention;
fig. 12 is a flowchart of a system for optimal traffic segmentation based on deep reinforcement learning according to an embodiment of the present invention;
fig. 13 is a flowchart of a system operation of optimal traffic segmentation based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The present invention is a system structure of a specific function system, so the functional logic relationship of each structural module is mainly explained in the specific embodiment, and the specific software and hardware implementation is not limited.
In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
Most traditional traffic segmentation is based on the network topology; in the big data era, data-driven traffic segmentation has become a major focus. The traffic segmentation method provided by this embodiment relies on the big data generated by the network itself, uses a deep reinforcement learning algorithm to learn the traffic characteristics contained in the network environment, dynamically calculates the optimal traffic split ratios for the whole network, and can effectively reduce network delay and increase network throughput.
As shown in fig. 2, the method for optimal flow segmentation based on deep reinforcement learning according to the embodiment of the present invention includes the following specific steps:
step 101: and acquiring the current states of all routers needing to be subjected to flow segmentation, and establishing a router state matrix.
To obtain traffic split ratios that are optimal for the entire network, traffic must be split and the benefit evaluated according to the state of each router in the network. Therefore, the state vector of each router is acquired first. In a specific implementation, the parameters of the state vector can be chosen from the attributes related to path selection, for example at least two of: the time period of the current time point, the bandwidth, the current load, the time delay, the rate and the configuration index of each router.
To facilitate uniform processing by the neural network model, after the state vector of each router is obtained, the state vectors of all routers are combined into a state matrix (Router Matrix): each router corresponds to one state vector, each row of the state matrix is the state vector of one router, and each column is a field of the state vector. The router state matrix is used as the input of deep reinforcement learning. The state matrix contains the state information of all routers in the whole network and also reflects the traffic segmentation characteristics of packet transmission. The deep learning algorithm extracts features from these data and maps the router state matrix to the optimal traffic split ratios.
Fig. 3 shows the router state matrix at a specific time in the usage scenario; the meaning of each field is as follows.
Router_ID: the router ID, used to uniquely identify the router.
Timestamp: the time period of the current time point. In a specific implementation, time periods can be divided as required and used as the cycles for path planning, state collection and return evaluation. For example, a day can be divided into 24 periods, in which case Timestamp takes a value in [1, 24].
Bandwidth: the bandwidth of the router.
Load: the current load of the router; the specific value may be a percentage of the maximum load, taking a value in [1, 100].
Time_Delay: the time-delay data of the router.
Rate: the rate data of the router.
Configuration index: performance configuration indexes of the router, such as CPU computing power and available storage space, which can be expressed quantitatively as a comprehensive evaluation value.
To facilitate computation, the router state matrix that is input to the reinforcement learning model and stored in the training data set has the following data format:

X = [ x_11  x_12  …  x_1n
      x_21  x_22  …  x_2n
      …
      x_m1  x_m2  …  x_mn ]

where each row is the current state vector of one router;
m is the number of routers in the network, a positive integer;
n is the number of dimensions of each router's current state information, equal to the dimension of the state vector;
x_mn is an attribute value of the corresponding router.
To avoid conflicts between the training data set and the traffic segmentation data, the order of the router rows in each state matrix and the order of the fields within each row must be kept identical during model training and actual traffic segmentation. For example, the rows are arranged in order of Router_ID and the state vector of each router is stored in a fixed data structure. When the number of routers in the network changes, the reinforcement learning model must be retrained with the updated state matrix before it is used again for traffic segmentation.
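By way of illustration only, the following Python sketch (not part of the original disclosure) shows one way to assemble the router state matrix described above with NumPy; the field list and the fixed ordering by Router_ID are assumptions for the example.

import numpy as np

# One state vector per router: [Timestamp, Bandwidth, Load, Time_Delay, Rate, Config]
STATE_FIELDS = ["Timestamp", "Bandwidth", "Load", "Time_Delay", "Rate", "Config"]

def build_state_matrix(router_states):
    # Rows are ordered by Router_ID so that training and inference see the same layout.
    ordered_ids = sorted(router_states)
    matrix = np.array([router_states[rid] for rid in ordered_ids], dtype=np.float32)
    assert matrix.shape[1] == len(STATE_FIELDS)   # n = dimension of the state vector
    return matrix                                 # shape (m, n)

# Example: 3 routers with 6-dimensional state vectors
states = {
    1: [5, 1000.0, 37.0, 2.1, 480.0, 0.8],
    2: [5,  400.0, 62.0, 3.4, 190.0, 0.6],
    3: [5,  800.0, 18.0, 1.7, 350.0, 0.9],
}
print(build_state_matrix(states))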
Step 102: iteratively establish the training data set using all the router state matrices obtained after each traffic segmentation, the output of the training data set being a state quadruple of the whole-network links.
In the method provided in this embodiment, the available traffic of each router is divided according to the values of the state vectors in the router state matrix, and the proportion of traffic that the router allocates to each adjacent routing node after the division is taken as the calculation result. The average throughput of the whole network serves as the criterion for evaluating the traffic segmentation result, and this data must be included when training and predicting with the reinforcement learning model. In addition, so that each router's traffic allocation yields benefits over as long a future horizon as possible, the method considers not only the instantaneous network state right after a segmentation but also the router states of several subsequent cycles, and evaluates these cycles together. The router state matrix at the next moment after a segmentation is therefore collected as an input for the next cycle's prediction, and this multi-cycle benefit evaluation improves the accuracy of path planning. Consequently, the state quadruple used by the training data set in this embodiment is [router state matrix, weight vector, average throughput, router state matrix at the next moment].
In the state quadruple, the weight vector gives the proportion of traffic split to each next-hop router. To make full use of each router's traffic resources, the available traffic of each router is taken as 1: after segmentation, each weight lies in the range [0,1] and all weights sum to 1.
Step 103: train a reinforcement learning model with the sum of the average throughput of each router over a specified number of cycles after traffic planning as the Reward, and perform traffic segmentation for each router with the trained model.
The Reward of reinforcement learning is an important index of the algorithm, and an appropriate Reward allows the model to converge as quickly as possible. In the method provided by this embodiment, the average throughput of all routers in the network is used as the return, i.e. the mean of the throughputs of all routers after packets have been forwarded from each router to its adjacent routers serves as the criterion for evaluating a traffic segmentation, without needing to track whether a packet has already reached its target node. The higher the whole-network average throughput, the more efficient the network transmission under that split ratio, so the average throughput of all routers is taken as the Reward of the current cycle. After packets are forwarded from each router to its adjacent routers according to the current segmentation, the average throughput of all routers is taken as the return of the next state. In order to maximize long-term future benefit after reinforcement learning selects a traffic segmentation Action, the benefit is evaluated with the sum of the average throughputs over the cycles up to t+n+1, i.e. the whole-network average throughput of each cycle within a specified number of cycles after each segmentation is counted toward the Reward of that segmentation. In a specific implementation, the number of cycles n used for benefit evaluation can be chosen according to actual needs.
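As a minimal sketch of the Reward just described (plain Python; names are illustrative and not from the patent), the per-cycle reward is the mean throughput over all routers and the return of one traffic segmentation is the sum of those per-cycle rewards over the evaluation window:

def cycle_reward(throughputs_mbps):
    # Average throughput of all routers in one cycle (r_t)
    return sum(throughputs_mbps) / len(throughputs_mbps)

def segmentation_reward(per_cycle_throughputs):
    # Sum of r_t over the specified number of cycles after the segmentation
    return sum(cycle_reward(cycle) for cycle in per_cycle_throughputs)

# Example: 3 routers observed over 4 cycles after one traffic segmentation
window = [[300, 250, 420], [310, 260, 430], [305, 255, 400], [320, 270, 410]]
print(segmentation_reward(window))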
In reinforcement learning, the agent selects an Action according to the current environment state, and the Action belongs to an Action Set. In the method provided by this embodiment, the current action set consists of the possible weight vectors from the current router to all of its neighboring (next-hop) routers.
As shown in fig. 4, the action set is the set of traffic split ratios from router ID1 to all of its neighboring routers. Each value in a weight vector represents a traffic split proportion and lies in the range [0,1]. The attributes in the figure have the following meanings.
S_RouterID: the source router ID.
N_Router_W: the weight corresponding to a neighboring router.
Fig. 4 shows three traffic segmentation strategies of source router ID1 over its 4 neighboring routers, [0.3, 0.2], [0.4, 0.15, 0.25, 0.2] and [0.18, 0.32, 0.28, 0.22]; the weights of each segmentation mode sum to 1.
In order to facilitate policy selection, in a specific implementation scenario of this embodiment, a policy of a reinforcement learning model includes learning based on a policy Actor and learning based on a value function Q, an input of the reinforcement learning model is a router state matrix of the whole network, an output of the Actor is a set of all possible weight vectors from a current router to an adjacent router, and Q is a score obtained according to a sum of average throughputs of all routers of the whole network after traffic segmentation.
When traffic is segmented, the reinforcement learning AI randomly selects an Action from the Action Set to execute, evaluates the execution result according to the Reward, and finally outputs the Action with the greatest benefit, i.e. the traffic segmentation scheme with the maximum whole-network average throughput over the specified cycles. Each model corresponds to the traffic segmentation scheme of one router and outputs a group of traffic segmentation weight vectors.
Step 104: after the data packets have been sent to the next router node according to the traffic segmentation, acquire the router state matrix at the next moment, iteratively update the training data set, and retrain the model with the updated training data set.
In the method provided by this embodiment, the Reward of reinforcement learning is a long-term benefit: it represents the return obtained by time t+n+1 after the current Actor selects an Action under the current environment. Therefore, when sample data are collected, the average throughputs from time t+1 to time t+n+1 need to be stored in the training data set for iterative computation. For convenient storage and use, an actual implementation may use the multi-step storage data model shown in fig. 5, storing the state quadruple of each cycle in a database so that the data from time t to time t+n+1 can be fetched from the database in batch during model training and path planning.
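A possible Python sketch of this multi-step storage follows; it is an assumption about the structure sketched in fig. 5 rather than the patent's own data model. One state quadruple is appended per cycle, and training fetches the window from t to t+n+1 in one batch.

from collections import deque

class MultiStepBuffer:
    def __init__(self, horizon_n, capacity=10000):
        self.horizon_n = horizon_n
        self.buffer = deque(maxlen=capacity)   # quadruples ordered by cycle

    def store(self, state, weight_vector, avg_throughput, next_state):
        self.buffer.append((state, weight_vector, avg_throughput, next_state))

    def window(self, t):
        # Quadruples of cycles t .. t+n+1, if that many have already been collected
        end = t + self.horizon_n + 2
        if end > len(self.buffer):
            return None
        return list(self.buffer)[t:end]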
After the steps 101 to 104 provided in this embodiment, an AI algorithm can be used to dynamically generate an optimal traffic segmentation strategy according to the real-time state of the current network, so as to effectively reduce network delay, reduce bit error rate, improve throughput, and obtain the optimal network planning effect in a longer period.
In the method provided in this embodiment, because the data of the cycles up to t+n+1 are needed for benefit evaluation, establishing the training data set in step 102 requires not only the data of the current cycle t but all data from cycle t to cycle t+n+1; therefore, after each traffic segmentation and the completion of the corresponding data transmission, the data in the training data set must be updated iteratively. Specifically, as shown in fig. 6, the training data set is iteratively established from the router state matrices obtained after each traffic segmentation as follows.
Step 201: randomly select a group of weight vectors as the initial weight vectors, and take the state quadruples corresponding to the initial weight vectors as the initial data of the training data set.
When a reinforcement learning model is used for carrying out flow segmentation, weight vectors from a current router to all adjacent routers are used as Action Set, a group of weight vectors are randomly selected as Action to start evaluation, and a state quadruple corresponding to a router state matrix in a current t period is used as initial data of a training data Set.
Step 202: after the router transmits the data packet to the next node each time, add the state quadruple of the whole network topology to the training data set and iteratively update the training data set.
In path planning, the Reward is obtained at the moment the data packet reaches the adjacent router. For a traffic segmentation, the next moment, i.e. the t+1 cycle, refers to the state matrix collected again after the packet has been sent from the current router node to the next router node, i.e. the next-moment router state matrix in the state quadruple. Step 202 is repeated until the state quadruple of cycle t+n+1 has been collected, which completes the collection of one group of training data. In actual operation, when path planning is performed for the next cycle, t = t+1: the data of the old cycle t that are no longer needed are removed from the training data set and the latest data of cycle (t+1)+n+1 are added, which yields the new training data set.
Through the steps 201 to 202, the training data set can be obtained iteratively, and long-term evaluation of multi-cycle income is achieved.
Furthermore, the collected training data can be preprocessed with methods such as zero-mean centering, normalization, PCA (principal component analysis) and whitening, which further improves the accuracy of the training data set and the training efficiency.
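A short NumPy sketch of the zero-mean/normalization preprocessing mentioned above, applied per field (column) of a router state matrix; PCA or whitening would follow the same pattern. The epsilon guard is an implementation detail, not from the patent.

import numpy as np

def standardize_columns(state_matrix):
    # Zero-mean, unit-variance per field (column)
    mean = state_matrix.mean(axis=0)
    std = state_matrix.std(axis=0) + 1e-8      # avoid division by zero for constant fields
    return (state_matrix - mean) / std, mean, std

X = np.array([[5, 1000.0, 37.0], [5, 400.0, 62.0], [5, 800.0, 18.0]])
X_norm, mu, sigma = standardize_columns(X)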
In this embodiment, because the network changes dynamically, the training data set also changes dynamically according to the result of path planning in each cycle; the reinforcement learning model therefore needs to be trained multiple times using the training data set that iterates over time.
(1) Before actual use, the reinforcement learning model can be pre-trained with offline data to improve the path-planning efficiency and accuracy in the initial stage. When training data are generated, N random numbers whose sum is 1 can be produced by a random function; each random number is one value of the weight vector and represents the traffic split proportion of an adjacent node. In practice, to avoid generating too many distinct random numbers, each random number can be limited to one digit after the decimal point. Since the sum of all weights must be 1, the same number of weights as neighboring routers can be generated by the following rule. When the number of adjacent routers is n: the first random number = random(0, 1); the second random number = random(0, 1 - the first); the third random number = random(0, 1 - the first - the second); ...; the n-th random number = random(0, 1 - the first - the second - ... - the (n-1)-th). A code sketch of this rule is given below.
(2) When the path is planned actually, in order to ensure that the planned path is consistent with the actual situation, the reinforcement learning model needs to be iterated online by using real-time data. When the flow division is carried out, each model corresponds to one router node. If a network has 100 nodes, only 100 models are needed for calculation, and the number of models is relatively small, so that in a common implementation scenario, in order to ensure the accuracy of traffic segmentation, real data can be used as a training data set. In some special scenarios, the data set may also be extended using GAN or the like as needed.
Training of the reinforcement learning model can be completed by using these two approaches independently or in combination, and a traffic segmentation strategy that meets the maximum expected benefit can then be planned.
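The following Python sketch illustrates the random initialization rule of item (1) above, limited to one digit after the decimal point; to guarantee that the weights sum exactly to 1, the last weight is taken here as the remainder, which is an assumption rather than the literal rule of the text.

import random

def random_weight_vector(num_neighbors):
    weights, remaining = [], 1.0
    for _ in range(num_neighbors - 1):
        w = round(random.uniform(0.0, remaining), 1)   # one digit after the decimal point
        weights.append(w)
        remaining = round(remaining - w, 1)
    weights.append(round(remaining, 1))                # remainder closes the sum to 1
    return weights

print(random_weight_vector(4))   # e.g. [0.3, 0.4, 0.1, 0.2]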
According to the optimal traffic segmentation method based on deep reinforcement learning, the routing characteristics contained in the network environment are learned by adopting a deep reinforcement learning algorithm based on big data generated by the network, so that the traffic segmentation strategy of the whole network is dynamically calculated, the network delay can be effectively reduced, and the throughput of the network can be improved.
Example 2:
in the method of traffic segmentation based on deep reinforcement learning provided in embodiment 1, deep reinforcement learning is performed using a reinforcement learning model. In this embodiment, some available specific configuration methods and parameters of the neural network model are provided. In practical implementation, specific neural network model selection and configuration can be performed according to practical needs by referring to the following parameters.
In step 103, the traffic segmentation weight vector is learned using reinforcement learning. Using the reinforcement learning model shown in fig. 7, after the Agent in reinforcement learning interacts with the Environment and takes an Action, the current state changes to the State of the next time point and a Reward is obtained. If the action taken is positive and a higher reward is obtained, the agent will increase the probability of selecting this action the next time it encounters a similar environment. Conversely, if the action taken is negative and a lower reward or a negative score is obtained, the agent will decrease the probability of selecting this action the next time it encounters a similar environment.
In a preferred embodiment, an Actor-Critic algorithm can be used to learn the optimal traffic segmentation, although the method is not limited to it. As shown in the schematic diagram of fig. 8, Actor-Critic is an actor-critic model that combines the advantages of policy-based learning and value-function-based learning. The Actor is responsible for learning the policy, and the Q function scores the learned policy; the higher the score, the better the policy learned by the Actor. During training, the parameters of Q are updated by gradient ascent, the parameters of the Actor are updated with what Q has learned, and the parameters of Q are in turn updated with those of the Actor, until the whole network converges.
In a specific implementation, a neural network (NN) can be used as the network of the Actor and of Q. As shown in the network structure diagram of fig. 9, the Actor and Q form one large network. The input of the whole model is the router state matrix (Router Matrix), the output of the Actor is one element of the Action Set, i.e. a group of weight vectors, and the output of Q is a score, where higher is better. When the model converges, the algorithm can select, for the current input (the router state matrix), a group of weight vectors that makes Q output a large value; that is, the algorithm finds a group of traffic segmentation strategies under the current network environment such that the return obtained by the network (the average throughput) is maximal. During training, the parameters of Q are estimated first, and then Q is fixed while the parameters of the Actor are updated. The Actor is updated by gradient ascent, i.e.

θ_π ← θ_π + η · ∇_θπ Q(s_t, π(s_t)).

When computing the action, a noise-parameter method is adopted: the action is generated using the Q network parameters with added noise. The advantage is that the model explores actions systematically rather than purely at random, and the action learned in this way maximizes the resulting Q value.
Further, in the method provided in embodiment 1, the Reward of reinforcement learning is a long-term benefit: it represents the return obtained by time t+n+1 after the current actor selects an action under the current environment. Therefore, when sample data are collected, the multi-step storage data model shown in fig. 5 can be used, and the average throughputs from time t+1 to time t+n+1 need to be stored in the database. The advantage is that the data from t to t+n+1 can conveniently be fetched from the database in batch when training the model.
At the current moment, r_t is the average throughput of the network after the packet is transmitted from the source router to the target router; at time t+n+1, r_{t+n+1} is the average throughput of the network after the packet is transmitted from the source router to the target router. The objective function is then to make Q(s_t, a_t) approach the sum of r_t at the current time through r_{t+n+1} at time t+n+1, plus the value given at time t+n+1 by the target network Q_target, whose initial value is Q. During training, Q_target must be held fixed in order to update the parameters of Q.
The Reward of reinforcement learning is an important index of the algorithm, and a suitable Reward allows the model to converge as quickly as possible. In the method of embodiment 1, the average throughput of all routers in the network is used as the Reward, i.e. the mean of the throughputs of all routers after a packet has been forwarded from the current router to the next router is the current return. After the packet has been forwarded from the current router to the next router according to the new traffic segmentation, the average throughput of all routers is the return of the next state. Because reinforcement learning maximizes long-term future benefit after selecting an Action, the Reward is the sum of all average throughputs of the future cycles up to t+n+1.
Furthermore, in order to predict more accurately the influence of an Action on future benefit, a discount factor γ can be introduced. It represents the degree to which the current Action influences future benefit: the farther a cycle is from the current time point, the smaller the influence of the current Action on it. γ can be chosen according to actual conditions and generally lies in (0, 1).
r_t = Average(Throughput_r_1 + Throughput_r_2 + Throughput_r_3 + … + Throughput_r_n)

R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^(n+1)·r_{t+n+1}

Throughput_r_n is the throughput of the n-th router; the unit may be Mbps.

r_t is the average throughput of all routers on a path for the current time period of the system.

R_t is the long-term expected benefit of the system, i.e. the influence of the current selection on the future (the overall Reward), computed as a weighted sum of the whole-network average throughput over the periods up to t+n+1.

γ is the discount factor.
Through the above formulas, the Reward can be calculated, and the benefit value of the current traffic segmentation state is obtained for evaluation.
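A short Python sketch of the discounted return R_t defined above (names illustrative, not from the patent):

def discounted_return(avg_throughputs, gamma=0.9):
    # R_t = r_t + gamma*r_{t+1} + ... + gamma^(n+1)*r_{t+n+1}
    return sum((gamma ** i) * r for i, r in enumerate(avg_throughputs))

# Example: whole-network average throughput for cycles t .. t+3
print(discounted_return([320.0, 310.0, 305.0, 318.0], gamma=0.9))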
The agent and the Q-function of the Actor-Critic algorithm are both neural network models; convolutional neural networks or other suitable neural networks can be employed in implementations. As shown in fig. 8, the agent π interacts with the environment to learn, the Q-function scores the result of π's learning, a better policy π' is then learned and assigned to π, and this iterative learning continues until convergence. During training, the parameters of Q are estimated first, and then Q is fixed while the parameters of π are updated. π is updated by gradient ascent, i.e.

θ_π ← θ_π + η · ∇_θπ Q(s_t, π(s_t)).
As shown in fig. 9, the agent (Actor) and the Q-function are both neural networks, and the Actor and Q form one integral network. The output of the Actor is a value in the action set, i.e. a weight vector. The output of Q is a scalar representing a score. When computing the action, a noise-parameter method is adopted, i.e. the action is generated using the Q network parameters with added noise, and the action learned in this way maximizes the resulting Q value.
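For illustration, a PyTorch sketch of the Actor/Q structure of figs. 8 and 9 is given below; the MLP layers, their sizes and the softmax output (which keeps each weight in [0,1] with the weights summing to 1) are assumptions, since the patent also allows convolutional or other networks.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, m_routers, n_fields, num_neighbors):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(m_routers * n_fields, 128), nn.ReLU(),
            nn.Linear(128, num_neighbors),
        )

    def forward(self, state_matrix):                          # (batch, m, n)
        return torch.softmax(self.net(state_matrix), dim=-1)  # weight vector

class Critic(nn.Module):
    def __init__(self, m_routers, n_fields, num_neighbors):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(m_routers * n_fields + num_neighbors, 128), nn.ReLU(),
            nn.Linear(128, 1),                                 # scalar score Q
        )

    def forward(self, state_matrix, weight_vector):
        x = torch.cat([state_matrix.flatten(1), weight_vector], dim=-1)
        return self.net(x)

actor, critic = Actor(5, 6, 4), Critic(5, 6, 4)
s = torch.randn(2, 5, 6)       # batch of two router state matrices
q = critic(s, actor(s))        # score of the proposed weight vectors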
The following provides a set of available model training modes and parameters, which can be adjusted as needed in actual use.
The hyper-parameter configuration of the model is as follows: learning rate 0.001; batch size 32; optimizer Adam; convolution kernel size 3×3.
The training mode is as follows:
(1) Initialize the Q-function Q and the target Q-function Q_target, and let Q_target = Q. Initialize the actor π and the target actor π_target, and let π_target = π.

(2) In each episode, for each time step t:

Based on the actor π, take an action a_t according to the current state s_t, i.e. a_t = π(s_t).

Obtain the Reward r_t and enter the new state s_{t+1}.

Store (s_t, a_t, r_t, s_{t+1}) in the memory.

Fetch a batch of samples (s_t, a_t, r_t, …, s_{t+N}, a_{t+N}, r_{t+N}, s_{t+N+1}) from the memory.

Let y = r_t + γ·r_{t+1} + … + γ^N·r_{t+N} + γ^(N+1)·Q_target(s_{t+N+1}, π_target(s_{t+N+1})).

Update the parameters of Q so that Q(s_t, π(s_t)) approaches y.

Update the parameters of π to maximize Q(s_t, π(s_t)), i.e. by gradient ascent θ_π ← θ_π + η · ∇_θπ Q(s_t, π(s_t)).

After every C time steps, let Q_target = Q and π_target = π.
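A hedged PyTorch sketch of one update of the above procedure follows; it reuses the Actor and Critic classes from the earlier sketch, uses the illustrative hyper-parameters (learning rate 0.001), and its n-step target and periodic target-network copy are reconstructions of steps (1)-(2), not the patent's literal code.

import copy
import torch

gamma, lr, C = 0.9, 0.001, 100
actor, critic = Actor(5, 6, 4), Critic(5, 6, 4)                    # classes from the earlier sketch
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)    # target networks
opt_a = torch.optim.Adam(actor.parameters(), lr=lr)
opt_q = torch.optim.Adam(critic.parameters(), lr=lr)

def update(batch, step):
    # batch = (s_t, a_t, [r_t .. r_{t+N}], s_{t+N+1})
    s_t, a_t, rewards, s_last = batch
    with torch.no_grad():                                          # n-step target y
        y = sum((gamma ** i) * r for i, r in enumerate(rewards))
        y = y + (gamma ** len(rewards)) * critic_t(s_last, actor_t(s_last))
    q_loss = ((critic(s_t, a_t) - y) ** 2).mean()                  # pull Q toward y
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    a_loss = -critic(s_t, actor(s_t)).mean()                       # gradient ascent on Q(s, pi(s))
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()

    if step % C == 0:                                              # refresh the target networks
        actor_t.load_state_dict(actor.state_dict())
        critic_t.load_state_dict(critic.state_dict())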
by the aid of the method, training of the deep reinforcement learning model can be completed, and flow segmentation can be performed according to input.
Through the selection and parameter setting of the neural network model, the reinforcement learning model required to be used in embodiment 1 can be obtained, and a deep reinforcement learning-based flow segmentation model is established and trained.
Example 3:
on the basis of the method for optimal flow segmentation based on deep reinforcement learning provided in the above embodiments 1 to 2, the present invention also provides a system for optimal flow segmentation based on deep reinforcement learning, which can be used for implementing the above method.
Fig. 10 is a schematic diagram of a system architecture according to an embodiment of the present invention, which includes one or more controllers (controllers) and all routers that need path planning. The network topology of fig. 10 is a weighted graph, each circle represents a router node, the lines between nodes represent reachable network links, the weight of each line represents the weight vector of the traffic split by the neighboring router, and the weight dynamically changes along with the re-planning of the optimal traffic split.
The controller can comprise a data storage unit for storing a training data set and an AI algorithm unit for deep reinforcement learning. The controller and the router interact through messages, so that a router state matrix of the network is obtained and stored in a database (database) of the data storage unit. During off-line training, the AI algorithm acquires training data from the database and updates the parameters of the model. When the method is applied online, an AI algorithm calculates the optimal flow segmentation in real time according to a state matrix of a router in a network, and the controller sends the optimal flow segmentation to the router through Segment Routing (SR).
According to the method provided by embodiment 1, the whole time sequence of the traffic segmentation includes two iterations, the first is data acquisition iteration, which is completed by the interaction of the controller and the router, and the acquired data is stored in the data storage unit of the controller. The second is an AI algorithm iteration, which is performed by an AI algorithm unit in the controller. The AI algorithm iteration comprises the training iteration of the GAN, the deep reinforcement learning iteration and the online application iteration.
As shown in the timing diagram of fig. 11, the controller and router may accomplish data collection and traffic splitting by the following actions.
The controller acquires the states of all routers, inputs the state matrixes of all the routers into the trained reinforcement learning model, acquires the output weight vector according to the method of any one of claims 1 to 8, and sends the output weight vector to the routers;
the router performs flow segmentation according to the received weight vector, sends the data packet to the next node, and sends the self state to the controller;
and the controller generates a state matrix of the router at the next moment by using the latest state of the router, adds a state quadruple containing the state matrix at the next moment into a training data set, and trains the reinforcement learning model by using the updated training data set.
In particular, dynamic real-time optimal traffic segmentation may be accomplished using the specific steps shown in fig. 12.
Step 301: The controller sends a router_vector request message to the router.
Step 302: The router sends its own state vector to the controller through a router_vector response message.
Step 303: the controller calculates a router state matrix.
Step 304: The controller sends the router state matrix to a Real Time Learning System (RTLS) through a Get_weight_vector request message.
Step 305: The real-time learning system calls the previously trained AI algorithm to output a traffic segmentation result, and sends the optimal segmentation proportion to the controller through a Get_weight_vector response message.
Step 306: the controller sends the weight vector to the router through an SR message.
Step 307: the router transmits the data packet to the next node according to the flow dividing ratio and informs the controller.
Step 308: The controller sends a router_throughput request message to the router.
Step 309: the router sends its own throughput to the controller.
Step 310: The controller sends a router_vector request message to the router.
Step 311: the router sends the state vector of the next moment to the controller.
Step 312: the controller calculates the average throughput of the entire network.
Step 313: The controller calculates the router state matrix for the next moment and stores the state quadruple [router state matrix, weight vector, average throughput, router state matrix at the next moment] in the database of the data storage unit, thereby extending the training data set in the database.
After one round of path planning is completed through steps 301 to 313, the process returns to step 301 and repeats: training data are generated by continuous iteration and written back to the database for storage, the model is continuously retrained with the stored data, and the online application keeps iterating until the whole system is stable.
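As a toy, self-contained Python sketch of one such controller cycle (steps 301 to 313), the routers are simulated as dictionaries instead of real message exchanges and the model is a stub returning a uniform weight vector; every interface here is an assumption for illustration only.

import random

routers = {rid: {"state": [random.random() for _ in range(6)],
                 "throughput": random.uniform(100, 500)} for rid in (1, 2, 3)}

def predict_weights(state_matrix, num_neighbors=4):
    return [1.0 / num_neighbors] * num_neighbors                             # stub model (steps 304-305)

dataset = []
state_matrix = [routers[r]["state"] for r in sorted(routers)]                # steps 301-303
weights = predict_weights(state_matrix)                                      # steps 304-306 (sent via SR)
# ... routers forward packets according to `weights` (step 307) ...
avg_throughput = sum(routers[r]["throughput"] for r in routers) / len(routers)   # steps 308-312
for r in routers:                                                            # new states after forwarding
    routers[r]["state"] = [random.random() for _ in range(6)]
next_matrix = [routers[r]["state"] for r in sorted(routers)]                 # steps 310-311, 313
dataset.append((state_matrix, weights, avg_throughput, next_matrix))         # step 313: state quadruple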
In the specific implementation process, the controller and the router are required to perform instruction message interaction to complete data acquisition iteration.
The controller sends router state vector acquisition instructions to all routers, each router sends the current state vector of the router to the controller, and the controller generates a state matrix of the whole network router according to the state vectors of all the routers;
the controller sends link state acquisition instructions to all routers, each router sends the current link state of the router to the controller, and the controller determines an Action Set of the reinforcement learning model according to the link state of the routers in the whole network;
when the router performs flow segmentation according to the weight vector, a data packet is transmitted to the next node and then is reported to the controller, the controller sends a throughput submitting instruction and a router state vector collecting instruction to the reported router, the reported router sends the current throughput and state matrix of the router to the controller, and the controller generates a router state matrix at the next moment.
In particular, the data acquisition iteration may be accomplished using the specific steps shown in FIG. 13.
Step 401: The controller sends a router_vector request message to the router.
Step 402: The router sends its own state vector to the controller through a router_vector response message.
Step 403: the controller calculates a router state matrix.
Step 404: the controller sends a link state request message to the router, and the router sends the link state information to the controller through the link state response message.
Step 405: the controller determines an Action Set according to the link state information of the router and randomly selects a weight vector.
Step 406: the controller sends the weight vector to the router through an SR message.
Step 407: the router transmits the data packet to the next node according to the division ratio of the weight vector and informs the controller.
Step 408: The controller sends a router_throughput request message to the router.
Step 409: the router sends the throughput to the controller.
Step 410: The controller sends a router_vector request message to the router.
Step 411: The router sends its state vector at the next moment to the controller; through steps 408 to 411, the controller thus obtains the router's throughput at the current moment and its state vector at the next moment.
Step 412: the controller calculates the average throughput of the network as Reward.
Step 413: the controller calculates the router state matrix for the next time instant.
Step 414: the controller stores the state quadruplet [ router state matrix, weight vector, average throughput, router state matrix at the next moment ] in the database of the controller data storage unit.
After one data acquisition is completed through steps 401 to 414, the process returns to step 401 and repeats in a loop; the controller continuously stores the acquired data in the database to iterate the training data set.
Through the system and the respective actions and interactions of the controller and the router in the system, the method for optimal traffic segmentation based on deep reinforcement learning provided in embodiment 1 can be completed, and optimal traffic segmentation conforming to the current network state is provided in real time.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for optimal flow segmentation based on deep reinforcement learning is characterized in that:
acquiring the current states of all routers needing flow segmentation, and establishing a router state matrix;
iteratively establishing a training data set by using all router state matrixes obtained after each flow segmentation, wherein the output of the training data set is a state quadruplet of a whole network link, and the state quadruplet is specifically [ a router state matrix, a weight vector, average throughput and a next-time router state matrix ], wherein the weight vector is a proportion for each next-hop router flow segmentation;
training a reinforcement learning model by taking the sum of the average throughput of each router in a specified number of cycles after flow planning as Reward, and performing flow segmentation on each router by using the trained model;
and after the data packet is sent to the next router node according to the flow segmentation, acquiring a router state matrix at the next moment, carrying out iterative updating on the training data set, and retraining the model by using the updated training data set.
2. The method for optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein the reinforcement learning model specifically comprises:
the strategies of the reinforcement learning model comprise learning based on a strategy Actor and learning based on a value function Q, the input of the reinforcement learning model is a router state matrix of the whole network, the output of the Actor is a set of all possible weight vectors from a current router to an adjacent router, and Q is a score obtained according to the sum of average throughputs of all routers of the whole network after traffic is segmented.
3. The method of optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein the weight vector specifically comprises:
for each router, the value range of the weight vector of the flow division is [0,1], and the sum of all the weight vectors is 1.
4. The method for optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein the iteratively establishing a training data set using all router state matrices obtained after each traffic segmentation specifically comprises:
randomly selecting a group of weight vectors as initial weight vectors, and taking the state quadruples corresponding to the initial weight vectors as the initial data of the training data set;
and, each time the router transmits a data packet to the next node, adding the state quadruple of the whole network topology to the training data set, thereby iteratively updating the training data set.
5. The method for optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein the iteratively establishing a training data set using all router state matrices obtained after each traffic segmentation further comprises:
generating, for each router, at least one random number in the interval [0,1] such that all the generated random numbers sum to 1, and using the generated random numbers as the initial values of the router's weight vector.
6. The method for optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein taking, as the Reward, the sum of the average throughputs of the routers over a specified number of cycles after traffic planning specifically comprises:
taking the average throughput of all the routers as the Reward of the current cycle, and taking the whole-network average throughput of each cycle, within a specified number of cycles after each traffic segmentation, as the Reward of that traffic segmentation.
7. The method for optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein the establishing a router state matrix specifically comprises:
acquiring the state vector of each router, and forming the state matrix from the state vectors of all the routers, wherein each row of the state matrix is the state vector of one router, each column is one field of the state vector, and the state vector of each router comprises at least two of: the time period, bandwidth, current load, delay, rate and configuration index of the router at the current time point.
8. The method for optimal traffic segmentation based on deep reinforcement learning according to claim 1, wherein the training of the reinforcement learning model specifically comprises:
pre-training the reinforcement learning model by using offline data;
and/or, performing online iteration on the reinforcement learning model by using real-time data.
9. A system for optimal traffic segmentation based on deep reinforcement learning, characterized by comprising a controller and routers, wherein specifically:
the controller acquires the states of all the routers, inputs the current state matrix of all the routers into the trained reinforcement learning model, obtains the output weight vectors according to the method of any one of claims 1 to 8, and sends the output weight vectors to the routers;
each router performs traffic segmentation according to the received weight vector, sends the data packet to the next node, and reports its own state to the controller;
and the controller generates the router state matrix at the next moment from the latest router states, adds the state quadruple containing the state matrix at the next moment to the training data set, and trains the reinforcement learning model with the updated training data set.
10. The system for optimal traffic segmentation based on deep reinforcement learning according to claim 9, wherein the controller acquiring the states of all routers specifically comprises:
the controller sends router state vector acquisition instructions to all routers, each router sends its current state vector to the controller, and the controller generates the state matrix of the whole-network routers according to the state vectors of all the routers;
the controller sends link state acquisition instructions to all routers, each router sends its current link state to the controller, and the controller determines the Action Set of the reinforcement learning model according to the link states of the whole-network routers;
and, when a router performs traffic segmentation according to the weight vector and has transmitted a data packet to the next node, it reports to the controller; the controller sends a throughput submission instruction and a router state vector acquisition instruction to the reporting router; the reporting router sends its current throughput and state vector to the controller; and the controller generates the router state matrix at the next moment.
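As an informal illustration of claims 3, 5, 6 and 7 (and not part of the claims themselves), the following Python sketch shows one way to satisfy the stated constraints: a randomly initialised weight vector whose components lie in [0, 1] and sum to 1, a state matrix with one row per router and one column per state field, and a Reward accumulated as the sum of whole-network average throughputs over a specified number of cycles. All field names and helper signatures are hypothetical, and normalising by the sum is only one possible way to meet the sum-to-1 constraint.

```python
# Informal sketch of the constraints in claims 3, 5, 6 and 7.
# Field names and helper signatures are hypothetical.
import numpy as np

def random_initial_weights(num_next_hops: int) -> np.ndarray:
    """Claim 5: random numbers in [0, 1], rescaled so they sum to 1
    (claim 3: each weight lies in [0, 1] and the weights sum to 1)."""
    w = np.random.rand(num_next_hops)
    return w / w.sum()

def build_state_matrix(router_states: list[dict]) -> np.ndarray:
    """Claim 7: one row per router, one column per state-vector field."""
    fields = ["time_period", "bandwidth", "load", "delay", "rate", "config_index"]
    return np.array([[s[f] for f in fields] for s in router_states], dtype=float)

def reward(avg_throughput_per_cycle: list[float]) -> float:
    """Claim 6: the Reward of a traffic segmentation is the sum of the
    whole-network average throughputs over the specified number of cycles."""
    return float(sum(avg_throughput_per_cycle))
```

Claim 2's division of labour between the policy part (Actor) and the value part (Q) can likewise be sketched as follows; the two networks are assumed to be pre-trained callables, and selecting the highest-scoring candidate is an illustrative reading of the claim rather than the patented training procedure itself.

```python
def select_weight_vectors(actor, q_function, state_matrix):
    """Claim 2 (sketch): the Actor maps the whole-network state matrix to a
    list of candidate weight vectors towards the adjacent routers, and Q
    scores each candidate by the predicted sum of whole-network average
    throughputs after the traffic is segmented; the best candidate is used."""
    candidates = actor(state_matrix)          # candidate weight vectors
    scores = [q_function(state_matrix, c) for c in candidates]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best]
```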
CN202210757920.2A 2022-06-30 2022-06-30 Optimal flow segmentation method and system based on deep reinforcement learning Active CN115150335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210757920.2A CN115150335B (en) 2022-06-30 2022-06-30 Optimal flow segmentation method and system based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115150335A true CN115150335A (en) 2022-10-04
CN115150335B CN115150335B (en) 2023-10-31

Family

ID=83409438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210757920.2A Active CN115150335B (en) 2022-06-30 2022-06-30 Optimal flow segmentation method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115150335B (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120140636A1 (en) * 2010-12-07 2012-06-07 Resende Mauricio Guilherme De Carvalho Methods and apparatus to determine network link weights
US9813356B1 (en) * 2016-02-11 2017-11-07 Amazon Technologies, Inc. Calculating bandwidth information in multi-stage networks
CN107171842A (en) * 2017-05-22 2017-09-15 南京大学 Multi-path transmission protocol jamming control method based on intensified learning
CN107948083A (en) * 2017-11-07 2018-04-20 浙江工商大学 A kind of SDN data centers jamming control method based on enhancing study
KR102032362B1 (en) * 2018-05-15 2019-11-08 경기대학교 산학협력단 Apparatus and method for multicast routing
US20200136957A1 (en) * 2018-10-25 2020-04-30 Ca, Inc. Efficient machine learning for network optimization
CN109922004A (en) * 2019-04-24 2019-06-21 清华大学 The traffic engineering method and device of IPv6 network based on partial deployment Segment routing
US10873533B1 (en) * 2019-09-04 2020-12-22 Cisco Technology, Inc. Traffic class-specific congestion signatures for improving traffic shaping and other network operations
CN111211987A (en) * 2019-11-29 2020-05-29 清华大学 Method and system for dynamically adjusting flow in network, electronic equipment and storage medium
CN111147387A (en) * 2019-12-19 2020-05-12 北京邮电大学 Flow control method and device for hybrid SDN network
WO2021169577A1 (en) * 2020-02-27 2021-09-02 山东大学 Wireless service traffic prediction method based on weighted federated learning
EP3916652A1 (en) * 2020-05-28 2021-12-01 Bayerische Motoren Werke Aktiengesellschaft A method and neural network trained by reinforcement learning to determine a constraint optimal route using a masking function
CN112437020A (en) * 2020-10-30 2021-03-02 天津大学 Data center network load balancing method based on deep reinforcement learning
CN112600759A (en) * 2020-12-10 2021-04-02 东北大学 Multipath traffic scheduling method and system based on deep reinforcement learning under Overlay network
CN112954385A (en) * 2021-01-18 2021-06-11 南京邮电大学 Self-adaptive shunt decision method based on control theory and data driving
CN113518039A (en) * 2021-03-03 2021-10-19 山东大学 Deep reinforcement learning-based resource optimization method and system under SDN architecture
CN114143264A (en) * 2021-11-17 2022-03-04 广西大学 Traffic scheduling method based on reinforcement learning in SRv6 network
CN114221691A (en) * 2021-12-17 2022-03-22 南京工业大学 Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning
CN114629543A (en) * 2022-01-28 2022-06-14 航天东方红卫星有限公司 Satellite network adaptive traffic scheduling method based on deep supervised learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
兰巨龙; 张学帅; 胡宇翔; 孙鹏浩: "QoS optimization of software-defined networking based on deep reinforcement learning", Journal on Communications, no. 12 *
肖扬; 吴家威; 李鉴学; 刘军: "A dynamic routing algorithm based on deep reinforcement learning", Information and Communications Technology and Policy, no. 09 *
邵飞; 伍春; 汪李峰: "A cross-layer congestion control strategy for Ad hoc networks based on multi-agent reinforcement learning", Journal of Electronics & Information Technology, no. 06 *
饶雄: "Research on deep-reinforcement-learning-based WAN traffic scheduling under an SDN architecture", China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022231A (en) * 2022-06-30 2022-09-06 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning
CN115022231B (en) * 2022-06-30 2023-11-03 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning

Also Published As

Publication number Publication date
CN115150335B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Zhan et al. An incentive mechanism design for efficient edge learning by deep reinforcement learning approach
CN112685165B (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN111191934B (en) Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN114697229B (en) Construction method and application of distributed routing planning model
CN111556461A (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
CN113467952B (en) Distributed federal learning collaborative computing method and system
CN113098714B (en) Low-delay network slicing method based on reinforcement learning
CN115484205B (en) Deterministic network routing and queue scheduling method and device
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
CN111917642A (en) SDN intelligent routing data transmission method for distributed deep reinforcement learning
CN111740925A (en) Deep reinforcement learning-based flow scheduling method
Xu et al. Living with artificial intelligence: A paradigm shift toward future network traffic control
CN116669068A (en) GCN-based delay service end-to-end slice deployment method and system
CN115150335B (en) Optimal flow segmentation method and system based on deep reinforcement learning
Cui et al. Multiagent reinforcement learning-based cooperative multitype task offloading strategy for internet of vehicles in B5G/6G network
CN113887748B (en) Online federal learning task allocation method and device, and federal learning method and system
CN115022231B (en) Optimal path planning method and system based on deep reinforcement learning
CN111340192A (en) Network path allocation model training method, path allocation method and device
CN117014355A (en) TSSDN dynamic route decision method based on DDPG deep reinforcement learning algorithm
CN117520956A (en) Two-stage automatic feature engineering method based on reinforcement learning and meta learning
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
CN115314399B (en) Data center flow scheduling method based on inverse reinforcement learning
CN116847425A (en) Multi-resource route optimization method based on high-dimensional data joint optimization
Zhou et al. DRL-Based Workload Allocation for Distributed Coded Machine Learning
CN113157344B (en) DRL-based energy consumption perception task unloading method in mobile edge computing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant