CN110581808B - Congestion control method and system based on deep reinforcement learning - Google Patents

Congestion control method and system based on deep reinforcement learning Download PDF

Info

Publication number
CN110581808B
CN110581808B (application CN201910778639.5A)
Authority
CN
China
Prior art keywords
network
congestion control
reward
value
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910778639.5A
Other languages
Chinese (zh)
Other versions
CN110581808A (en)
Inventor
王菲
廖旭东
马成业
胡海燕
陈艳姣
廖崎臣
张竞之
夏振厂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910778639.5A priority Critical patent/CN110581808B/en
Publication of CN110581808A publication Critical patent/CN110581808A/en
Application granted granted Critical
Publication of CN110581808B publication Critical patent/CN110581808B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/22 Traffic shaping
    • H04L47/225 Determination of shaping rate, e.g. using a moving window
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/27 Evaluation or update of window size, e.g. using information derived from acknowledged [ACK] packets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/28 Flow control; Congestion control in relation to timing considerations
    • H04L47/283 Flow control; Congestion control in relation to timing considerations in response to processing delays, e.g. caused by jitter or round trip time [RTT]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/29 Flow control; Congestion control using a combination of thresholds
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/32 Flow control; Congestion control by discarding or delaying data units, e.g. packets or frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a congestion control method and system based on deep reinforcement learning. The method first initializes the network environment and the model parameters, then trains a congestion control model using state collected from the network, such as the current window, throughput, delay and data sending rate. According to the training results, the congestion control model with the smallest loss function value and the largest reward function value is selected and deployed into the network for congestion control. The method dynamically adjusts the size of the congestion window according to the current network throughput, round-trip delay and packet loss rate, thereby controlling the data sending rate, improving network throughput, and reducing transmission delay and packet loss rate, which reduces the occurrence of network congestion and optimizes network performance.

Description

Congestion control method and system based on deep reinforcement learning
Technical Field
The invention relates to the technical field of computers, in particular to a congestion control method and a congestion control system based on deep reinforcement learning.
Background
The rapid development of next-generation Internet technology and the rapid growth of Internet applications have brought convenience to daily life and improved quality of experience, but they also place new demands on network performance, particularly for network congestion control: the data sending rate must be continuously adjusted according to network indicators such as the number of packets retransmitted after timeout, the average packet delay and the percentage of discarded packets, so as to reduce the occurrence of network congestion, use network resources effectively, improve network performance and provide users with high-quality service experience. As an important means of improving network performance, such as increasing network throughput, reducing data transmission delay and lowering packet loss rate, congestion control has become an important research hotspot and development direction in the field of computer network technology.
In the prior art, congestion control methods fall mainly into three categories: (1) packet-loss-based congestion control methods; (2) delay-based congestion control methods; (3) probing-based congestion control methods.
The inventor of the present application finds that the method of the prior art has at least the following technical problems in the process of implementing the present invention:
The packet-loss-based congestion control method treats packet loss as a congestion signal and halves the sending window when packet loss occurs, so as to avoid congestion. However, when no packets are lost, the buffer is continuously filled and remains in an over-full state for a long time, which causes excessive queuing delay; bandwidth utilization is also poor in network environments with link-level packet loss. The delay-based congestion control method uses delay as the congestion signal, including sending delay, queuing delay and transmission delay. The number of packets in the network can be roughly estimated from the delay, so delay-based congestion control protocols perform well at limiting delay; however, when they share a bottleneck with loss-based data flows, the bandwidth allocation is unfair because they lack competitiveness. The probing-based congestion control method does not define a specific congestion signal; instead it forms a congestion control strategy through probing, with the aid of an evaluation function. However, such control strategies all depend on training data: when the real network conditions differ, performance drops sharply, and some of the probing methods used cannot respond quickly to changes in the network environment.
Therefore, the method in the prior art has the technical problem of poor control effect.
Disclosure of Invention
In view of the above, the present invention provides a congestion control method and system based on deep reinforcement learning, so as to solve or at least partially solve the technical problem of poor control effect of the method in the prior art.
In order to solve the technical problem, the invention provides a congestion control method based on deep reinforcement learning, which comprises the following steps:
step S1: initializing a network environment, and generating network state data, wherein the network state data comprises network delay, transmission rate, sending rate and congestion window size;
step S2: initializing parameters of a congestion control model, wherein the parameters of the congestion control model comprise a reward function, an experience pool size, a neural network structure and a learning rate;
step S3: selecting target network state data from the generated network state data, updating parameters of a neural network according to the target network state data, a reward function and a loss function, and generating different congestion control models;
step S4: and screening out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploying the target congestion control model into the network, and performing congestion control.
In one embodiment, step S1 specifically includes:
step S1.1: establishing connection between two communication parties;
step S1.2: and calculating the network delay, the transmission rate, the sending rate and the size of a congestion window according to the data sent by the two communication parties through the established connection.
In one embodiment, the parameters of the congestion control model in step S2 further include:
parameters of the Q network, parameters of the target Q network, number of rounds, throughput threshold, reward threshold, and maximum number of steps for a round.
In one embodiment, step S3 specifically includes:
step S3.1: according to the target network state data, with probability ε explore and take a random action, or with probability 1-ε select the action with the largest Q value in the current state, argmax_a Q(φ(s_t), a; θ), where ε is a probability variable, Q represents the value calculated by the neural network when taking different actions, a represents a different action, φ(s_t) represents the state at time t, and θ represents the parameters of the neural network;
step S3.2: and updating the neural network parameters in a mode of minimizing a loss function according to the obtained value of the Reward function, and generating different congestion control models.
In one embodiment, the method further comprises:
and judging whether the steps of the current round are finished: if the accumulated reward value of the current round is smaller than the reward threshold and the throughput is smaller than the throughput threshold, or if the number of steps of the current round is greater than or equal to the maximum number of steps of one round, a congestion control model is generated; otherwise the next step of the round starts, wherein each round corresponds to one round of training.
In one embodiment, the neural network in step S3 includes an input layer, a hidden layer and an output layer, wherein the hidden layer includes two convolutional layers, which extract features from the input data set, and two fully connected layers, which integrate the class-discriminative local information produced by the convolutional layers.
In one embodiment, the reward function is of the form:
Reward=α*tput-β*RTT-γ*packet_loss_rate (1)
wherein Reward represents the reward value, tput represents the throughput, RTT represents the network delay, packet_loss_rate represents the packet loss rate, which is the ratio of the number of lost packets to the number of sent packets, and α, β and γ are preset parameters.
In one embodiment, the throughput is calculated as follows:
tput=0.008*(delivered-last_delivered)/max(1,duration) (2)
wherein tput represents the throughput, duration represents the total duration of the current data stream, delivered represents the current data transmission amount, and last_delivered represents the previous data transmission amount.
Based on the same inventive concept, the second aspect of the present invention provides a congestion control system based on deep reinforcement learning, which includes:
the device comprises a parameter initialization module, a congestion control module and a congestion control module, wherein the parameter initialization module is used for initializing a network environment and generating network state data, and the network state data comprises network delay, transmission rate, sending rate and congestion window size;
the system comprises an environment initialization module, a congestion control module and a management module, wherein the environment initialization module is used for initializing parameters of the congestion control model, and the parameters of the congestion control model comprise a reward function, an experience pool size, a neural network structure and a learning rate;
the model generation module is used for selecting target network state data from the generated network state data, updating parameters of the neural network according to the target network state data, the reward function and the loss function, and generating different congestion control models;
and the congestion control module is used for screening out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploying the target congestion control model into the network and performing congestion control.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a congestion control method based on deep reinforcement learning, which comprises the steps of firstly, initializing a network environment and obtaining network state data, wherein the network state data comprises network delay, transmission rate, sending rate and congestion window size; then initializing parameters of a congestion control model, wherein the parameters of the congestion control model comprise a reward function, an experience pool size, a neural network structure and a learning rate; then, selecting target network state data from the generated network state data, updating parameters of the neural network according to the target network state data, the reward function and the loss function, and generating different congestion control models; and finally, screening out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploying the target congestion control model into the network, and performing congestion control.
The invention can train the congestion control model by utilizing the selected network state data (network time delay, transmission rate, sending rate and congestion window size) and the like, select the congestion control model with the minimum model loss function value and the maximum reward function value according to the training result, and then deploy the model into the network to control the congestion. The method dynamically adjusts the size of the congestion window according to the current network throughput, round-trip delay and data packet loss rate, thereby controlling the sending rate of data, reducing the occurrence of network congestion and achieving the purpose of optimizing network performance. The technical problem of poor control effect in the prior art is solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a congestion control method based on deep reinforcement learning according to the present invention;
FIG. 2 is a diagram illustrating the overall steps of a congestion control method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an initialization run environment of an embodiment of the present invention;
fig. 4 is a schematic diagram of adaptive adjustment of a congestion window according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of model parameter update according to an embodiment of the present invention;
fig. 6 is a block diagram of a congestion control system based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a congestion control method and system based on deep reinforcement learning, aiming at the technical problem of poor control effect of the method in the prior art, so as to achieve the purpose of improving the congestion control effect.
In order to achieve the above purpose, the main concept of the invention is as follows:
the invention provides a congestion control method and system based on deep reinforcement learning, which are mainly based on reinforcement learning and utilize performance indexes such as current window, throughput, time delay, data sending rate and the like in a network. The existing network congestion control technology generally controls the sending rate based on the Time delay (Round-Trip Time, RTT), the packet loss rate (Lose rate), and the like, and although the network congestion can be solved to a certain extent, the congestion window cannot be adjusted according to the real network environment, and the overall performance is not as good as that of the present invention. The method of the invention can fully utilize the performance index of the network, generate the congestion control model through deep reinforcement learning, and generate appropriate values (the size and the direction of the congestion window) to adjust the size and the direction of the network congestion window, so as to improve the network throughput, reduce the packet loss rate and the time delay, and further solve the network congestion. By the method and the device, better network performance can be obtained, and the obtained result is more scientific and accurate.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The present embodiment provides a congestion control method based on deep reinforcement learning, please refer to fig. 1, and the method includes:
step S1: initializing a network environment, and generating network state data, wherein the network state data comprises a network delay, a transmission rate, a sending rate, and a congestion window size.
Specifically, step S1 is to initialize parameters of the computer network and then generate network status data.
In a specific implementation, step S1 specifically includes:
step S1.1: establishing connection between two communication parties;
step S1.2: and calculating the network delay, the transmission rate, the sending rate and the size of a congestion window according to the data sent by the two communication parties through the established connection.
Specifically, before the program starts, it is necessary to initialize a network environment, establish a connection between both communication parties, calculate status data such as a network delay (RTT), a transmission rate (delivery rate), a sending rate (sending rate), and a congestion window size (cwnd) of a network by data transmission of both communication parties, and store the data in an experience pool. After a certain amount of data is stored in the experience pool, a certain amount of state data can be randomly taken from the experience pool to prepare for the operation of each step (i.e., subsequent training).
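As a minimal illustration of the experience-pool mechanism just described, a Python sketch follows; the class name, pool capacity and method names are assumptions of this sketch, not part of the invention's specification.

```python
import random
from collections import deque

# Stores observed network state tuples and serves random mini-batches
# for training, as described above.
class ExperiencePool:
    def __init__(self, capacity=10000):        # capacity assumed
        self.pool = deque(maxlen=capacity)     # oldest entries are dropped first

    def store(self, rtt, delivery_rate, sending_rate, cwnd):
        self.pool.append((rtt, delivery_rate, sending_rate, cwnd))

    def sample(self, n=16):
        # Randomly take n groups of state data, e.g. history_length = 16
        return random.sample(list(self.pool), n)
```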
Step S2: initializing parameters of a congestion control model, wherein the parameters of the congestion control model comprise a reward function, an experience pool size, a neural network structure and a learning rate.
The parameters of the congestion control model in step S2 further include:
parameters of the Q network, parameters of the target Q network, number of rounds, throughput threshold, reward threshold, and maximum number of steps for a round.
Specifically, before training the model, the experience pool, the parameters of the Q network and the parameters of the target Q network must first be initialized. A round is then initialized: a throughput threshold bad-tput, a reward threshold bad-reward and a maximum number of steps per round max-step are set.
Step S3: and selecting target network state data from the generated network state data, updating parameters of the neural network according to the target network state data, the reward function and the loss function, and generating different congestion control models.
Specifically, data may be randomly selected as the target network status data from the network status data generated in step S1. And then training the neural network by using the target network state data, wherein the reward function and the loss function are used for adjusting and updating parameters of the neural network so as to obtain different congestion control models.
In one embodiment, step S3 specifically includes:
step S3.1: according to the target network state data, with probability ε explore and take a random action, or with probability 1-ε select the action with the largest Q value in the current state, argmax_a Q(φ(s_t), a; θ), where ε is a probability variable, Q represents the value calculated by the neural network when taking different actions, a represents a different action, φ(s_t) represents the state at time t, and θ represents the parameters of the neural network;
step S3.2: and updating the neural network parameters in a mode of minimizing a loss function according to the obtained value of the Reward function, and generating different congestion control models.
Specifically, in reinforcement learning, congestion control changes the window size through actions, and the basis for deciding which action to take is the state data. The two modes above are the two action-selection mechanisms of DQN (deep Q network). Reinforcement learning defines an environment in which an Agent takes actions so as to maximize the reward.
When the parameters are updated, the Q network parameters and the target Q parameters must be adjusted according to the reward-function feedback received by the Agent, and each is updated in its own way. During optimization, after a certain number of time steps, the target Q parameters of the target network are updated to the parameters of the currently trained network (eval net). The Q network and the target Q network are the two networks in the DQN model of the present invention; they have the same structure but differ in update manner and function.
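As an illustration of the ε-greedy mechanism of step S3.1, a minimal Python sketch follows; the function name and passing the Q values as a list are assumptions of this sketch.

```python
import random

# Epsilon-greedy action selection: explore with probability epsilon,
# otherwise take argmax_a Q(phi(s_t), a; theta).
def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```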
To optimize the neural network of the present invention, a small batch of data in the experience pool can be randomly extracted at each optimization, and the optimization is by minimizing a loss function L (θ), which is defined as follows:
L(θ) = E[ (y - Q(s_t, a_t | θ))² ]

where the target value y is

y = r_t + γ·max_a Q⁻(s_(t+1), a | θ⁻)

Here γ denotes the discount factor, Q⁻(·|θ⁻) is the network with the parameters θ⁻ obtained at the last target-network update (i.e., the network used to calculate the actual value of Q), Q(s_t, a_t | θ) is the estimated value of Q, s_t and a_t respectively denote the state at time t and the action taken, r_t is the value of the reward function obtained, max takes the maximum Q value over actions a, and E is the expectation over taking a in the current state s. The smaller the difference between the actual value of Q and the estimated value of Q, the better. The loss function of the invention is minimized by a stochastic gradient method, thereby effectively achieving the purpose of optimizing the network.
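A minimal PyTorch sketch of one optimization step minimizing L(θ) follows; the discount factor value, the function signature and the mini-batch layout are assumptions of this sketch, and the Q-network architecture is the one described later in this section.

```python
import torch
import torch.nn.functional as F

# One stochastic-gradient step on L(theta), as described above.
def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):  # gamma assumed
    states, actions, rewards, next_states = batch
    # Q(s_t, a_t | theta): estimated Q value of the action actually taken
    q_estimate = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y = r_t + gamma * max_a Q^-(s_(t+1), a | theta^-)
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_estimate, target)  # L(theta) = E[(y - Q)^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```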
In one embodiment, the method further comprises:
and judging whether the steps of the current round are finished: if the accumulated reward value of the current round is smaller than the reward threshold and the throughput is smaller than the throughput threshold, or if the number of steps of the current round is greater than or equal to the maximum number of steps of one round, a congestion control model is generated; otherwise the next step of the round starts, wherein each round corresponds to one round of training.
In one embodiment, the reward function is of the form:
Reward=α*tput-β*RTT-γ*packet_loss_rate (1)
wherein Reward represents the reward value, tput represents the throughput, RTT represents the network delay, packet_loss_rate represents the packet loss rate, which is the ratio of the number of lost packets to the number of sent packets, and α, β and γ are preset parameters.
In one embodiment, the throughput is calculated as follows:
tput=0.008*(delivered-last_delivered)/max(1,duration) (2)
wherein tput represents the throughput, duration represents the total duration of the current data stream, delivered represents the current data transmission amount, and last_delivered represents the previous data transmission amount.
Specifically, it is judged whether the steps of the current round are finished: if reward < bad-reward and tput < bad-tput, or if step-count (the number of steps in the current round) >= max-step, a congestion control model is generated and saved. A new round is then started and step-count is reset to 0. Otherwise the next step of the round starts, and the step count is increased by step-count = step-count + 1. The program runs continuously, generating different congestion control models.
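A minimal sketch of this round-termination check follows, using the threshold names from the text; wrapping it in a helper function is an assumption of this sketch.

```python
# Ends the round when both reward and throughput are poor, or when the
# per-round step budget is exhausted, as described above.
def round_finished(reward, tput, step_count, bad_reward, bad_tput, max_step):
    return (reward < bad_reward and tput < bad_tput) or step_count >= max_step
```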
Step S4: and screening out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploying the target congestion control model into the network, and performing congestion control.
Specifically, according to the Loss and Reward values, the model with a low Loss value and a high Reward value over a period of time is selected as the optimal model and deployed in the environment. According to the state of the network, the optimal action for adjusting the size and direction of the congestion window is then selected. The implementation is as follows:
The current network state is acquired, and an action such as cwnd = cwnd * 2 is obtained from it. This action is then executed to double the current congestion window. The sender judges whether an ack (acknowledgement message) has been obtained from the receiver; if not, it waits until one arrives. After the ack is obtained, the state and the reward are updated: the state update observes the network delay (RTT), transmission rate (delivery rate), sending rate (sending rate) and congestion window size (cwnd) of the network link as in step S1, and the reward update is calculated from the reward function; the flow then ends.
In one embodiment, the neural network in step S3 includes an input layer, a hidden layer and an output layer, wherein the hidden layer includes two convolutional layers, which extract features from the input data set, and two fully connected layers, which integrate the class-discriminative local information produced by the convolutional layers.
Specifically, when training the neural network, a subset of samples in the experience pool may be randomly extracted to train the model (the experience pool is a container that stores historical data). First, the number of extracted samples is fixed to history_length; in the present invention history_length is 16, that is, 16 groups of state data are randomly selected from the experience pool each time and used as the input of the input layer of the deep neural network. The output of the input layer then enters the first convolutional layer of the hidden layer. The first convolutional layer has 16 input channels and 32 output channels with a ReLU activation function, and its output serves as the input of the second convolutional layer, which has 32 input channels and 64 output channels, also with a ReLU activation. After processing by the two convolutional layers, the extracted feature data is flattened (Flatten); the flattened data is input to the first fully connected layer, whose output passes through a ReLU and then enters the second fully connected layer. The output of the second fully connected layer is the Q value corresponding to each of the actions required by the invention.
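A possible PyTorch rendering of this architecture is sketched below. The channel counts (16 to 32 to 64), the ReLU activations, the flatten step and the two fully connected layers follow the description above; the convolution kernel size, the hidden width of the first fully connected layer, and the use of one-dimensional convolutions over the four state features are assumptions of this sketch.

```python
import torch
import torch.nn as nn

HISTORY_LENGTH = 16   # groups of state data sampled from the experience pool
STATE_FEATURES = 4    # RTT, delivery rate, sending rate, cwnd
NUM_ACTIONS = 5       # the five window-adjustment actions

class QNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(HISTORY_LENGTH, 32, kernel_size=2)  # 16 -> 32 channels
        self.conv2 = nn.Conv1d(32, 64, kernel_size=2)              # 32 -> 64 channels
        conv_out_len = STATE_FEATURES - 2  # length after two kernel-2 convolutions
        self.fc1 = nn.Linear(64 * conv_out_len, 128)               # hidden width assumed
        self.fc2 = nn.Linear(128, NUM_ACTIONS)                     # one Q value per action

    def forward(self, x):
        # x: (batch, HISTORY_LENGTH, STATE_FEATURES)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.flatten(x, start_dim=1)  # the "Flatten" step
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                 # Q values for the five actions
```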
In order to more clearly illustrate the implementation and beneficial effects of the method provided by the invention, the following detailed description is given by specific examples.
Please refer to fig. 2, which shows the overall steps of the congestion control method. First the network environment is initialized and network state data is generated; it is then judged whether a model needs to be trained. If not, the left branch is executed; otherwise the right branch is executed. That is, either training a model or loading a model to run on the actual link is selected. Training a model means training the reinforcement-learning agent so as to produce multiple models; loading a model means selecting a trained model to perform congestion control on the network.
The left branch corresponds to selecting a loading model to run on an actual link: the network environment is set up, state data is selected and run to obtain the corresponding action, and the congestion window size is then adjusted.
The right branch corresponds to training a model. First the model parameters must be initialized and state data randomly selected. During model training, an Agent observes the Sender and the Receiver in the Environment and feeds the observed data states into the DQN neural network. The DQN learns continuously from the data it sends and the reward fed back by the Environment, and adopts an action as the means of adjusting the congestion window. After the Environment takes the action, it feeds back to the Agent a reward as the reward or punishment for the previous step, which measures how well the action given by the Agent controlled congestion in that step. The value of Reward is calculated from formula (1).
Reward=α*tput-β*RTT-γ*packet_loss_rate (1)
wherein tput is the throughput of the network, calculated by formula (2); RTT is the round-trip delay of the network, calculated by formula (4); and packet_loss_rate is calculated by formula (3). Experiments show that the network performs best when α = 0.6, β = 0.2 and γ = 0.2.
tput=0.008*(delivered-last_delivered)/max(1,duration) (2)
wherein duration represents the total time for which the current data stream has been open, and delivered and last_delivered represent the amounts transmitted at the current and the previous measurement, respectively;
packet_loss_rate=loss_num/send_num (3)
where loss_num and send_num represent the number of packets lost and sent, respectively.
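Formulas (1)-(3) can be combined into a single reward computation, sketched below with the weights α = 0.6, β = 0.2, γ = 0.2 reported above; the function name and the guard against a zero send count are assumptions of this sketch.

```python
# A minimal sketch of the reward computation from formulas (1)-(3);
# variable names mirror the formulas, and the default weights follow
# the values reported in the text.
def compute_reward(delivered, last_delivered, duration,
                   rtt, loss_num, send_num,
                   alpha=0.6, beta=0.2, gamma=0.2):
    tput = 0.008 * (delivered - last_delivered) / max(1, duration)  # formula (2)
    packet_loss_rate = loss_num / max(1, send_num)                  # formula (3); zero guard assumed
    return alpha * tput - beta * rtt - gamma * packet_loss_rate     # formula (1)
```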
The training process is as follows: the first round must be initialized; the number of steps differs from round to round and is judged according to the specific situation, and each step is then run. Referring to fig. 4: Flatten indicates flattening, the activation function is ReLU, and the hidden layer contains two convolutional layers and two fully connected layers. After processing by the two convolutional layers of the hidden layer, the extracted feature data is flattened (Flatten); the flattened data is then input to the first fully connected layer, whose output, after the ReLU activation, enters the second fully connected layer. The output of the second fully connected layer is the Q value corresponding to each of the actions required by the invention.
The action is obtained from the neural network according to the current state. Specifically, an action list corresponding to the five actions set by the invention was first designed through experience and repeated experiments: ["+0.0", "-100.0", "+100.0", "*2", "/2.0"]. These five actions represent cwnd_(t+1) = cwnd_t, cwnd_(t+1) = cwnd_t - 100, cwnd_(t+1) = cwnd_t + 100, cwnd_(t+1) = cwnd_t * 2 and cwnd_(t+1) = cwnd_t / 2, respectively. After each training step, the Q value corresponding to each action in the current state is obtained from the output of the deep neural network, and the action with the largest Q value yields an index into the action list, which determines the action the agent selects. For example, if in this round of training the neural network learns that the Q value of the second action is the largest (that is, the Q value for "-100.0" is the largest among the network's outputs), the index into the action list gives action_list[1], i.e. "-100.0", so the agent takes the action cwnd_(t+1) = cwnd_t - 100, achieving the purpose of adjusting the window. After the action is obtained, it is executed. Whether the receiver's ack has been obtained is then judged; if not, the sender keeps waiting until it is obtained. After the ack is obtained, the state needs to be updated, which means observing the four parameters of the network link (the network delay, transmission rate, sending rate and congestion window size mentioned above); the reward update is calculated from the reward function using the current state.
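A minimal sketch of applying the selected action index to the congestion window follows; the helper name and the string-parsing approach are assumptions of this sketch.

```python
# Applies one of the five window-adjustment actions described above.
ACTIONS = ["+0.0", "-100.0", "+100.0", "*2", "/2.0"]

def apply_action(cwnd, index):
    action = ACTIONS[index]                  # e.g. index 1 -> "-100.0"
    op, value = action[0], float(action[1:])
    if op == "+":
        return cwnd + value                  # cwnd_(t+1) = cwnd_t + 100 (or + 0)
    if op == "-":
        return cwnd - value                  # cwnd_(t+1) = cwnd_t - 100
    if op == "*":
        return cwnd * value                  # cwnd_(t+1) = cwnd_t * 2
    return cwnd / value                      # cwnd_(t+1) = cwnd_t / 2
```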
The current network state is acquired, and an action is obtained from it. The action is then executed, here reducing the current congestion window by 100. Whether the receiver's ack has been obtained is judged; if not, the sender keeps waiting until it is obtained. After the ack is obtained, the state and the reward are updated: the state update observes the four parameters of the network link from step S1, and the reward update is calculated from the reward function; the flow then ends.
Referring to fig. 3, which shows the flowchart for initializing the running environment: the state fed into the neural network consists of the RTT, the delivery rate, the sending rate and the congestion window size (cwnd), where the first three need to be calculated by equations (4), (5) and (6). The Q value of each action is then calculated by running the neural network, and the action corresponding to the largest Q value is selected as the current adjustment mode.
RTT=float(curr_time_ms-ack.send_ts) (4)
where curr_time_ms represents the time at which the sender received the current ack, and ack.send_ts represents the time at which the packet corresponding to the ack was sent;
delivery_rate=0.008*(delivered-ack.delivered)/max(1,delivered_time-ack.delivered_time) (5)
where delivered and ack.delivered respectively represent the number of packets delivered and the number of packets delivered as recorded in the ack, and delivered_time and ack.delivered_time respectively represent the corresponding delivery times;
send_rate=0.008*(self.sent_bytes-ack.sent_bytes)/max(1,self.rtt) (6)
where self.sent_bytes represents the total number of bytes sent, and ack.sent_bytes represents the number of bytes that had been sent when the acked packet was sent.
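Taken together, formulas (4)-(6) can be sketched as one state-computation helper, shown below; treating the ack fields as attributes of a record, and reusing the freshly computed RTT in formula (6), are assumptions of this sketch.

```python
# A sketch of the state computation from formulas (4)-(6); names mirror
# the fields used in the text.
def compute_state(curr_time_ms, ack, delivered, delivered_time, sent_bytes):
    rtt = float(curr_time_ms - ack.send_ts)                          # formula (4)
    delivery_rate = 0.008 * (delivered - ack.delivered) / max(
        1, delivered_time - ack.delivered_time)                      # formula (5)
    send_rate = 0.008 * (sent_bytes - ack.sent_bytes) / max(1, rtt)  # formula (6); rtt reuse assumed
    return rtt, delivery_rate, send_rate
```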
An observation program is then run. Specifically, as shown in fig. 5:
The buffers and the history data are updated; they hold the relevant parameters from previous steps. It is then determined whether the learning mini-batch has been reached, that is, whether the minimum amount of learning data is available. This is judged from the current learning step counter learn_step_counter, the configured learning start step learn_start and the training frequency train_frequency: if learn_step_counter > learn_start and learn_step_counter % train_frequency == 0, the neural network starts learning. Next, it is determined whether the condition for updating target-Q has been reached; specifically, an update frequency target_q_update_step is set, and if learn_step_counter % target_q_update_step == 0, the q-value parameters of the target network are updated.
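These two scheduling conditions can be sketched as a small helper, shown below; combining them into one function is an assumption of this sketch, while the counter names follow the text.

```python
# Decides when to run a learning step and when to refresh the target
# Q network, per the conditions described above.
def training_schedule(learn_step_counter, learn_start,
                      train_frequency, target_q_update_step):
    should_learn = (learn_step_counter > learn_start
                    and learn_step_counter % train_frequency == 0)
    should_update_target = learn_step_counter % target_q_update_step == 0
    return should_learn, should_update_target
```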
Generally, compared with the prior art, the technical scheme of the invention has the following advantages and beneficial effects:
the invention provides a congestion control method and system based on deep reinforcement learning by using performance indexes such as a current window, throughput, time delay, data sending rate and the like in a network based on reinforcement learning. The existing network congestion control technology generally modifies the sending rate based on the Time delay (Round-Trip Time, RTT), the packet loss rate (Lose rate), and the like, and although the network congestion can be solved to a certain extent, the congestion window cannot be adjusted according to the real network adjustment, and the overall performance is not as good as that of the present invention. The method of the invention can fully utilize the performance index of the network, adopts a proper value through deep reinforcement learning to adjust the size and the direction of the network congestion window, thereby improving the network throughput, reducing the packet loss rate and the time delay and further solving the network congestion. By the method and the device, better network performance can be obtained, and the obtained result is more scientific and accurate.
Example two
Based on the same inventive concept, the present embodiment provides a congestion control system based on deep reinforcement learning, please refer to fig. 6, the system includes:
a parameter initialization module 201, configured to initialize a network environment and generate network status data, where the network status data includes a network delay, a transmission rate, a sending rate, and a size of a congestion window;
the environment initialization module 202 is configured to initialize parameters of a congestion control model, where the parameters of the congestion control model include a reward function, an experience pool size, a neural network structure, and a learning rate;
the model generation module 203 is used for selecting target network state data from the generated network state data, updating parameters of the neural network according to the target network state data, the reward function and the loss function, and generating different congestion control models;
and the congestion control module 204 is configured to screen out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploy the target congestion control model to the network, and perform congestion control.
In one embodiment, the environment initialization module 202 is specifically configured to perform the following steps:
step S1.1: establishing connection between two communication parties;
step S1.2: and calculating the network delay, the transmission rate, the sending rate and the size of a congestion window according to the data sent by the two communication parties through the established connection.
In one embodiment, the parameters of the congestion control model further comprise:
parameters of the Q network, parameters of the target Q network, number of rounds, throughput threshold, reward threshold, and maximum number of steps for a round.
In one embodiment, the model generation module 203 is specifically configured to perform the following steps:
step S3.1: according to the target network state data, with probability ε explore and take a random action, or with probability 1-ε select the action with the largest Q value in the current state, argmax_a Q(φ(s_t), a; θ), where ε is a probability variable, Q represents the value calculated by the neural network when taking different actions, a represents a different action, φ(s_t) represents the state at time t, and θ represents the parameters of the neural network;
step S3.2: and updating the neural network parameters in a mode of minimizing a loss function according to the obtained value of the Reward function, and generating different congestion control models.
In one embodiment, the system further comprises a determining module configured to:
and judging whether the steps of the current round are finished: if the accumulated reward value of the current round is smaller than the reward threshold and the throughput is smaller than the throughput threshold, or if the number of steps of the current round is greater than or equal to the maximum number of steps of one round, a congestion control model is generated; otherwise the next step of the round starts, wherein each round corresponds to one round of training.
In one embodiment, the neural network comprises an input layer, a hidden layer and an output layer, wherein the hidden layer comprises two convolutional layers, which extract features from the input data set, and two fully connected layers, which integrate the class-discriminative local information produced by the convolutional layers.
In one embodiment, the reward function is of the form:
Reward=α*tput-β*RTT-γ*packet_loss_rate (1)
wherein Reward represents the reward value, tput represents the throughput, RTT represents the network delay, packet_loss_rate represents the packet loss rate, which is the ratio of the number of lost packets to the number of sent packets, and α, β and γ are preset parameters.
In one embodiment, the throughput is calculated as follows:
tput=0.008*(delivered-last_delivered)/max(1,duration) (2)
wherein tput represents the throughput, duration represents the total duration of the current data stream, delivered represents the current data transmission amount, and last_delivered represents the previous data transmission amount.
Since the system described in the second embodiment of the present invention is a system for implementing the congestion control method based on deep reinforcement learning in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the system based on the method described in the first embodiment of the present invention, and thus details thereof are not described herein. All systems adopted by the method of the first embodiment of the present invention are within the intended protection scope of the present invention.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (9)

1. A congestion control method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: initializing a network environment, and generating network state data, wherein the network state data comprises network delay, transmission rate, sending rate and congestion window size;
step S2: initializing parameters of a congestion control model, wherein the parameters of the congestion control model comprise a reward function, an experience pool size, a neural network structure and a learning rate;
step S3: selecting target network state data from the generated network state data, updating parameters of a neural network according to the target network state data, a reward function and a loss function, and generating different congestion control models;
step S4: and screening out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploying the target congestion control model into the network, and performing congestion control.
2. The method according to claim 1, wherein step S1 specifically comprises:
step S1.1: establishing connection between two communication parties;
step S1.2: and calculating the network delay, the transmission rate, the sending rate and the size of a congestion window according to the data sent by the two communication parties through the established connection.
3. The method of claim 1, wherein the parameters of the congestion control model in step S2 further comprise:
parameters of the Q network, parameters of the target Q network, number of rounds, throughput threshold, reward threshold, and maximum number of steps for a round.
4. The method according to claim 3, wherein step S3 specifically comprises:
step S3.1: according to the target network state data, with probability ε explore and take a random action, or with probability 1-ε select the action with the largest Q value in the current state, argmax_a Q(φ(s_t), a; θ), where ε is a probability variable, Q represents the value calculated by the neural network when taking different actions, a represents a different action, φ(s_t) represents the state at time t, and θ represents a parameter of the neural network;
step S3.2: and updating parameters of the neural network in a mode of minimizing a loss function according to the acquired value of the reward function, and generating different congestion control models.
5. The method of claim 4, wherein the method further comprises:
and judging whether the steps of the current round are finished: if the accumulated reward value of the current round is smaller than the reward threshold and the throughput is smaller than the throughput threshold, or if the number of steps of the current round is greater than or equal to the maximum number of steps of one round, a congestion control model is generated; otherwise the next step of the round starts, wherein each round corresponds to one round of training.
6. The method of claim 1, wherein the neural network in step S3 comprises an input layer, a hidden layer and an output layer, wherein the hidden layer comprises two convolutional layers, which extract features from the input data set, and two fully connected layers, which integrate the class-discriminative local information produced by the convolutional layers.
7. The method of claim 1, wherein the reward function is of the form:
Reward=α*tput-β*RTT-γ*packet_loss_rate
wherein tput represents throughput, Reward represents the reward value, RTT represents the network delay, packet_loss_rate represents the packet loss rate, which is the ratio of the number of lost packets to the number of sent packets, and α, β and γ are preset parameters.
8. The method of claim 5, wherein the throughput is calculated as follows:
tput=0.008*(delivered-last_delivered)/max(1,duration)
wherein tput represents the throughput, duration represents the total duration of the current data stream, delivered represents the current data transmission amount, and last_delivered represents the previous data transmission amount.
9. A congestion control system based on deep reinforcement learning, comprising:
the device comprises a parameter initialization module, a congestion control module and a congestion control module, wherein the parameter initialization module is used for initializing a network environment and generating network state data, and the network state data comprises network delay, transmission rate, sending rate and congestion window size;
the system comprises an environment initialization module, a congestion control module and a management module, wherein the environment initialization module is used for initializing parameters of the congestion control model, and the parameters of the congestion control model comprise a reward function, an experience pool size, a neural network structure and a learning rate;
the model generation module is used for selecting target network state data from the generated network state data, updating parameters of the neural network according to the target network state data, the reward function and the loss function, and generating different congestion control models;
and the congestion control module is used for screening out an optimal model as a target congestion control model according to the value of the reward function and the value of the loss function, deploying the target congestion control model into the network and performing congestion control.
CN201910778639.5A 2019-08-22 2019-08-22 Congestion control method and system based on deep reinforcement learning Active CN110581808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910778639.5A CN110581808B (en) 2019-08-22 2019-08-22 Congestion control method and system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910778639.5A CN110581808B (en) 2019-08-22 2019-08-22 Congestion control method and system based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN110581808A CN110581808A (en) 2019-12-17
CN110581808B true CN110581808B (en) 2021-06-15

Family

ID=68811694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910778639.5A Active CN110581808B (en) 2019-08-22 2019-08-22 Congestion control method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN110581808B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111092823B (en) * 2019-12-25 2021-03-26 深圳大学 Method and system for adaptively adjusting congestion control initial window
CN111199272B (en) * 2019-12-30 2023-11-03 同济大学 Self-adaptive scheduling method for intelligent workshops
CN112422441B (en) * 2020-03-05 2022-10-04 上海哔哩哔哩科技有限公司 Congestion control method and system based on QUIC transmission protocol
CN111372284B (en) * 2020-03-10 2022-07-29 中国联合网络通信集团有限公司 Congestion processing method and device
CN113572694B (en) 2020-04-29 2023-10-13 华为技术有限公司 Congestion control method, device and system and computer storage medium
CN113873571A (en) * 2020-06-30 2021-12-31 华为技术有限公司 Congestion control method and corresponding equipment
CN111818570B (en) * 2020-07-25 2022-04-01 清华大学 Intelligent congestion control method and system for real network environment
CN112104563B (en) * 2020-08-12 2022-08-30 新华三技术有限公司 Congestion control method and device
CN112311690B (en) * 2020-09-25 2022-12-06 福建星网智慧科技有限公司 AI-based congestion control method, device, equipment and medium
CN112383485B (en) * 2020-10-30 2022-08-19 新华三技术有限公司 Network congestion control method and device
CN112469079B (en) * 2020-11-05 2022-04-22 南京大学 Novel congestion control method combining deep reinforcement learning and traditional congestion control
CN112468265B (en) * 2020-11-10 2022-04-22 南京大学 Wireless local area network modulation coding self-adaptive selection method based on reinforcement learning and wireless equipment
CN112822230B (en) * 2020-12-28 2022-03-25 南京大学 Method and system for setting initial rate of sending end based on probability
CN112714074B (en) * 2020-12-29 2023-03-31 西安交通大学 Intelligent TCP congestion control method, system, equipment and storage medium
CN112770353B (en) * 2020-12-30 2022-10-28 武汉大学 Method and device for training congestion control model and method and device for controlling congestion
CN112822718B (en) * 2020-12-31 2021-10-12 南通大学 Packet transmission method and system based on reinforcement learning and stream coding driving
CN112770357B (en) * 2021-01-08 2022-04-26 浙江大学 Wireless network congestion control method based on deep reinforcement learning
US20220231933A1 (en) * 2021-01-20 2022-07-21 Nvidia Corporation Performing network congestion control utilizing reinforcement learning
CN113300970B (en) * 2021-01-22 2022-05-27 青岛大学 TCP congestion dynamic control method and device based on deep learning
CN113079104B (en) * 2021-03-22 2022-09-30 新华三技术有限公司 Network congestion control method, device and equipment
CN113315715B (en) * 2021-04-07 2024-01-05 北京邮电大学 Distributed intra-network congestion control method based on QMIX
CN113315716B (en) * 2021-05-28 2023-05-02 北京达佳互联信息技术有限公司 Training method and equipment of congestion control model and congestion control method and equipment
CN113825171B (en) * 2021-09-30 2023-07-28 新华三技术有限公司 Network congestion control method, device, equipment and medium
CN114553836B (en) * 2022-01-12 2024-02-20 中国科学院信息工程研究所 Data block transmission punctuality improving method based on reinforcement learning
CN114567597B (en) * 2022-02-21 2023-12-19 深圳市亦青藤电子科技有限公司 Congestion control method and device based on deep reinforcement learning in Internet of things
CN114745337B (en) * 2022-03-03 2023-11-28 武汉大学 Real-time congestion control method based on deep reinforcement learning
US20230300671A1 (en) * 2022-03-18 2023-09-21 Qualcomm Incorporated Downlink congestion control optimization
CN114785757B (en) * 2022-03-31 2023-10-20 东北大学 Multipath transmission control method for real-time conversation service
CN114866489A (en) * 2022-04-28 2022-08-05 清华大学 Congestion control method and device and training method and device of congestion control model
CN116055406B (en) * 2023-01-10 2024-05-03 中国联合网络通信集团有限公司 Training method and device for congestion window prediction model
CN117651024A (en) * 2023-12-01 2024-03-05 北京基流科技有限公司 Method for predicting network link congestion of data center

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107171842A (en) * 2017-05-22 2017-09-15 南京大学 Multi-path transmission protocol jamming control method based on intensified learning
CN107864102A (en) * 2017-11-22 2018-03-30 浙江工商大学 A kind of SDN data centers jamming control method based on Sarsa
CN108211362A (en) * 2017-12-26 2018-06-29 浙江大学 A kind of non-player role fight policy learning method based on depth Q learning networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10977551B2 (en) * 2016-12-14 2021-04-13 Microsoft Technology Licensing, Llc Hybrid reward architecture for reinforcement learning
CN109194583B (en) * 2018-08-07 2021-05-14 中国地质大学(武汉) Network congestion link diagnosis method and system based on deep reinforcement learning
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A kind of proposed algorithm based on deeply study
CN109471847B (en) * 2018-09-18 2020-06-09 华中科技大学 I/O congestion control method and control system
CN109710741A (en) * 2018-12-27 2019-05-03 中山大学 A kind of mask method the problem of study based on deeply towards online answer platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107171842A (en) * 2017-05-22 2017-09-15 南京大学 Multi-path transmission protocol jamming control method based on intensified learning
CN107864102A (en) * 2017-11-22 2018-03-30 浙江工商大学 A kind of SDN data centers jamming control method based on Sarsa
CN108211362A (en) * 2017-12-26 2018-06-29 浙江大学 A kind of non-player role fight policy learning method based on depth Q learning networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Deep Reinforcement Learning Perspective on Internet Congestion Control; Nathan Jay, et al.; International Conference on Machine Learning; 20190531; full text *
Rax: Deep Reinforcement Learning for Congestion Control; Maximilian Bachl, et al.; IEEE International Conference on Communications; 20190715; full text *

Also Published As

Publication number Publication date
CN110581808A (en) 2019-12-17

Similar Documents

Publication Publication Date Title
CN110581808B (en) Congestion control method and system based on deep reinforcement learning
CN107634911B (en) Adaptive congestion control method based on deep learning in information center network
CN110278149B (en) Multi-path transmission control protocol data packet scheduling method based on deep reinforcement learning
CN107864084B (en) The transmission method and device of data packet
CN111818570B (en) Intelligent congestion control method and system for real network environment
WO2021103706A1 (en) Data packet sending control method, model training method, device, and system
CN112770353B (en) Method and device for training congestion control model and method and device for controlling congestion
KR102246465B1 (en) Method and apparatus of allocating resource of terminal in wireless communication system
CN113595923A (en) Network congestion control method and device
CN107070802A (en) Wireless sensor network Research of Congestion Control Techniques based on PID controller
CN113132490A (en) MQTT protocol QoS mechanism selection scheme based on reinforcement learning
CN109698925A (en) Real-time video jamming control method and device based on data-driven
EP4161029A1 (en) System and method for adapting transmission rate computation by a content transmitter
CN113825171A (en) Network congestion control method, device, equipment and medium
CN114760644A (en) Multilink transmission intelligent message scheduling method based on deep reinforcement learning
Xia et al. A multi-objective reinforcement learning perspective on internet congestion control
CN113726656A (en) Method and device for forwarding delay sensitive flow
WO2024001763A1 (en) Data transmission processing method and device, storage medium, and electronic device
CN117082008A (en) Virtual elastic network data transmission scheduling method, computer device and storage medium
Seo et al. Fairness enhancement of TCP congestion control using reinforcement learning
US9877338B1 (en) Wireless scheduler bandwidth estimation for quick start
US20230231810A1 (en) System and method for adapting transmission rate computation by a content transmitter
CN116389375A (en) Network queue management method, device and router for live video stream
CN114845338A (en) Random back-off method for user access
CN114866196A (en) Data packet retransmission method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant