CN113055229A - Wireless network self-selection protocol method based on DDQN - Google Patents

Wireless network self-selection protocol method based on DDQN

Info

Publication number
CN113055229A
CN113055229A (application number CN202110249773.3A)
Authority
CN
China
Prior art keywords
network
state
action
reward
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110249773.3A
Other languages
Chinese (zh)
Other versions
CN113055229B (en)
Inventor
严海蓉
王重阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110249773.3A priority Critical patent/CN113055229B/en
Publication of CN113055229A publication Critical patent/CN113055229A/en
Application granted granted Critical
Publication of CN113055229B publication Critical patent/CN113055229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/0005Control or signalling for completing the hand-off
    • H04W36/0055Transmission or use of information for re-establishing the radio link
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/0005Control or signalling for completing the hand-off
    • H04W36/0083Determination of parameters used for hand-off, e.g. generation or modification of neighbour cell lists
    • H04W36/00837Determination of triggering parameters for hand-off
    • H04W36/008375Determination of triggering parameters for hand-off based on historical data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/24Reselection being triggered by specific parameters
    • H04W36/30Reselection being triggered by specific parameters by measured or perceived connection quality data

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a DDQN-based wireless network protocol self-selection method aimed at complex wireless network environments in which multiple protocols are fused. The method comprises the following steps: 1) acquiring the current network environment quality parameters and determining the node service type in real time through an environment agent module; 2) on the basis of 1), performing noise reduction and normalization on the data, determining the node service type through the analytic hierarchy process, and performing feature extraction; 3) on the basis of 2), inputting the data into the DDQN decision network for real-time training and applying the execution result so that the network state tends toward stability. According to the invention, features are extracted directly without elaborate preprocessing of the data, the collected historical data serve as training data, and the strengths of deep learning are exploited to effectively improve the learning speed and decision performance of the reinforcement learning algorithm.

Description

Wireless network self-selection protocol method based on DDQN
Technical Field
The invention relates to a network protocol self-selection method for heterogeneous wireless networks, aimed at complex wireless network environments in which multiple protocols are fused.
Background
With the continuous development of network technologies, the technologies widely deployed around the world now overlap to a great extent. In the current network environment, WLAN and cellular networks are the most common heterogeneous combination and play an important role in modern information communication; operators also deploy WLAN hotspots in user-dense areas such as shopping malls, schools and office buildings to relieve the pressure on cellular networks.
The next-generation heterogeneous network integrates multiple protocols in a complex environment and needs to provide reliable network services to users anytime and anywhere. Before this can be achieved, however, the network environment must mature: wireless network coverage, network self-configuration, automatic management of network devices and similar functions need to be solved. In the existing network environment it is difficult for a single network protocol to complete the above configuration, but comprehensive resource scheduling of the current heterogeneous network can be achieved through suitable algorithms, and efficient switching among heterogeneous network resources will gradually become a research hotspot. With the further development of wireless communication, certain requirements will be placed on the scalability and flexibility of heterogeneous networks.
Reinforcement learning, as a tool that can make decisions meeting the requirements of the operating environment under uncertainty, can adjust in a targeted way to dynamic changes in the network, so that a heterogeneous wireless network becomes a scheme that automatically adapts to changes in the user's scenario and the network environment is optimized. Reinforcement learning is a branch of machine learning in which an agent is continuously adjusted within an environment so as to maximize a specific index (the reward). In a wireless network, the movement of nodes and the mutual interference among nodes make the network environment complex; compared with traditional machine-learning algorithms, reinforcement learning has greater potential and higher accuracy. Features are extracted directly without preprocessing the data, the collected historical data are used as training data, and the strong advantages of deep learning are exploited to effectively improve the learning speed and decision performance of the reinforcement learning algorithm.
Disclosure of Invention
In view of the above characteristics of existing networks, the present invention provides a wireless network protocol self-selection method based on DDQN (Double Deep Q-Network, i.e. deep reinforcement learning with double Q-learning). The method comprises: a processing scheme for network quality data; a feature-extraction scheme based on deep learning; and a network protocol selection scheme based on DDQN. The purpose of the invention is achieved by the following technical scheme.
A DDQN-based wireless network protocol self-selection method comprises the following steps:
1) acquiring the current network environment quality parameters and determining the node service type in real time through an environment agent module;
2) on the basis of 1), performing noise reduction and normalization on the data, determining the node service type through the analytic hierarchy process, and performing feature extraction;
3) on the basis of 2), inputting the data into the DDQN decision network for real-time training and applying the execution result so that the network state tends toward stability.
Drawings
Fig. 1 is an overall flow diagram of the DDQN-based wireless network protocol self-selection method;
Fig. 2 is an operation diagram of the DDQN algorithm.
Detailed Description
The specific steps of the DDQN-based wireless network protocol self-selection method implemented according to the present invention are described below with reference to Fig. 1:
the first step is as follows: acquiring the current network environment quality parameters and the node service type in real time through the environment agent module, and determining the states, actions and reward values;
To use the reinforcement learning algorithm, the state, action and reward value are defined, and the network quality parameters are input as the state values.
State space definition: the state space S of a terminal at time t is defined such that s_mn ∈ S represents the state of terminal m when it accesses the n-th network and exchanges information in that network. The state space is:
S = {s_1, s_2, …, s_mn}    (1)
State definition: in a heterogeneous network, the network service state is generally described by indexes such as throughput, delay, packet-loss rate and network load, while user characteristics are described by network signal strength, node distance, node power consumption, cost and signal-to-noise ratio. Here the network state is described by the average throughput T, delay D, signal strength P and node distance W, so the network quality Φ may be expressed as:
Φ = T × D × P × W    (2)
Action space definition: an action space needs to be set for the agent to select from; it is defined as:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that the node uses the n-th network protocol.
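As a concrete illustration of the state and action definitions above, the following minimal sketch (with hypothetical field values and protocol names that are not part of the original description) represents a terminal state by the four indicators T, D, P, W of equation (2) and an action by the index of the network protocol the node is instructed to use.

```python
from dataclasses import dataclass

@dataclass
class NetworkState:
    """State s_mn observed by terminal m in network n: the indicators
    used to describe network quality (equation (2))."""
    throughput: float       # average throughput T
    delay: float            # delay D
    signal_strength: float  # signal strength P
    node_distance: float    # node distance W

    def quality(self) -> float:
        """Network quality Φ = T × D × P × W as defined in equation (2)."""
        return (self.throughput * self.delay
                * self.signal_strength * self.node_distance)

# Action space A = {a_1, ..., a_n}: action a_n means "use the n-th network
# protocol". The protocol list here is a hypothetical example.
PROTOCOLS = ["WLAN", "cellular", "other"]
ACTIONS = list(range(len(PROTOCOLS)))

# Illustrative values only.
state = NetworkState(throughput=12.0, delay=30.0,
                     signal_strength=0.8, node_distance=15.0)
print(state.quality(), ACTIONS)
```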
Reward value definition: each node has its own specific service characteristics and service type when it is created, so even in the same network environment different nodes correspond to different reward values. Combined with actual requirements, the node service types are divided into the following categories:
1. High real-time requirement: the delay should be as low as possible and the transmission rate high; if the delay is too large, the service is affected. A certain throughput is also required to ensure data reliability.
2. Extremely high throughput requirement: the real-time requirement is weaker than for service type 1, but a large data flow is required.
3. Strict delay requirement: network traffic in emergencies must be handled, delay reduced as far as possible and user experience improved.
4. Only sufficient throughput needs to be guaranteed.
The access service network parameters consist of QoS parameters; a decision matrix is established for the network QoS and the parameter weights are solved. The decision matrix takes the form
M_i = (m_jk),  j, k = 1, 2, 3
where each element m_jk represents the relative importance of one QoS parameter over another, as defined in Table 1 below, and the decision matrix satisfies m_jk > 0, m_kj = 1/m_jk and m_jj = 1.
Table 1 (relationship of attributes to parameters) defines the importance scale used for the pairwise comparisons, with the odd ratings 1, 3, 5, 7 and 9 denoting successively greater relative importance; 2, 4, 6 and 8, which are not shown in Table 1, represent intermediate values between adjacent judgments. Since four service types are distinguished when defining the reward value and three attributes (throughput, delay and signal strength) are considered, each decision matrix is a 3 × 3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denotes service types 1 to 4; a decision matrix is then established for each of the four services according to its QoS-parameter requirements.
According to the current network service classification standard RFC 2474, the attribute values within a service class are determined by the DSCP (Differentiated Services Code Point). The DSCP encodes the IP priority using the 6 used bits (and 2 unused bits) of the type-of-service (TOS) byte in each IP packet header. This field can be used for traffic classification: the larger the value, the higher the priority. Values range from 0 to 63, so 64 classes can be matched, which are grouped by value into service classes; the relation between service attributes and parameters can therefore be determined from the DSCP field carried in the IP packet.
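To make the DSCP-based classification concrete, the short sketch below extracts the 6-bit DSCP from the TOS/traffic-class byte of an IP header; the subsequent grouping of the 64 values into the four service types is a hypothetical mapping added for illustration only, not the mapping prescribed by the method.

```python
def extract_dscp(tos_byte: int) -> int:
    """Return the 6-bit DSCP from the 8-bit TOS/traffic-class byte.

    The DSCP occupies the 6 most significant bits; the remaining 2 bits
    (ECN) are not used for classification.
    """
    return (tos_byte >> 2) & 0x3F          # value in 0..63


def dscp_to_service_type(dscp: int) -> int:
    """Hypothetical grouping of the 64 DSCP values into the four service
    types defined above (1..4); the value ranges are an assumption made
    for illustration only."""
    class_selector = dscp >> 3             # 0..7, one class per 8 values
    return min(class_selector // 2, 3) + 1  # collapse to service types 1..4


if __name__ == "__main__":
    tos = 0xB8                             # e.g. DSCP 46 (expedited forwarding)
    dscp = extract_dscp(tos)
    print(dscp, dscp_to_service_type(dscp))
```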
For these four types of services, i takes the values 1, 2, 3 and 4 in turn. The eigenvector corresponding to the largest eigenvalue of M_i is normalized, i.e.
ω_j = e_j / (e_1 + e_2 + e_3),  j = 1, 2, 3
where e = (e_1, e_2, e_3) is the principal eigenvector, and each value in the normalized eigenvector is the weight of the corresponding network QoS parameter. In the above four cases the different service types place different requirements on the network parameters, and these differences affect the division of the reward-value weights. Considering the network as a whole, the ultimate goal is to optimize the overall network quality through the protocols the nodes are selected to use, with the reward value being a function strongly correlated with the network state:
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ. Thus, for a particular service B, the reward function R is expressed in terms of the network state V_t and is solved in the next step:
R = f_B(V_t)    (6)
the access of the node will affect the change of the network parameters, and after the action is executed, the network state needs to be measured and corresponding rewards need to be fed back. When the executed action causes the network throughput to be increased, the time delay to be reduced and the signal strength to be enhanced, the action is an effective action; conversely, an invalid action is determined when the executed action causes the network throughput to be reduced, the time delay to be reduced and the signal strength to be reduced. Thus the average throughput a is taken into account when calculating the rewardavgAverage time delay betaavgSignal intensity γ.
The second step: normalize the data on the basis of 1), determine the node service type and determine the reward function.
Different network parameters differ greatly in unit and magnitude, so normalization is needed: all values are transformed linearly and mapped to [0, 1].
Min-max standardization is used to eliminate the influence of unit differences:
f_t(x) = (x − x_min) / (x_max − x_min)    (7)
Normalizing with this equation yields the network's normalized average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ).
Combining the above equations gives the reward function:
R = ω_1·f_t(α)_avg + ω_2·f_t(β)_avg + ω_3·f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network's average throughput, delay and signal strength, given by the normalized eigenvector of the decision matrix.
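A minimal sketch of the second step under assumed values: it derives the weights ω_1, ω_2, ω_3 from an illustrative 3 × 3 decision matrix by normalizing its principal eigenvector (analytic hierarchy process), min-max normalizes sampled throughput, delay and signal-strength measurements, and forms the weighted sum of equation (8). The matrix entries and the sample measurements are assumptions, not values taken from the method.

```python
import numpy as np

def ahp_weights(decision_matrix: np.ndarray) -> np.ndarray:
    """AHP weights: the principal eigenvector of the pairwise comparison
    matrix, normalized so that its components sum to 1."""
    eigvals, eigvecs = np.linalg.eig(decision_matrix)
    principal = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    return principal / principal.sum()

def min_max(x: np.ndarray) -> np.ndarray:
    """Min-max normalization mapping the samples to [0, 1] (equation (7))."""
    return (x - x.min()) / (x.max() - x.min())

def reward(throughput, delay, signal, weights) -> float:
    """Weighted sum of equation (8): each metric is min-max normalized over
    the observation window, averaged, and combined with the AHP weights."""
    terms = [min_max(np.asarray(throughput)).mean(),
             min_max(np.asarray(delay)).mean(),
             min_max(np.asarray(signal)).mean()]
    return float(np.dot(weights, terms))

# Illustrative decision matrix for a delay-sensitive service (assumed values);
# rows/columns correspond to throughput, delay, signal strength.
M = np.array([[1.0, 1/3, 3.0],
              [3.0, 1.0, 5.0],
              [1/3, 1/5, 1.0]])
w = ahp_weights(M)                      # here the delay weight dominates
r = reward([12.0, 15.0, 9.0],           # sampled throughput (Mbit/s)
           [30.0, 22.0, 41.0],          # sampled delay (ms)
           [0.70, 0.85, 0.55],          # sampled normalized signal strength
           w)
print(w, r)
```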
The third step: on the basis of 2), input the data into the DDQN decision network for real-time training and apply the execution result so that the network state tends toward stability.
A major drawback of DQN is that, although the argmax() operation drives the Q values toward the target quickly, it tends to overestimate them, producing a large bias in the learned model. To solve this problem, the error can be reduced by separating the selection of the target action from the calculation of the target Q value. The network information takes discrete states, and DDQN handles discrete-state data well.
With reference to Fig. 2, DQN is implemented with two neural networks, Q-MainNet and Q-target. DDQN likewise operates with two networks, but calculates the target Q value differently.
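The difference in target computation can be illustrated with hypothetical Q-value arrays (a sketch, not the method's implementation): the DQN target both selects and evaluates the next action with the target network, whereas the DDQN target of formula (9) selects the action with Q-MainNet and evaluates it with Q-target.

```python
import numpy as np

gamma = 0.9                                   # decay (discount) coefficient
r_next = 1.0                                  # reward R_{t+1} (assumed value)

# Hypothetical Q values for the actions available in state S_{t+1}.
q_main_next   = np.array([0.7, 1.4, 0.9])     # Q-MainNet(S_{t+1}, ·; θ)
q_target_next = np.array([0.8, 1.1, 1.6])     # Q-target(S_{t+1}, ·; θ⁻)

# DQN: the target network both selects and evaluates the action.
target_dqn = r_next + gamma * q_target_next.max()

# DDQN (formula (9)): Q-MainNet selects the action, Q-target evaluates it,
# which reduces the overestimation caused by the shared argmax.
a_star = int(q_main_next.argmax())
target_ddqn = r_next + gamma * q_target_next[a_star]

print(target_dqn, target_ddqn)
```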
First, the state space S and the action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with random parameters θ, where θ denotes the network parameters: Q-MainNet's θ is set randomly at initialization and Q-target's parameters are set to θ⁻ = θ. With t denoting the current time step, the agent module reads the current network state information S_t and inputs it to the Q-MainNet network, which outputs the Q values of the different actions in state S_t. Following the ε-greedy strategy, the Q-MainNet network randomly selects an action a_t ∈ A with probability ε, or with probability 1 − ε selects the action
a_t = argmax_{a ∈ A} Q(S_t, a; θ)
The terminal executes the corresponding action in the heterogeneous wireless network; the collected network data are processed into the format required by the algorithm and sent to the control layer for processing, yielding the throughput α, delay β and signal strength γ. These are normalized separately, the weights of f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process according to the service type, and the weighted sum gives the reward value R. Q-MainNet combines the system state and reward value through formula (9),
TargetQ = R_{t+1} + γ·Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)    (9)
to calculate the target value, where R_{t+1} is the reward computed in state S_{t+1} and γ is the decay coefficient: the value of the agent in the current state is in fact the sum of all possible future rewards discounted back to the present moment. After the action is executed, the system enters the next state S_{t+1}.
The Q-MainNet network stores the memory tuple (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward r_t and the network state at time t+1, in an experience pool. At each step the Q-target network samples randomly from this pool and, together with the output of the Q-MainNet network, computes the loss between the two Q networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a; θ_t))², and a gradient-descent step is executed. Every G steps the parameters of the Q-MainNet network are copied to the Q-target network, and training continues in this loop.
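A possible shape of this training loop is sketched below under simplifying assumptions: the environment interface HeterogeneousNetEnv, the state dimension, the layer sizes and the hyperparameters are all hypothetical, and small fully connected networks stand in for Q-MainNet and Q-target. It follows the flow described above: ε-greedy action selection, storage of (s_t, a_t, r_t, s_{t+1}) in an experience pool, computation of the formula-(9) target, a gradient-descent step on (TargetQ − Q)², and copying of the parameters to Q-target every G steps.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3        # assumed: (T, D, P, W) state, 3 protocols
GAMMA, EPSILON, G_SYNC = 0.9, 0.1, 50


def build_net() -> nn.Module:
    """Small fully connected Q network (layer sizes are assumptions)."""
    return nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                         nn.Linear(32, N_ACTIONS))


q_main, q_target = build_net(), build_net()
q_target.load_state_dict(q_main.state_dict())      # θ⁻ ← θ at initialization
optimizer = torch.optim.Adam(q_main.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                       # experience pool


def select_action(state: np.ndarray) -> int:
    """ε-greedy: random action with probability ε, otherwise argmax_a Q(s, a; θ)."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_main(torch.as_tensor(state, dtype=torch.float32)).argmax())


def train_step(batch_size: int = 32) -> None:
    """One DDQN update: sample from the pool, build the formula-(9) target,
    and minimize (TargetQ − Q(s, a; θ))² by gradient descent."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = map(np.array, zip(*random.sample(replay, batch_size)))
    s, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in (s, s2))
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)

    with torch.no_grad():
        a_star = q_main(s2).argmax(dim=1)                        # select with Q-MainNet
        target = r + GAMMA * q_target(s2).gather(1, a_star.unsqueeze(1)).squeeze(1)
    q_sa = q_main(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; θ)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Interaction loop skeleton; HeterogeneousNetEnv is a hypothetical environment
# exposing reset()/step(action) and returning the reward of equation (8).
# env = HeterogeneousNetEnv()
# state = env.reset()
# for step in range(10_000):
#     action = select_action(state)
#     next_state, reward = env.step(action)
#     replay.append((state, action, reward, next_state))
#     train_step()
#     if step % G_SYNC == 0:
#         q_target.load_state_dict(q_main.state_dict())          # sync every G steps
#     state = next_state
```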

Claims (1)

1. A DDQN-based wireless network protocol self-selection method, characterized by comprising the following steps:
the first step: acquiring the current network environment quality parameters and the node service type in real time through an environment agent module, and determining the states, actions and reward values;
state space definition: the state space S of a terminal at time t is defined such that s_mn ∈ S represents the state of terminal m when it accesses the n-th network and exchanges information in that network; the state space is:
S = {s_1, s_2, …, s_mn}    (1)
state definition: the network state is described by the average throughput T, delay D, signal strength P and node distance W, and the network quality Φ is then expressed as:
Φ = T × D × P × W    (2)
action space definition: an action space needs to be set for the agent to select from, defined as:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that the node uses the n-th network protocol;
the access service network parameters consist of QoS parameters; a decision matrix is established for the network QoS and the parameter weights are solved; the decision matrix takes the form
M_i = (m_jk),  j, k = 1, 2, 3
where each element m_jk represents the relative importance of one QoS parameter over another, as defined in Table 1, and the decision matrix satisfies m_jk > 0, m_kj = 1/m_jk and m_jj = 1;
Table 1 (relationship of attributes to parameters) defines the importance scale for the pairwise comparisons, the odd ratings 1, 3, 5, 7 and 9 denoting successively greater relative importance; 2, 4, 6 and 8, which are not shown in the table, represent intermediate values between adjacent judgments; since four service types are distinguished when defining the reward value and three attributes (throughput, delay and signal strength) are considered, each decision matrix is a 3 × 3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denotes service types 1 to 4, and a decision matrix is then established for each of the four services according to its QoS-parameter requirements;
according to the current network service classification standard RFC 2474, the attribute values within a service class are determined by the DSCP; the DSCP encodes the IP priority using the 6 used bits and 2 unused bits of the type-of-service (TOS) byte in each IP packet header; this field can be used for traffic classification, the larger the value the higher the priority; values range from 0 to 63, so 64 classes can be matched, which are grouped by value into service classes; the relation between service attributes and parameters can therefore be determined from the DSCP field carried in the IP packet;
for the four types of services, i takes the values 1, 2, 3 and 4 in turn; the eigenvector corresponding to the largest eigenvalue of M_i is normalized, i.e.
ω_j = e_j / (e_1 + e_2 + e_3),  j = 1, 2, 3, where e is the principal eigenvector;
each value in the normalized eigenvector is the weight of the corresponding network QoS parameter; in the above four cases the different service types place different requirements on the network parameters, and these differences affect the division of the reward-value weights; considering the network as a whole, the ultimate goal is to optimize the overall network quality through the protocols the nodes are selected to use, with the reward value being a function strongly correlated with the network state;
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ; thus, for a particular service B, the reward function R is expressed in terms of the network state V_t and is solved in the next step:
R = f_B(V_t)    (6)
the access of a node affects the network parameters; after an action is executed, the network state is measured and the corresponding reward is fed back; when the executed action increases the network throughput, reduces the delay and strengthens the signal, it is an effective action; conversely, when the executed action reduces the network throughput, increases the delay and weakens the signal, it is an invalid action; the average throughput α_avg, the average delay β_avg and the signal strength γ are therefore taken into account when calculating the reward;
the second step: normalizing the data on the basis of 1), determining the node service type and determining the reward function;
min-max standardization is used to eliminate the influence of unit differences:
f_t(x) = (x − x_min) / (x_max − x_min)    (7)
normalizing with this equation yields the network's normalized average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ);
the reward function is obtained by combining the above equations:
R = ω_1·f_t(α)_avg + ω_2·f_t(β)_avg + ω_3·f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network's average throughput, delay and signal strength, given by the normalized eigenvector of the decision matrix;
the third step: on the basis of 2), inputting the data into the DDQN decision network for real-time training and applying the execution result so that the network state tends toward stability;
first, the state space S and the action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with random parameters θ, where θ denotes the network parameters: Q-MainNet's θ is set randomly at initialization and Q-target's parameters are set to θ⁻ = θ; with t denoting the current time step, the agent module reads the current network state information S_t and inputs it to the Q-MainNet network, which outputs the Q values of the different actions in state S_t; following the ε-greedy strategy, the Q-MainNet network randomly selects an action a_t ∈ A with probability ε, or with probability 1 − ε selects the action
a_t = argmax_{a ∈ A} Q(S_t, a; θ);
the terminal executes the corresponding action in the heterogeneous wireless network; the collected network data are processed into the format required by the algorithm and sent to the control layer for processing, yielding the throughput α, delay β and signal strength γ; these are normalized separately, the weights of f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process according to the service type, and the weighted sum gives the reward value R; Q-MainNet combines the system state and reward value through formula (9),
TargetQ = R_{t+1} + γ·Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)    (9)
to calculate the target value, where R_{t+1} is the reward computed in state S_{t+1} and γ is the decay coefficient: the value of the agent in the current state is in fact the sum of all possible future rewards discounted back to the present moment; after the action is executed, the system enters the next state S_{t+1};
the Q-MainNet network stores the memory tuple (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward r_t and the network state at time t+1, in an experience pool; at each step the Q-target network samples randomly from this pool and, together with the output of the Q-MainNet network, computes the loss between the two Q networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a; θ_t))², and a gradient-descent step is executed; after each iteration, the parameters of the Q-MainNet network are copied to the Q-target network; training continues in this loop.
CN202110249773.3A 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN Active CN113055229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110249773.3A CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110249773.3A CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Publications (2)

Publication Number Publication Date
CN113055229A true CN113055229A (en) 2021-06-29
CN113055229B CN113055229B (en) 2023-10-27

Family

ID=76510598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110249773.3A Active CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Country Status (1)

Country Link
CN (1) CN113055229B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327556A (en) * 2013-07-04 2013-09-25 中国人民解放军理工大学通信工程学院 Dynamic network selection method for optimizing quality of experience (QoE) of user in heterogeneous wireless network
CN105208624A (en) * 2015-08-27 2015-12-30 重庆邮电大学 Service-based multi-access network selection system and method in heterogeneous wireless network
CN107889195A (en) * 2017-11-16 2018-04-06 电子科技大学 A kind of self study heterogeneous wireless network access selection method of differentiated service
WO2021013368A1 (en) * 2019-07-25 2021-01-28 Telefonaktiebolaget Lm Ericsson (Publ) Machine learning based adaption of qoe control policy
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张媛媛等: "一种新的面向用户多业务QoS需求的网络接入选择算法", 《计算机科学》, vol. 42, no. 3, 31 March 2015 (2015-03-31) *
马礼等: "基于Markov模型的接入网络选择算法", 《计算机工程》, vol. 45, no. 5, 31 May 2019 (2019-05-31) *

Also Published As

Publication number Publication date
CN113055229B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2021233053A1 (en) Computing offloading method and communication apparatus
Sun et al. Autonomous resource slicing for virtualized vehicular networks with D2D communications based on deep reinforcement learning
CN111866954B (en) User selection and resource allocation method based on federal learning
CN111510879B (en) Heterogeneous Internet of vehicles network selection method and system based on multi-constraint utility function
CN106534333A (en) Bidirectional selection computing unloading method based on MEC and MCC
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
WO2019184836A1 (en) Data analysis device, and multi-model co-decision system and method
Zhu et al. Adaptive multi-access algorithm for multi-service edge users in 5G ultra-dense heterogeneous networks
Jiang et al. A decoupled learning strategy for massive access optimization in cellular IoT networks
CN107889195B (en) Self-learning heterogeneous wireless network access selection method for distinguishing services
Sekaran et al. 5G integrated spectrum selection and spectrum access using AI-based frame work for IoT based sensor networks
CN114465945B (en) SDN-based identification analysis network construction method
CN113727420B (en) Multimode access network selection device and method
CN113676357B (en) Decision method for edge data processing in power internet of things and application thereof
CN114726743A (en) Service function chain deployment method based on federal reinforcement learning
Rao et al. Network selection in heterogeneous environment: A step toward always best connected and served
Wu et al. Link congestion prediction using machine learning for software-defined-network data plane
CN112804702B (en) Multi-link air-ground data exchange link performance evaluation method based on utility function
CN112508408A (en) Mapping model construction method of wireless resource management index under edge calculation
CN113055229A (en) Wireless network self-selection protocol method based on DDQN
CN113543160A (en) 5G slice resource allocation method and device, computing equipment and computer storage medium
CN106535226A (en) Multi-network cooperative selection and aggregation method based on synergetic theory
CN112260941B (en) Heterogeneous network data fusion method
CN114449536A (en) 5G ultra-dense network multi-user access selection method based on deep reinforcement learning
CN111723978A (en) Index evaluation method for adapting to difference demands of various power services based on virtual mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant