CN113055229A - Wireless network self-selection protocol method based on DDQN - Google Patents

Wireless network self-selection protocol method based on DDQN

Info

Publication number
CN113055229A
CN113055229A (application number CN202110249773.3A)
Authority
CN
China
Prior art keywords
network
state
action
reward
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110249773.3A
Other languages
Chinese (zh)
Other versions
CN113055229B (en)
Inventor
严海蓉
王重阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110249773.3A priority Critical patent/CN113055229B/en
Publication of CN113055229A publication Critical patent/CN113055229A/en
Application granted granted Critical
Publication of CN113055229B publication Critical patent/CN113055229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/0005Control or signalling for completing the hand-off
    • H04W36/0055Transmission or use of information for re-establishing the radio link
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/0005Control or signalling for completing the hand-off
    • H04W36/0083Determination of parameters used for hand-off, e.g. generation or modification of neighbour cell lists
    • H04W36/00837Determination of triggering parameters for hand-off
    • H04W36/008375Determination of triggering parameters for hand-off based on historical data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/24Reselection being triggered by specific parameters
    • H04W36/30Reselection being triggered by specific parameters by measured or perceived connection quality data

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a DDQN-based wireless network protocol self-selection method aimed at complex wireless network environments in which multiple protocols are fused. The method comprises the following steps: 1) acquiring the current network environment quality parameters and determining the node service type in real time through an environment agent module; 2) on the basis of 1), performing noise reduction and normalization on the data, determining the node service type through the analytic hierarchy process, and performing feature extraction; 3) on the basis of 2), inputting the data into the DDQN decision network for real-time training and applying the execution result so that the network state tends toward stability. According to the invention, features are extracted directly without elaborate preprocessing of the data, the collected historical data serve as training data, and the strengths of deep learning are exploited to effectively improve the learning speed and decision performance of the reinforcement learning algorithm.

Description

Wireless network self-selection protocol method based on DDQN
Technical Field
The invention relates to a network protocol self-selection method for heterogeneous wireless networks, aimed at complex wireless network environments in which multiple protocols are fused.
Background
With the continuous development of network technologies, the technologies widely deployed around the world now overlap to a great extent. In the current network environment, WLAN and cellular networks are the most common heterogeneous combination and play an important role in modern information communication; operators also deploy WLAN hotspots in user-dense areas such as shopping malls, schools and office buildings to relieve the pressure on cellular networks.
The next-generation heterogeneous network integrates multiple protocols in a complex environment and needs to provide reliable network services to users anytime and anywhere. Before this can be achieved, however, the network environment must mature: wireless network coverage, network self-configuration, automatic management of network devices and similar functions need to be solved. In the existing network environment it is difficult for a single network protocol to complete the above configuration, but comprehensive resource scheduling of the current heterogeneous network can be achieved through suitable algorithms, and efficient switching among heterogeneous network resources will gradually become a research hotspot. With the further development of wireless communication, certain requirements will be placed on the scalability and flexibility of heterogeneous networks.
Reinforcement learning, as a tool that can make decisions meeting the requirements of the operating environment under uncertainty, can adjust in a targeted way to dynamic changes in the network, so that a heterogeneous wireless network becomes a scheme that automatically adapts to changes in the user's scenario and the network environment is optimized. Reinforcement learning is a branch of machine learning in which an agent is continuously adjusted within an environment so as to maximize a specific index (the reward). In a wireless network, the movement of nodes and the mutual interference among nodes make the network environment complex; compared with traditional machine-learning algorithms, reinforcement learning has greater potential and higher accuracy. Features are extracted directly without preprocessing the data, the collected historical data are used as training data, and the strong advantages of deep learning are exploited to effectively improve the learning speed and decision performance of the reinforcement learning algorithm.
Disclosure of Invention
In view of the above characteristics of existing networks, the present invention provides a wireless network protocol self-selection method based on DDQN (Double Deep Q-Network, i.e. deep reinforcement learning with double Q-learning). The method comprises: a processing scheme for network quality data; a feature-extraction scheme based on deep learning; and a network protocol selection scheme based on DDQN. The purpose of the invention is achieved by the following technical scheme.
A DDQN-based wireless network protocol self-selection method comprises the following steps:
1) acquiring the current network environment quality parameters and determining the node service type in real time through an environment agent module;
2) on the basis of 1), performing noise reduction and normalization on the data, determining the node service type through the analytic hierarchy process, and performing feature extraction;
3) on the basis of 2), inputting the data into the DDQN decision network for real-time training and applying the execution result so that the network state tends toward stability.
Drawings
Fig. 1 is an overall flow diagram of the DDQN-based wireless network protocol self-selection method;
Fig. 2 is an operation diagram of the DDQN algorithm.
Detailed Description
The specific steps of the DDQN-based wireless network protocol self-selection method implemented according to the present invention are described below with reference to Fig. 1:
the first step is as follows: acquiring the current network environment quality parameters and the node service type in real time through the environment agent module, and determining the states, actions and reward values;
To use the reinforcement learning algorithm, the state, action and reward value are defined, and the network quality parameters are input as the state values.
State space definition: the state space S of a terminal at time t is defined such that s_mn ∈ S represents the state of terminal m when it accesses the n-th network and exchanges information in that network. The state space is:
S = {s_1, s_2, …, s_mn}    (1)
State definition: in a heterogeneous network, the network service state is generally described by indexes such as throughput, delay, packet-loss rate and network load, while user characteristics are described by network signal strength, node distance, node power consumption, cost and signal-to-noise ratio. Here the network state is described by the average throughput T, delay D, signal strength P and node distance W, so the network quality Φ may be expressed as:
Φ = T × D × P × W    (2)
Action space definition: an action space needs to be set for the agent to select from; it is defined as:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that the node uses the n-th network protocol.
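As a concrete illustration of the state and action definitions above, the following minimal sketch (with hypothetical field values and protocol names that are not part of the original description) represents a terminal state by the four indicators T, D, P, W of equation (2) and an action by the index of the network protocol the node is instructed to use.

```python
from dataclasses import dataclass

@dataclass
class NetworkState:
    """State s_mn observed by terminal m in network n: the indicators
    used to describe network quality (equation (2))."""
    throughput: float       # average throughput T
    delay: float            # delay D
    signal_strength: float  # signal strength P
    node_distance: float    # node distance W

    def quality(self) -> float:
        """Network quality Φ = T × D × P × W as defined in equation (2)."""
        return (self.throughput * self.delay
                * self.signal_strength * self.node_distance)

# Action space A = {a_1, ..., a_n}: action a_n means "use the n-th network
# protocol". The protocol list here is a hypothetical example.
PROTOCOLS = ["WLAN", "cellular", "other"]
ACTIONS = list(range(len(PROTOCOLS)))

# Illustrative values only.
state = NetworkState(throughput=12.0, delay=30.0,
                     signal_strength=0.8, node_distance=15.0)
print(state.quality(), ACTIONS)
```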
Reward value definition: each node has its own specific service characteristics and service type when it is created, so even in the same network environment different nodes correspond to different reward values. Combined with actual requirements, the node service types are divided into the following categories:
1. High real-time requirement: the delay should be as low as possible and the transmission rate high; if the delay is too large, the service is affected. A certain throughput is also required to ensure data reliability.
2. Extremely high throughput requirement: the real-time requirement is weaker than for service type 1, but a large data flow is required.
3. Strict delay requirement: network traffic in emergencies must be handled, delay reduced as far as possible and user experience improved.
4. Only sufficient throughput needs to be guaranteed.
The access service network parameters consist of QoS parameters; a decision matrix is established for the network QoS and the parameter weights are solved. The decision matrix takes the form
M_i = (m_jk),  j, k = 1, 2, 3
where each element m_jk represents the relative importance of one QoS parameter over another, as defined in Table 1 below, and the decision matrix satisfies m_jk > 0, m_kj = 1/m_jk and m_jj = 1.
Table 1 (relationship of attributes to parameters) defines the importance scale used for the pairwise comparisons, with the odd ratings 1, 3, 5, 7 and 9 denoting successively greater relative importance; 2, 4, 6 and 8, which are not shown in Table 1, represent intermediate values between adjacent judgments. Since four service types are distinguished when defining the reward value and three attributes (throughput, delay and signal strength) are considered, each decision matrix is a 3 × 3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denotes service types 1 to 4; a decision matrix is then established for each of the four services according to its QoS-parameter requirements.
According to the current network service classification standard RFC 2474, the attribute values within a service class are determined by the DSCP (Differentiated Services Code Point). The DSCP encodes the IP priority using the 6 used bits (and 2 unused bits) of the type-of-service (TOS) byte in each IP packet header. This field can be used for traffic classification: the larger the value, the higher the priority. Values range from 0 to 63, so 64 classes can be matched, which are grouped by value into service classes; the relation between service attributes and parameters can therefore be determined from the DSCP field carried in the IP packet.
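To make the DSCP-based classification concrete, the short sketch below extracts the 6-bit DSCP from the TOS/traffic-class byte of an IP header; the subsequent grouping of the 64 values into the four service types is a hypothetical mapping added for illustration only, not the mapping prescribed by the method.

```python
def extract_dscp(tos_byte: int) -> int:
    """Return the 6-bit DSCP from the 8-bit TOS/traffic-class byte.

    The DSCP occupies the 6 most significant bits; the remaining 2 bits
    (ECN) are not used for classification.
    """
    return (tos_byte >> 2) & 0x3F          # value in 0..63


def dscp_to_service_type(dscp: int) -> int:
    """Hypothetical grouping of the 64 DSCP values into the four service
    types defined above (1..4); the value ranges are an assumption made
    for illustration only."""
    class_selector = dscp >> 3             # 0..7, one class per 8 values
    return min(class_selector // 2, 3) + 1  # collapse to service types 1..4


if __name__ == "__main__":
    tos = 0xB8                             # e.g. DSCP 46 (expedited forwarding)
    dscp = extract_dscp(tos)
    print(dscp, dscp_to_service_type(dscp))
```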
For these four types of services, i takes the values 1, 2, 3 and 4 in turn. The eigenvector corresponding to the largest eigenvalue of M_i is normalized, i.e.
ω_j = e_j / (e_1 + e_2 + e_3),  j = 1, 2, 3
where e = (e_1, e_2, e_3) is the principal eigenvector, and each value in the normalized eigenvector is the weight of the corresponding network QoS parameter. In the above four cases the different service types place different requirements on the network parameters, and these differences affect the division of the reward-value weights. Considering the network as a whole, the ultimate goal is to optimize the overall network quality through the protocols the nodes are selected to use, with the reward value being a function strongly correlated with the network state:
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ. Thus, for a particular service B, the reward function R is expressed in terms of the network state V_t and is solved in the next step:
R = f_B(V_t)    (6)
the access of the node will affect the change of the network parameters, and after the action is executed, the network state needs to be measured and corresponding rewards need to be fed back. When the executed action causes the network throughput to be increased, the time delay to be reduced and the signal strength to be enhanced, the action is an effective action; conversely, an invalid action is determined when the executed action causes the network throughput to be reduced, the time delay to be reduced and the signal strength to be reduced. Thus the average throughput a is taken into account when calculating the rewardavgAverage time delay betaavgSignal intensity γ.
The second step: normalize the data on the basis of 1), determine the node service type and determine the reward function.
Different network parameters differ greatly in unit and magnitude, so normalization is needed: all values are transformed linearly and mapped to [0, 1].
Min-max standardization is used to eliminate the influence of unit differences:
f_t(x) = (x − x_min) / (x_max − x_min)    (7)
Normalizing with this equation yields the network's normalized average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ).
Combining the above equations gives the reward function:
R = ω_1·f_t(α)_avg + ω_2·f_t(β)_avg + ω_3·f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network's average throughput, delay and signal strength, given by the normalized eigenvector of the decision matrix.
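A minimal sketch of the second step under assumed values: it derives the weights ω_1, ω_2, ω_3 from an illustrative 3 × 3 decision matrix by normalizing its principal eigenvector (analytic hierarchy process), min-max normalizes sampled throughput, delay and signal-strength measurements, and forms the weighted sum of equation (8). The matrix entries and the sample measurements are assumptions, not values taken from the method.

```python
import numpy as np

def ahp_weights(decision_matrix: np.ndarray) -> np.ndarray:
    """AHP weights: the principal eigenvector of the pairwise comparison
    matrix, normalized so that its components sum to 1."""
    eigvals, eigvecs = np.linalg.eig(decision_matrix)
    principal = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    return principal / principal.sum()

def min_max(x: np.ndarray) -> np.ndarray:
    """Min-max normalization mapping the samples to [0, 1] (equation (7))."""
    return (x - x.min()) / (x.max() - x.min())

def reward(throughput, delay, signal, weights) -> float:
    """Weighted sum of equation (8): each metric is min-max normalized over
    the observation window, averaged, and combined with the AHP weights."""
    terms = [min_max(np.asarray(throughput)).mean(),
             min_max(np.asarray(delay)).mean(),
             min_max(np.asarray(signal)).mean()]
    return float(np.dot(weights, terms))

# Illustrative decision matrix for a delay-sensitive service (assumed values);
# rows/columns correspond to throughput, delay, signal strength.
M = np.array([[1.0, 1/3, 3.0],
              [3.0, 1.0, 5.0],
              [1/3, 1/5, 1.0]])
w = ahp_weights(M)                      # here the delay weight dominates
r = reward([12.0, 15.0, 9.0],           # sampled throughput (Mbit/s)
           [30.0, 22.0, 41.0],          # sampled delay (ms)
           [0.70, 0.85, 0.55],          # sampled normalized signal strength
           w)
print(w, r)
```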
The third step: on the basis of 2), input the data into the DDQN decision network for real-time training and apply the execution result so that the network state tends toward stability.
A major drawback of DQN is that, although the argmax() operation drives the Q values toward the target quickly, it tends to overestimate them, producing a large bias in the learned model. To solve this problem, the error can be reduced by separating the selection of the target action from the calculation of the target Q value. The network information takes discrete states, and DDQN handles discrete-state data well.
With reference to Fig. 2, DQN is implemented with two neural networks, Q-MainNet and Q-target. DDQN likewise operates with two networks, but calculates the target Q value differently.
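The difference in target computation can be illustrated with hypothetical Q-value arrays (a sketch, not the method's implementation): the DQN target both selects and evaluates the next action with the target network, whereas the DDQN target of formula (9) selects the action with Q-MainNet and evaluates it with Q-target.

```python
import numpy as np

gamma = 0.9                                   # decay (discount) coefficient
r_next = 1.0                                  # reward R_{t+1} (assumed value)

# Hypothetical Q values for the actions available in state S_{t+1}.
q_main_next   = np.array([0.7, 1.4, 0.9])     # Q-MainNet(S_{t+1}, ·; θ)
q_target_next = np.array([0.8, 1.1, 1.6])     # Q-target(S_{t+1}, ·; θ⁻)

# DQN: the target network both selects and evaluates the action.
target_dqn = r_next + gamma * q_target_next.max()

# DDQN (formula (9)): Q-MainNet selects the action, Q-target evaluates it,
# which reduces the overestimation caused by the shared argmax.
a_star = int(q_main_next.argmax())
target_ddqn = r_next + gamma * q_target_next[a_star]

print(target_dqn, target_ddqn)
```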
First, the state space S and the action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with random parameters θ, where θ denotes the network parameters: Q-MainNet's θ is set randomly at initialization and Q-target's parameters are set to θ⁻ = θ. With t denoting the current time step, the agent module reads the current network state information S_t and inputs it to the Q-MainNet network, which outputs the Q values of the different actions in state S_t. Following the ε-greedy strategy, the Q-MainNet network randomly selects an action a_t ∈ A with probability ε, or with probability 1 − ε selects the action
a_t = argmax_{a ∈ A} Q(S_t, a; θ)
The terminal executes the corresponding action in the heterogeneous wireless network; the collected network data are processed into the format required by the algorithm and sent to the control layer for processing, yielding the throughput α, delay β and signal strength γ. These are normalized separately, the weights of f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process according to the service type, and the weighted sum gives the reward value R. Q-MainNet combines the system state and reward value through formula (9),
TargetQ = R_{t+1} + γ·Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)    (9)
to calculate the target value, where R_{t+1} is the reward computed in state S_{t+1} and γ is the decay coefficient: the value of the agent in the current state is in fact the sum of all possible future rewards discounted back to the present moment. After the action is executed, the system enters the next state S_{t+1}.
The Q-MainNet network stores the memory tuple (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward r_t and the network state at time t+1, in an experience pool. At each step the Q-target network samples randomly from this pool and, together with the output of the Q-MainNet network, computes the loss between the two Q networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a; θ_t))², and a gradient-descent step is executed. Every G steps the parameters of the Q-MainNet network are copied to the Q-target network, and training continues in this loop.
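A possible shape of this training loop is sketched below under simplifying assumptions: the environment interface HeterogeneousNetEnv, the state dimension, the layer sizes and the hyperparameters are all hypothetical, and small fully connected networks stand in for Q-MainNet and Q-target. It follows the flow described above: ε-greedy action selection, storage of (s_t, a_t, r_t, s_{t+1}) in an experience pool, computation of the formula-(9) target, a gradient-descent step on (TargetQ − Q)², and copying of the parameters to Q-target every G steps.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3        # assumed: (T, D, P, W) state, 3 protocols
GAMMA, EPSILON, G_SYNC = 0.9, 0.1, 50


def build_net() -> nn.Module:
    """Small fully connected Q network (layer sizes are assumptions)."""
    return nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                         nn.Linear(32, N_ACTIONS))


q_main, q_target = build_net(), build_net()
q_target.load_state_dict(q_main.state_dict())      # θ⁻ ← θ at initialization
optimizer = torch.optim.Adam(q_main.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                       # experience pool


def select_action(state: np.ndarray) -> int:
    """ε-greedy: random action with probability ε, otherwise argmax_a Q(s, a; θ)."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_main(torch.as_tensor(state, dtype=torch.float32)).argmax())


def train_step(batch_size: int = 32) -> None:
    """One DDQN update: sample from the pool, build the formula-(9) target,
    and minimize (TargetQ − Q(s, a; θ))² by gradient descent."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = map(np.array, zip(*random.sample(replay, batch_size)))
    s, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in (s, s2))
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)

    with torch.no_grad():
        a_star = q_main(s2).argmax(dim=1)                        # select with Q-MainNet
        target = r + GAMMA * q_target(s2).gather(1, a_star.unsqueeze(1)).squeeze(1)
    q_sa = q_main(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; θ)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Interaction loop skeleton; HeterogeneousNetEnv is a hypothetical environment
# exposing reset()/step(action) and returning the reward of equation (8).
# env = HeterogeneousNetEnv()
# state = env.reset()
# for step in range(10_000):
#     action = select_action(state)
#     next_state, reward = env.step(action)
#     replay.append((state, action, reward, next_state))
#     train_step()
#     if step % G_SYNC == 0:
#         q_target.load_state_dict(q_main.state_dict())          # sync every G steps
#     state = next_state
```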

Claims (1)

1. A DDQN-based wireless network protocol self-selection method, characterized by comprising the following steps:
the first step: acquiring the current network environment quality parameters and the node service type in real time through an environment agent module, and determining the states, actions and reward values;
state space definition: the state space S of a terminal at time t is defined such that s_mn ∈ S represents the state of terminal m when it accesses the n-th network and exchanges information in that network; the state space is:
S = {s_1, s_2, …, s_mn}    (1)
state definition: the network state is described by the average throughput T, delay D, signal strength P and node distance W, and the network quality Φ is then expressed as:
Φ = T × D × P × W    (2)
action space definition: an action space needs to be set for the agent to select from, defined as:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that the node uses the n-th network protocol;
the access service network parameters consist of QoS parameters; a decision matrix is established for the network QoS and the parameter weights are solved; the decision matrix takes the form
M_i = (m_jk),  j, k = 1, 2, 3
where each element m_jk represents the relative importance of one QoS parameter over another, as defined in Table 1, and the decision matrix satisfies m_jk > 0, m_kj = 1/m_jk and m_jj = 1;
Table 1 (relationship of attributes to parameters) defines the importance scale for the pairwise comparisons, the odd ratings 1, 3, 5, 7 and 9 denoting successively greater relative importance; 2, 4, 6 and 8, which are not shown in the table, represent intermediate values between adjacent judgments; since four service types are distinguished when defining the reward value and three attributes (throughput, delay and signal strength) are considered, each decision matrix is a 3 × 3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denotes service types 1 to 4, and a decision matrix is then established for each of the four services according to its QoS-parameter requirements;
according to the current network service classification standard RFC 2474, the attribute values within a service class are determined by the DSCP; the DSCP encodes the IP priority using the 6 used bits and 2 unused bits of the type-of-service (TOS) byte in each IP packet header; this field can be used for traffic classification, the larger the value the higher the priority; values range from 0 to 63, so 64 classes can be matched, which are grouped by value into service classes; the relation between service attributes and parameters can therefore be determined from the DSCP field carried in the IP packet;
for the four types of services, i takes the values 1, 2, 3 and 4 in turn; the eigenvector corresponding to the largest eigenvalue of M_i is normalized, i.e.
ω_j = e_j / (e_1 + e_2 + e_3),  j = 1, 2, 3, where e is the principal eigenvector;
each value in the normalized eigenvector is the weight of the corresponding network QoS parameter; in the above four cases the different service types place different requirements on the network parameters, and these differences affect the division of the reward-value weights; considering the network as a whole, the ultimate goal is to optimize the overall network quality through the protocols the nodes are selected to use, with the reward value being a function strongly correlated with the network state;
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ; thus, for a particular service B, the reward function R is expressed in terms of the network state V_t and is solved in the next step:
R = f_B(V_t)    (6)
the access of a node affects the network parameters; after an action is executed, the network state is measured and the corresponding reward is fed back; when the executed action increases the network throughput, reduces the delay and strengthens the signal, it is an effective action; conversely, when the executed action reduces the network throughput, increases the delay and weakens the signal, it is an invalid action; the average throughput α_avg, the average delay β_avg and the signal strength γ are therefore taken into account when calculating the reward;
the second step: normalizing the data on the basis of 1), determining the node service type and determining the reward function;
min-max standardization is used to eliminate the influence of unit differences:
f_t(x) = (x − x_min) / (x_max − x_min)    (7)
normalizing with this equation yields the network's normalized average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ);
the reward function is obtained by combining the above equations:
R = ω_1·f_t(α)_avg + ω_2·f_t(β)_avg + ω_3·f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network's average throughput, delay and signal strength, given by the normalized eigenvector of the decision matrix;
the third step: on the basis of 2), inputting the data into the DDQN decision network for real-time training and applying the execution result so that the network state tends toward stability;
first, the state space S and the action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with random parameters θ, where θ denotes the network parameters: Q-MainNet's θ is set randomly at initialization and Q-target's parameters are set to θ⁻ = θ; with t denoting the current time step, the agent module reads the current network state information S_t and inputs it to the Q-MainNet network, which outputs the Q values of the different actions in state S_t; following the ε-greedy strategy, the Q-MainNet network randomly selects an action a_t ∈ A with probability ε, or with probability 1 − ε selects the action
a_t = argmax_{a ∈ A} Q(S_t, a; θ);
the terminal executes the corresponding action in the heterogeneous wireless network; the collected network data are processed into the format required by the algorithm and sent to the control layer for processing, yielding the throughput α, delay β and signal strength γ; these are normalized separately, the weights of f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process according to the service type, and the weighted sum gives the reward value R; Q-MainNet combines the system state and reward value through formula (9),
TargetQ = R_{t+1} + γ·Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)    (9)
to calculate the target value, where R_{t+1} is the reward computed in state S_{t+1} and γ is the decay coefficient: the value of the agent in the current state is in fact the sum of all possible future rewards discounted back to the present moment; after the action is executed, the system enters the next state S_{t+1};
the Q-MainNet network stores the memory tuple (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward r_t and the network state at time t+1, in an experience pool; at each step the Q-target network samples randomly from this pool and, together with the output of the Q-MainNet network, computes the loss between the two Q networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a; θ_t))², and a gradient-descent step is executed; after each iteration, the parameters of the Q-MainNet network are copied to the Q-target network; training continues in this loop.
CN202110249773.3A 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN Active CN113055229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110249773.3A CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110249773.3A CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Publications (2)

Publication Number Publication Date
CN113055229A true CN113055229A (en) 2021-06-29
CN113055229B CN113055229B (en) 2023-10-27

Family

ID=76510598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110249773.3A Active CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Country Status (1)

Country Link
CN (1) CN113055229B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327556A (en) * 2013-07-04 2013-09-25 中国人民解放军理工大学通信工程学院 Dynamic network selection method for optimizing quality of experience (QoE) of user in heterogeneous wireless network
CN105208624A (en) * 2015-08-27 2015-12-30 重庆邮电大学 Service-based multi-access network selection system and method in heterogeneous wireless network
CN107889195A (en) * 2017-11-16 2018-04-06 电子科技大学 A kind of self study heterogeneous wireless network access selection method of differentiated service
WO2021013368A1 (en) * 2019-07-25 2021-01-28 Telefonaktiebolaget Lm Ericsson (Publ) Machine learning based adaption of qoe control policy
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张媛媛等: "一种新的面向用户多业务QoS需求的网络接入选择算法", 《计算机科学》, vol. 42, no. 3, 31 March 2015 (2015-03-31) *
马礼等: "基于Markov模型的接入网络选择算法", 《计算机工程》, vol. 45, no. 5, 31 May 2019 (2019-05-31) *

Also Published As

Publication number Publication date
CN113055229B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2021233053A1 (en) Computing offloading method and communication apparatus
Sun et al. Autonomous resource slicing for virtualized vehicular networks with D2D communications based on deep reinforcement learning
CN111866954B (en) User selection and resource allocation method based on federal learning
CN111510879B (en) Heterogeneous Internet of vehicles network selection method and system based on multi-constraint utility function
CN106534333A (en) Bidirectional selection computing unloading method based on MEC and MCC
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
WO2019184836A1 (en) Data analysis device, and multi-model co-decision system and method
Zhu et al. Adaptive multi-access algorithm for multi-service edge users in 5G ultra-dense heterogeneous networks
Jiang et al. A decoupled learning strategy for massive access optimization in cellular IoT networks
CN107889195B (en) Self-learning heterogeneous wireless network access selection method for distinguishing services
Sekaran et al. 5G integrated spectrum selection and spectrum access using AI-based frame work for IoT based sensor networks
CN114465945B (en) SDN-based identification analysis network construction method
CN113727420B (en) Multimode access network selection device and method
CN113676357B (en) Decision method for edge data processing in power internet of things and application thereof
CN114726743A (en) Service function chain deployment method based on federal reinforcement learning
Rao et al. Network selection in heterogeneous environment: A step toward always best connected and served
Wu et al. Link congestion prediction using machine learning for software-defined-network data plane
CN112804702B (en) Multi-link air-ground data exchange link performance evaluation method based on utility function
CN112508408A (en) Mapping model construction method of wireless resource management index under edge calculation
CN113055229A (en) Wireless network self-selection protocol method based on DDQN
CN113543160A (en) 5G slice resource allocation method and device, computing equipment and computer storage medium
CN106535226A (en) Multi-network cooperative selection and aggregation method based on synergetic theory
CN112260941B (en) Heterogeneous network data fusion method
CN114449536A (en) 5G ultra-dense network multi-user access selection method based on deep reinforcement learning
CN111723978A (en) Index evaluation method for adapting to difference demands of various power services based on virtual mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant