CN113055229A - Wireless network self-selection protocol method based on DDQN
- Publication number
- CN113055229A CN113055229A CN202110249773.3A CN202110249773A CN113055229A CN 113055229 A CN113055229 A CN 113055229A CN 202110249773 A CN202110249773 A CN 202110249773A CN 113055229 A CN113055229 A CN 113055229A
- Authority
- CN
- China
- Prior art keywords
- network
- state
- action
- reward
- service
- Prior art date
- 2021-03-05
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W36/00—Hand-off or reselection arrangements
- H04W36/0005—Control or signalling for completing the hand-off
- H04W36/0055—Transmission or use of information for re-establishing the radio link
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W36/00—Hand-off or reselection arrangements
- H04W36/0005—Control or signalling for completing the hand-off
- H04W36/0083—Determination of parameters used for hand-off, e.g. generation or modification of neighbour cell lists
- H04W36/00837—Determination of triggering parameters for hand-off
- H04W36/008375—Determination of triggering parameters for hand-off based on historical data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W36/00—Hand-off or reselection arrangements
- H04W36/24—Reselection being triggered by specific parameters
- H04W36/30—Reselection being triggered by specific parameters by measured or perceived connection quality data
Abstract
The invention relates to a DDQN-based wireless network protocol self-selection method, aimed at complex wireless network environments in which multiple protocols converge. The method comprises the following steps: 1) acquiring the current network-environment quality parameters and determining the node service type in real time through an environment agent module; 2) on the basis of 1), performing noise reduction and normalization on the data, determining the node service type through the analytic hierarchy process, and performing feature extraction; 3) on the basis of 2), inputting the data into the DDQN decision network for real-time training, and applying the execution result so that the network state tends to be stable. The invention performs feature extraction directly without preprocessing the data, takes the obtained historical data as training data, and exploits the strengths of deep learning to effectively improve the learning speed and decision performance of the reinforcement learning algorithm.
Description
Technical Field
The invention relates to a network protocol self-selection method for heterogeneous wireless networks, aimed at complex wireless network environments in which multiple protocols converge.
Background
With the continuous development of network technologies, the technologies widely deployed around the world now overlap to a great extent. WLAN and cellular networks are the most common heterogeneous combination in today's network environment and play an important role in modern information communication; operators also deploy WLAN hotspots in user-dense areas such as shopping malls, schools, and office buildings to relieve the load on their cellular networks.
The next-generation heterogeneous network integrates multiple protocols in a complex environment and must provide reliable network services to users anytime and anywhere. Before this can be achieved, however, the network environment needs to mature: wireless network coverage, network self-configuration, automatic management of network devices, and similar functions must be solved. In the existing network environment it is difficult for a single network protocol to accomplish all of this, but comprehensive resource scheduling across today's heterogeneous networks can be realized through algorithms, and efficient switching among heterogeneous network resources is gradually becoming a research hotspot. Further development of wireless communication will place additional requirements on the scalability and flexibility of heterogeneous networks.
Reinforcement learning is a tool that can make decisions meeting environmental requirements under uncertainty and can adjust specifically to the dynamic changes of the network, so a heterogeneous wireless network can automatically adapt to changes in the user's scenario and the network environment can be optimized. Reinforcement learning is a branch of machine learning: by continuously adjusting an agent within an environment, it ultimately maximizes a specific index (the reward). In a wireless network, node movement and mutual interference among nodes make the environment complex; compared with traditional machine-learning algorithms, reinforcement learning has higher potential and accuracy. Feature extraction is carried out directly without preprocessing the data, the obtained historical data serve as training data, and the strengths of deep learning effectively improve the learning speed and decision performance of the reinforcement-learning algorithm.
Disclosure of Invention
In view of the above characteristics of existing networks, the present invention provides a wireless network protocol self-selection method based on DDQN (Deep Reinforcement Learning with Double Q-Learning). The method comprises: a processing scheme for network quality data; a feature extraction scheme based on deep learning; and a network protocol selection scheme based on DDQN. The purpose of the invention is realized by the following technical solution.
A DDQN-based wireless network protocol self-selection method comprises the following steps:
1) acquiring a current network environment quality parameter and determining a node service type in real time through an environment agent module;
2) carrying out noise reduction and normalization processing on data on the basis of 1), determining the node service type through an analytic hierarchy process, and carrying out feature extraction;
3) on the basis of 2), inputting the data into the DDQN decision network for real-time training, and applying the execution result so that the network state tends to be stable.
Drawings
FIG. 1 is an overall flow diagram of the DDQN-based wireless network protocol self-selection method;
FIG. 2 is an operation diagram of the DDQN algorithm.
Detailed Description
The specific steps of the method for selecting a protocol based on DDQN for a wireless network implemented according to the present invention are described below with reference to fig. 1 as follows:
The first step is as follows: acquiring the current network-environment quality parameters and the node service type in real time through an environment agent module, and determining the states, actions, and reward values;
To use the reinforcement learning algorithm, the state, action, and reward value must be defined; the network quality parameters serve as the state-value input.
State space definition: the state space S of a terminal at time t is defined as SmnBelongs to S and represents a terminalAnd m is accessed to the nth network and carries out information interaction in the network. The state space is:
S = {s_1, s_2, …, s_mn}  (1)
State definition: in a heterogeneous network, the network service state is generally described by throughput, delay, packet loss rate, network load, and the like, while user characteristics are described by signal strength, node distance, node power consumption, cost, and signal-to-noise ratio. Here the network state is described by the average throughput T, delay D, signal strength P, and node distance W, so the network quality Φ may be expressed as:
Φ=T×D×P×W (2)
Action space definition: an action space needs to be set for the agent to select from; it is defined as:
A = {a_1, a_2, …, a_n}  (3)
where a_n indicates that the node uses the nth network protocol.
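As an illustration of the state and action definitions above, the following minimal Python sketch represents the state s_mn by the four parameters T, D, P, W and the action space by a list of candidate protocols; the protocol names and sample values are assumptions for illustration, not part of the patent.

```python
from dataclasses import dataclass

# Candidate protocols: action a_n means "the node uses the nth protocol".
# The concrete protocol list is an assumption for illustration.
PROTOCOLS = ["wlan", "cellular", "ad_hoc"]
ACTIONS = list(range(len(PROTOCOLS)))  # A = {a_1, ..., a_n}

@dataclass
class NetState:
    """State s_mn of terminal m attached to the nth network."""
    throughput: float  # average throughput T
    delay: float       # delay D
    signal: float      # signal strength P
    distance: float    # node distance W

    def quality(self) -> float:
        """Network quality Phi = T x D x P x W, Eq. (2)."""
        return self.throughput * self.delay * self.signal * self.distance

s = NetState(throughput=42.0, delay=0.8, signal=0.7, distance=0.9)
print(s.quality(), ACTIONS)
```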
Reward value definition: each node is created with the characteristics of its own specific service and has its own service type, so even in the same network environment different nodes have correspondingly different reward values. Combining actual requirements, the node service types are divided into the following categories:
1. High real-time requirement: the time delay should be as low as possible and the transmission rate high; excessive delay will affect the service. A certain throughput is also required to ensure data reliability.
2. Extremely high throughput requirement: the real-time requirement is weaker than for service 1, but a large data flow is required.
3. High delay requirement: network traffic in emergencies must be handled, so delay is reduced as much as possible to improve user experience.
4. Only sufficient throughput needs to be guaranteed.
The access-service network parameters consist of QoS parameters; a decision matrix is established for the network QoS and the parameter weights are solved. The decision matrix has the form

M = [ 1, m_12, m_13; 1/m_12, 1, m_23; 1/m_13, 1/m_23, 1 ]  (4)

where each element m_ij represents the relative importance of one QoS parameter over another, defined specifically in Table 1; the decision matrix must satisfy m_ij > 0, m_ji = 1/m_ij, and m_ii = 1.
TABLE 1 Relationship of properties to parameters

m_ij | Meaning
---|---
1 | the two parameters are equally important
3 | one parameter is slightly more important than the other
5 | one parameter is obviously more important than the other
7 | one parameter is strongly more important than the other
9 | one parameter is extremely more important than the other
The values 2, 4, 6, and 8, not shown in Table 1, indicate intermediate values between adjacent judgments. Since the service types are divided into 4 classes when defining the reward value and three attributes (throughput, time delay, and signal strength) are considered, the decision matrix is defined as a 3 × 3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denotes the four service types; a decision matrix is then established for each of the four services according to the requirements of its QoS parameters.
According to the current network service type classification standard RFC 2474, attribute values within a service class are determined by the DSCP (Differentiated Services Code Point). The DSCP encodes the IP priority in the 6 used bits of the TOS (type of service) byte of each IP packet header, with the remaining 2 bits unused. This field can be applied to traffic classification: the larger the value, the higher the priority; values range from 0 to 63, so 64 classes can be matched, which are grouped into service classes by value. The relation between service attributes and parameters can therefore be determined from the DSCP field carried in the IP packet.
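The DSCP extraction just described can be sketched as follows. The bit layout follows RFC 2474; the mapping from code points onto the four service types is a hypothetical choice, since the text does not spell it out.

```python
def dscp_of(tos_byte: int) -> int:
    """DSCP is the upper 6 bits of the 8-bit TOS/DS byte (RFC 2474)."""
    return (tos_byte >> 2) & 0x3F

def service_type(dscp: int) -> int:
    """Hypothetical mapping of the 64 code points onto the four service
    types used here (16 code points per type) -- an assumption only."""
    return min(dscp // 16, 3) + 1

# TOS byte 0xB8 carries DSCP 46 (Expedited Forwarding, a low-delay class).
assert dscp_of(0xB8) == 46
print(service_type(dscp_of(0xB8)))  # -> 3 under the assumed mapping
```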
For these four types of services, i takes the values 1, 2, 3, and 4 in turn. The eigenvector corresponding to the largest eigenvalue of each decision matrix is normalized so that its components sum to 1; each value in the normalized eigenvector is then the weight of the corresponding network QoS parameter. The four cases give rise to differences in the network-parameter requirements of the different service types, and these differences in turn affect the division of the reward value weights. Considering the whole network as a whole, the ultimate goal is to optimize the overall network quality through the nodes' protocol selection, the reward value being a function strongly correlated with the network state.
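The weight derivation by the analytic hierarchy process might look as follows; the pairwise-comparison values in M are illustrative assumptions for a delay-sensitive service, not the patent's actual matrices.

```python
import numpy as np

# Illustrative 3x3 decision matrix M_i over (throughput, delay, signal);
# entries satisfy m_ij > 0, m_ji = 1/m_ij, m_ii = 1. Values are assumed.
M = np.array([
    [1.0, 1/3, 3.0],
    [3.0, 1.0, 5.0],   # delay judged most important for this service type
    [1/3, 1/5, 1.0],
])

eigvals, eigvecs = np.linalg.eig(M)
u = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # principal eigenvector
w = u / u.sum()  # normalize so the weights sum to 1
print(dict(zip(("throughput", "delay", "signal"), np.round(w, 3))))
# w provides (omega_1, omega_2, omega_3) used in the reward of Eq. (8)
```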
V_t = {v_1, v_2, …, v_n}  (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ. Thus, for a particular service B, the reward function R is expressed in terms of the network state V_t and is solved in the next step:
R = f_B(V_t)  (6)
Node access affects the network parameters; after an action is executed, the network state must be measured and the corresponding reward fed back. When the executed action causes the network throughput to increase, the time delay to decrease, and the signal strength to strengthen, it is an effective action; conversely, when the executed action causes the network throughput to decrease, the time delay to increase, and the signal strength to weaken, it is an invalid action. Therefore the average throughput α_avg, the average time delay β_avg, and the signal strength γ are taken into account when calculating the reward.
The second step is as follows: on the basis of 1), normalizing the data, determining the node service type, and determining the reward function;
Different network parameters differ greatly in unit and magnitude, so normalization is needed: all values are linearly transformed and mapped into [0, 1].
Min-max standardization is used to eliminate the influence of unit differences:

f_t(x) = (x − x_min) / (x_max − x_min)  (7)
Normalization with this equation yields the network's normalized average throughput f_t(α)_avg, average time delay f_t(β)_avg, and signal strength f_t(γ).
Combining the above equations yields the reward function:
R = ω_1·f_t(α)_avg + ω_2·f_t(β)_avg + ω_3·f_t(γ)  (8)
where ω_1, ω_2, and ω_3 are the weights of the network's average throughput, time delay, and signal strength, given by the eigenvector obtained after normalization of the decision matrix.
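A minimal sketch of Eq. (7) and Eq. (8) under assumed measurements, value ranges, and weights; inverting the delay score (lower delay gives a higher score) is an assumption consistent with the reward definition above.

```python
def min_max(x: float, x_min: float, x_max: float) -> float:
    """Min-max standardization, Eq. (7): map x linearly into [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Assumed measurements and value ranges, for illustration only.
f_alpha = min_max(42.0, 0.0, 100.0)        # average throughput, Mbit/s
f_beta = 1.0 - min_max(30.0, 0.0, 200.0)   # delay, ms (inverted: lower is better)
f_gamma = min_max(-60.0, -90.0, -30.0)     # signal strength, dBm

w1, w2, w3 = 0.26, 0.63, 0.11              # assumed AHP weights omega_1..omega_3
R = w1 * f_alpha + w2 * f_beta + w3 * f_gamma  # reward, Eq. (8)
print(round(R, 3))
```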
The third step: and 2) inputting data into the DDQN decision network for real-time training on the basis of the data, and applying an execution result to enable the network state to tend to be stable.
One of the biggest disadvantages of DQN is that although the argmax() operation quickly drives the Q values toward the target, it is likely to cause overestimation, i.e. a large deviation in the resulting model. To solve this problem, the error can be reduced by separating the selection of the target action from the evaluation of the target Q value. The network information here is in a discrete state, and DDQN handles discrete-state data well.
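For reference, the two target calculations can be written side by side in the standard form from the DQN/DDQN literature, with θ the online (Q-MainNet) parameters and θ⁻ the target-network parameters:

```latex
Y_t^{\mathrm{DQN}}  = R_{t+1} + \gamma \max_{a} Q\left(S_{t+1}, a; \theta_t^{-}\right), \qquad
Y_t^{\mathrm{DDQN}} = R_{t+1} + \gamma\, Q\left(S_{t+1}, \operatorname*{arg\,max}_{a} Q(S_{t+1}, a; \theta_t); \theta_t^{-}\right)
```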
With reference to FIG. 2, DQN is implemented with two neural networks, Q-MainNet and Q-target. DDQN likewise operates with two networks, but calculates the target Q value differently.
Firstly, initializing a state S and an action space A, initializing a Q matrix to be a zero matrix, initializing a Q-MainNet network and a Q-target network by using a random parameter theta, wherein the theta is a network parameter, randomly setting the Q-MainNet theta during initialization, and setting the Q-target theta-T represents the current time state, and the intelligent module reads the current network state information S)It is input to the Q-MainNet network, at StAnd outputting the Q values of different actions in the state through the Q-MainNet network. According to the epsilon-greedy strategy, the Q-MainNet network randomly selects an action a according to the probability epsilontE.g. A, or select an action with a probability of 1-epsilon The terminal executes corresponding actions in the heterogeneous wireless network, processes the actions into a format required by the algorithm after acquiring network data, and sends the format to the control layer for processing. Thereby obtaining the throughput alpha, the time delay beta and the signal strength gamma. They are then normalized separately. F is obtained by analytic hierarchy process according to service typet(α)avg、ft(β)avg、ftThe weight of (gamma) and the weighted sum then yields the reward value R. Q-MainNet obtains the system status and the reward value through the formula (9)
to perform the target value calculation, where R_{t+1} is the reward computed for state S_{t+1} and γ is the attenuation coefficient; the reward value of the agent in the current state is in fact all possible future reward values discounted back to the present moment. After the action is executed, the system enters the next state S_{t+1}.
The Q-MainNet network stores the memory tuple (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward value r_t, and the network state at time t + 1, in an experience pool. At each step the Q-target network randomly samples from this pool and, together with the output of the Q-MainNet network, computes the loss between the two Q networks with respect to the parameter θ, i.e. (TargetQ − Q(S_t, a_t; θ_t))², and a gradient descent step is executed. Every G steps the parameters of the Q-MainNet network are copied to the Q-target network, and training continues in this loop.
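The training loop described above can be sketched in PyTorch as follows; the layer sizes, optimizer, and hyper-parameter values are illustrative assumptions, while the action-selection/evaluation split and the periodic parameter copy follow the description.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a network-state vector to one Q value per protocol action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

STATE_DIM, N_ACTIONS, GAMMA, EPS, G = 4, 3, 0.9, 0.1, 100  # assumed values
q_main = QNet(STATE_DIM, N_ACTIONS)            # Q-MainNet, parameters theta
q_target = QNet(STATE_DIM, N_ACTIONS)          # Q-target, parameters theta-
q_target.load_state_dict(q_main.state_dict())  # theta- = theta at initialization
optimizer = torch.optim.Adam(q_main.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                  # experience pool

def select_action(state):
    """epsilon-greedy over Q-MainNet's outputs."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return q_main(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

def ddqn_update(step: int, batch_size: int = 32) -> None:
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)  # (s_t, a_t, r_t, s_{t+1}) tuples
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
    with torch.no_grad():
        a_star = q_main(s2).argmax(dim=1)      # Q-MainNet selects the action...
        target = r + GAMMA * q_target(s2).gather(  # ...Q-target evaluates it, Eq. (9)
            1, a_star.unsqueeze(1)).squeeze(1)
    q_sa = q_main(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)    # (TargetQ - Q(s_t, a_t; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % G == 0:                          # copy theta -> theta- every G steps
        q_target.load_state_dict(q_main.state_dict())
```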
Claims (1)
1. A DDQN-based wireless network protocol self-selection method, characterized by comprising the following steps:
the first step is as follows: acquiring the current network-environment quality parameters and the node service type in real time through an environment agent module, and determining the states, actions, and reward values;
state space definition: the state space S of a terminal at time t is defined such that s_mn ∈ S represents the state of terminal m when it has accessed the nth network and is exchanging information in that network; the state space is:
S = {s_1, s_2, …, s_mn}  (1)
state definition: the network state is described by the average throughput T, delay D, signal strength P, and node distance W; the network quality Φ is then expressed as:
Φ=T×D×P×W (2)
action space definition: an action space needs to be set for the agent to select from; it is defined as:
A = {a_1, a_2, …, a_n}  (3)
where a_n indicates that the node uses the nth network protocol;
the access-service network parameters consist of QoS parameters; a decision matrix is established for the network QoS and the parameter weights are solved: each element m_ij of the decision matrix, as shown in formula (4), represents the relative importance of one QoS parameter over another, as defined in Table 1, and the matrix satisfies m_ij > 0, m_ji = 1/m_ij, m_ii = 1;
the values 2, 4, 6, and 8, not shown in the table, indicate intermediate values between adjacent judgments; since the service types are divided into 4 classes when defining the reward value and three attributes (throughput, time delay, and signal strength) are considered, the decision matrix is defined as a 3 × 3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denotes the four service types; a decision matrix is then established for each of the four services according to the requirements of its QoS parameters;
according to the current network service type classification standard RFC 2474, attribute values within a service class are determined by the DSCP; the DSCP encodes the IP priority in the 6 used bits of the TOS byte of each IP packet header, with the remaining 2 bits unused; this field can be applied to traffic classification: the larger the value, the higher the priority; values range from 0 to 63, so 64 classes can be matched, which are grouped into service classes by value; the relation between service attributes and parameters can thus be determined from the DSCP field carried in the IP packet;
for the four types of services, i takes the values 1, 2, 3, and 4 in turn; the eigenvector corresponding to the largest eigenvalue is normalized so that its components sum to 1, and each value in the normalized eigenvector is the weight of the corresponding network QoS parameter; the four cases give rise to differences in the network-parameter requirements of the different service types, and these differences affect the division of the reward value weights; considering the whole network as a whole, the final goal is to optimize the overall network quality through the nodes' protocol selection, the reward value being a function strongly correlated with the network state;
V_t = {v_1, v_2, …, v_n}  (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ; thus, for a particular service B, the reward function R is expressed in terms of the network state V_t and is solved in the next step:
R = f_B(V_t)  (6)
node access affects the network parameters; after an action is executed, the network state must be measured and the corresponding reward fed back; when the executed action causes the network throughput to increase, the time delay to decrease, and the signal strength to strengthen, it is an effective action; conversely, when the executed action causes the network throughput to decrease, the time delay to increase, and the signal strength to weaken, it is an invalid action; therefore the average throughput α_avg, the average time delay β_avg, and the signal strength γ are taken into account when calculating the reward;
the second step is as follows: on the basis of 1), normalizing the data, determining the node service type, and determining the reward function;
min-max standardization is used to eliminate the influence of unit differences:

f_t(x) = (x − x_min) / (x_max − x_min)  (7)
the normalized average throughput f of the network is obtained by normalization with the equationt(α)avgAverage time delay ft(β)avgSignal strength ft(γ);
the reward function is obtained by combining the above formulas:
R = ω_1·f_t(α)_avg + ω_2·f_t(β)_avg + ω_3·f_t(γ)  (8)
where ω_1, ω_2, and ω_3 are the weights of the network's average throughput, time delay, and signal strength, given by the eigenvector obtained after normalization of the decision matrix;
the third step: on the basis of 2), inputting the data into the DDQN decision network for real-time training, and applying the execution result so that the network state tends to be stable;
first, the state S and action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with a random parameter θ, where θ is the network parameter: Q-MainNet's θ is set randomly at initialization and Q-target's θ⁻ is set equal to θ; t denotes the current time; the agent module reads the current network state information S_t and inputs it to the Q-MainNet network, which outputs the Q values of the different actions in state S_t; according to the ε-greedy strategy, the Q-MainNet network randomly selects an action a_t ∈ A with probability ε, or selects the greedy action a_t = argmax_a Q(S_t, a; θ) with probability 1 − ε; the terminal executes the corresponding action in the heterogeneous wireless network; the collected network data are processed into the format required by the algorithm and sent to the control layer for processing, thereby obtaining the throughput α, time delay β, and signal strength γ; these are then normalized separately; the weights of f_t(α)_avg, f_t(β)_avg, and f_t(γ) are obtained by the analytic hierarchy process according to the service type, and the weighted sum yields the reward value R; Q-MainNet obtains the system state and reward value through formula (9):

TargetQ = R_{t+1} + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)  (9)
to perform the target value calculation, where R_{t+1} is the reward computed for state S_{t+1} and γ is the attenuation coefficient; the reward value of the agent in the current state is in fact all possible future reward values discounted back to the present moment; after the action is executed, the system enters the next state S_{t+1};
the Q-MainNet network stores the memory tuple (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward value r_t, and the network state at time t + 1, in an experience pool; at each step the Q-target network randomly samples from this pool and, together with the output of the Q-MainNet network, computes the loss between the two Q networks with respect to the parameter θ, i.e. (TargetQ − Q(S_t, a_t; θ_t))², and a gradient descent step is executed; after each iteration, the parameters of the Q-MainNet network are copied to the Q-target network; training continues in this loop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110249773.3A CN113055229B (en) | 2021-03-05 | 2021-03-05 | Wireless network self-selection protocol method based on DDQN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113055229A true CN113055229A (en) | 2021-06-29 |
CN113055229B CN113055229B (en) | 2023-10-27 |
Family
ID=76510598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110249773.3A Active CN113055229B (en) | 2021-03-05 | 2021-03-05 | Wireless network self-selection protocol method based on DDQN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113055229B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103327556A (en) * | 2013-07-04 | 2013-09-25 | 中国人民解放军理工大学通信工程学院 | Dynamic network selection method for optimizing quality of experience (QoE) of user in heterogeneous wireless network |
CN105208624A (en) * | 2015-08-27 | 2015-12-30 | 重庆邮电大学 | Service-based multi-access network selection system and method in heterogeneous wireless network |
CN107889195A (en) * | 2017-11-16 | 2018-04-06 | 电子科技大学 | A kind of self study heterogeneous wireless network access selection method of differentiated service |
WO2021013368A1 (en) * | 2019-07-25 | 2021-01-28 | Telefonaktiebolaget Lm Ericsson (Publ) | Machine learning based adaption of qoe control policy |
CN110809306A (en) * | 2019-11-04 | 2020-02-18 | 电子科技大学 | Terminal access selection method based on deep reinforcement learning |
Non-Patent Citations (2)

Title |
---|
Zhang Yuanyuan et al., "A New Network Access Selection Algorithm for Users' Multi-Service QoS Requirements", Computer Science, vol. 42, no. 3, 31 March 2015 * |
Ma Li et al., "Access Network Selection Algorithm Based on Markov Model", Computer Engineering, vol. 45, no. 5, 31 May 2019 * |
Also Published As
Publication number | Publication date |
---|---|
CN113055229B (en) | 2023-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||