CN114449536A - 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Info

Publication number
CN114449536A
Authority
CN
China
Prior art keywords
network
attribute
user
ultra
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210103469.2A
Other languages
Chinese (zh)
Other versions
CN114449536B (en)
Inventor
唐宏
刘小洁
陈榕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210103469.2A
Publication of CN114449536A
Application granted
Publication of CN114449536B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W16/00: Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22: Traffic simulation tools or models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/02: Arrangements for optimising operational condition
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W28/00: Network traffic management; Network resource management
    • H04W28/02: Traffic management, e.g. flow control or congestion control
    • H04W28/0289: Congestion control
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W76/00: Connection management
    • H04W76/10: Connection setup
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention relates to the technical field of machine learning and communications, and in particular to a 5G ultra-dense network multi-user access selection method based on deep reinforcement learning, which comprises the following steps: setting thresholds according to the minimum tolerance of users for each network attribute, and excluding networks that do not fully meet the requirements; determining the comprehensive weights of the network attributes using the analytic hierarchy process and the entropy weight method; obtaining weighted gray correlation coefficients by combining the gray correlation analysis method with the comprehensive network attribute weights; modeling the multi-user access problem as a multi-agent stochastic game model; and solving the optimal strategy with the deep deterministic policy gradient method. On the premise of guaranteeing the individual requirements of users with different services, the method can improve system throughput, reduce user congestion, and reduce latency.

Description

5G ultra-dense network multi-user access selection method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of machine learning and communication, in particular to a 5G ultra-dense network multi-user access selection method based on deep reinforcement learning.
Background
The 5G ultra-dense network architecture integrates different radio access technologies (RATs) and will support large-scale communication connections for massive internet-of-things devices. However, with the rapid growth of internet-of-things devices, increasingly personalized service requirements, and the ever-deepening heterogeneity of the network system, designing an intelligent network access selection scheme for user equipment has become a key challenge for 5G ultra-dense networks.
Researchers in this field have proposed a variety of access selection algorithms, which fall mainly into four categories: 1) single-parameter access algorithms, chiefly the early algorithms based on received signal strength (RSS); 2) access algorithms based on multi-attribute decision making (MADM), the more common of which include simple additive weighting (SAW), the multiplicative exponential weighting method (MEW), the analytic hierarchy process (AHP), the technique for order preference by similarity to an ideal solution (TOPSIS), and the gray correlation analysis method (GRA); 3) access algorithms based on intelligent algorithms, mainly fuzzy logic, artificial neural networks, Q-learning, the artificial bee colony algorithm, and the particle swarm algorithm; 4) prediction-based access algorithms, including mobility-trend prediction based on location information and network-information prediction based on the Markov decision process (MDP).
However, most existing network access selection methods optimize the selection policy only from the user side or only from the network side, which leads to serious network congestion, poor user experience, and degraded system performance. How to design a user access algorithm that comprehensively considers user preference, service demand, quality of experience, and actual network conditions, and that adapts to a dynamic, complex 5G ultra-dense network environment, therefore remains an open problem for current researchers.
Disclosure of Invention
Based on the problems in the prior art, the invention provides a 5G ultra-dense network multi-user access selection method based on deep reinforcement learning, which comprises the following steps:
s1: acquiring each network attribute, and setting a threshold value according to the minimum tolerance of a user to each network attribute; screening the network according to the set threshold value to obtain the network meeting the user requirement;
s2: determining the comprehensive attribute weight of each network by adopting an analytic hierarchy process and the entropy weight method;
s3: calculating a weighted gray correlation coefficient of the network by adopting a gray correlation analysis method according to the attribute comprehensive weight of the network;
s4: modeling the multi-user access problem as a multi-agent stochastic game model according to the weighted gray correlation coefficients;
s5: solving the optimal strategy of the multi-agent stochastic game model by adopting the deep deterministic policy gradient method, and accessing the user to the network according to the optimal strategy.
Preferably, the process of screening the network includes:
s11: acquiring attributes of each network, wherein the attributes comprise bandwidth, energy efficiency, time delay, jitter and tariff;
s12: acquiring minimum tolerance of a user to different network attributes, and calculating the average value of the minimum tolerance to obtain a reference network H, wherein the average value of the minimum tolerance of the reference network is a set threshold;
s13: comparing each candidate network with the reference network H; if any attribute of a candidate network is lower than the corresponding threshold, the network is deleted from the candidate set, and if all of its attributes are greater than or equal to the thresholds, the network is retained in the candidate set.
Preferably, the process of determining the comprehensive attribute weight of each network by using the analytic hierarchy process and the entropy weight method includes:
s21: constructing a hierarchical model, wherein the hierarchical model comprises a target layer, a criterion layer and a scheme layer; the target layer represents the optimal network that a user requesting a specific service desires to access, the criterion layer represents the requirements that the requested service places on the network parameters, and the scheme layer represents the candidate networks in the system model;
s22: comparing every two attributes of the same layer network, and constructing a decision attribute comparison matrix by adopting a 1-9 scale assignment method;
s23: normalizing the decision attribute comparison matrix to obtain a network attribute weight;
s24: calculating the consistency ratio of the matrix according to the decision attribute comparison matrix, setting a consistency ratio threshold, if the consistency ratio of the matrix is less than or equal to the set consistency ratio threshold, the network attribute weight is reasonable, otherwise, reconstructing the decision attribute comparison matrix until the consistency ratio threshold is met;
s25: constructing an attribute decision matrix, and standardizing the matrix; normalizing the standardized attribute decision matrix;
s26: calculating the information entropy of the attribute decision matrix after normalization processing;
s27: calculating the entropy weight of the network attribute according to the information entropy to obtain an objective weight of the network attribute;
s28: and processing the network attribute weight and the objective weight by adopting a product method, and normalizing the processed weights to obtain the comprehensive weight of the network attribute.
Further, the comprehensive weight of the network attribute is expressed as:

w_i^k = \frac{\alpha_i^k \beta_i}{\sum_{j=1}^{l} \alpha_j^k \beta_j}

where \alpha_i^k represents the subjective network attribute weight, i represents the network decision attribute type, k represents the user service type, \beta_i represents the objective weight of the network attribute, and l represents the total number of network decision attribute types.
Preferably, the process of calculating the weighted gray correlation coefficient of the network includes:
s31: constructing a network attribute matrix E;
s32: standardizing and normalizing the network attribute matrix to obtain a normalized network attribute matrix E';
s33: determining the attribute type of the network, and determining the attribute of the optimal reference network according to the attribute type;
s34: respectively calculating the correlation between different attributes in the candidate network and the corresponding attributes of the reference network to obtain grey correlation coefficients;
s35: and multiplying the gray correlation coefficient by the attribute comprehensive weight to obtain a weighted gray correlation coefficient.
Further, the gray correlation coefficient is calculated by the following formula:

\zeta_{ij} = \frac{\min_i \min_j |e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}{|e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}

where \rho represents the resolution coefficient, e'_{oj} represents the jth attribute of the reference network, and e'_{ij} represents the element in the ith row and jth column of the normalized network attribute matrix.
Preferably, the process of constructing the multi-agent stochastic game model comprises the following steps: obtaining the environment state set S, the action sets A_1, …, A_N and the agents' partial observation sets O_1, …, O_N of the environment, where S = \{s \mid s = (n_i^k)\}, with n_i^k denoting the number of end users connected to network i that are provided with service k, and A_n = \{a \mid a = 1, 2, …, m\}, with a denoting the candidate network selected by the user terminal in a specific network state. Each agent selects an action A_n according to its own policy, the policy function being expressed as \mu_{\theta_n}: O_n \to A_n. The next state of the environment is determined by the agents' selected actions and the transition probability P: S \times A_1 \times … \times A_N \to S. After each agent selects an action, it obtains a reward r_n from the environment and a new partial observation o_n of the environment; the reward function is expressed as r_n: S \times A_n \to R and the observation function as o_n: S \to O_n. The network utility function Q that characterizes the user's QoS is calculated from the weighted gray correlation coefficients; the reward of each user terminal is r_n(o_n, a_n) = Q(1 - \eta), where \eta represents the network load, and the cumulative return of each user terminal is expressed as

R_n = \sum_{t=0}^{T} \gamma^t r_n^t

where \gamma represents the discount coefficient and T represents the time range. The ultimate goal of each agent is to find the optimal policy that maximizes the expected cumulative return R_n, expressed as

\max_{\theta_n} J(\theta_n) = \mathbb{E}_{s \sim \rho^{\pi}, a_n \sim \mu_{\theta_n}}[R_n]

where \rho^{\pi} represents the state distribution function.
Preferably, the process of solving the optimal strategy includes:
s51: constructing an evaluation network and a strategy network; the evaluation network comprises a state estimation network and a state reality network, and the strategy network comprises an action estimation network and an action reality network;
s52: storing the switching decision data into a space D for repeated learning and training of the evaluation network and the strategy network; the switching decision data are stored as a tuple (s_t, a_t^1, …, a_t^N, r_t^1, …, r_t^N, s_{t+1}), comprising the environment state at decision time t, the actions and rewards of all agents, and the environment state at the next decision time t + 1;
s53: randomly extracting M groups of switching data in the storage space D as training samples, and inputting the training samples into a state estimation network for training to obtain a trained state estimation network;
s54: updating the action estimation network by means of the policy gradient: calculating the policy gradient, updating the action estimation network parameters \theta_n in the direction of increasing the value function by the gradient ascent method, and acquiring the Q value of the state-action pair through the trained state estimation network;
s55: respectively carrying out soft updating on the state reality network and the action reality network by adopting a moving average method according to the parameters of the state estimation network and the action estimation network;
s56: performing L loop iterations, each of which starts from the initial moment with an initialized network state and runs until the termination moment; performing operation steps S53-S55 at each decision moment to train and update the network parameters; and determining the optimal access strategy according to the trained strategy network.
To achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements any one of the above 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
In order to achieve the above object, the present invention further provides a 5G ultra-dense network multi-user access selection device based on deep reinforcement learning, which includes a processor and a memory; the memory is used for storing a computer program; the processor is connected with the memory and is used for executing the computer program stored in the memory, so that the 5G ultra-dense network multi-user access selection device based on deep reinforcement learning executes any one of the above 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
The invention has the beneficial effects that:
aiming at the problem of multi-user access selection in a 5G ultra-dense network, the invention optimizes network selection from both the user side and the network side. The networks are pre-screened before the network decision, which reduces the number of handovers. To better meet differentiated service requirements and efficiently guarantee the user experience, the network selection problem is modeled with a Markov decision model, and the analytic hierarchy process, the entropy weight method and the gray correlation analysis method are jointly adopted so that, where network conditions allow, each user is matched to its preferred network. For the complex decision problem in a dynamic network, the optimal strategy is solved by the deep deterministic policy gradient method, so that user congestion is reduced and system throughput is improved on the premise of guaranteeing the user experience.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a block diagram of a hierarchy for determining subjective weighting of parameters in accordance with the present invention;
FIG. 3 is a diagram of a system model of the present invention;
FIG. 4 is a diagram of the structure of MADDPG according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A 5G ultra-dense network multi-user access selection method based on deep reinforcement learning, as shown in fig. 1, the method includes:
s1: acquiring each network attribute, and setting a threshold value according to the lowest tolerance of a user to each network attribute; screening the network according to the set threshold value to obtain the network meeting the user requirement;
s2: determining the comprehensive attribute weight of each network by adopting an analytic hierarchy process and the entropy weight method;
s3: calculating a weighted gray correlation coefficient of the network by adopting a gray correlation analysis method according to the attribute comprehensive weight of the network;
s4: modeling the multi-user access problem as a multi-agent stochastic game model according to the weighted gray correlation coefficients;
s5: solving the optimal strategy of the multi-agent stochastic game model by adopting the deep deterministic policy gradient method, and accessing the user to the network according to the optimal strategy.
In a possible implementation, the step of S1, setting a threshold according to the minimum tolerance of the user to each network attribute, and excluding completely unsatisfactory networks, includes:
s11: acquiring the attribute of each network; the network attributes include: bandwidth, energy efficiency, latency, jitter, and tariff.
S12: acquiring the minimum tolerance of a user to different network attributes, and calculating the average value of the minimum tolerance to obtain a reference network H, wherein the average value of the minimum tolerance of the reference network is a set threshold; specifically, the lowest tolerance of the user to different network attributes is collected, and an average value is taken to obtain a reference network H, which is specifically represented as:
H = (h'_1, h'_2, …, h'_l), \quad h'_j = \frac{1}{N} \sum_{n=1}^{N} h_{nj}

where n \in N represents the nth user, N is the total number of users, j represents the jth network attribute, h_{nj} is the minimum tolerance of user n for network attribute j, and h'_j represents the threshold, i.e. the average minimum tolerance of the users for network attribute j.
S13: and comparing the candidate network with the reference network H, deleting the network in the candidate network if one attribute in the candidate network is lower than the set threshold, and keeping the network in the candidate network if all the attributes in the candidate network are greater than or equal to the set threshold.
In one possible embodiment, as shown in fig. 2, the step of determining subjective weights of network attributes by using an analytic hierarchy process in S2 includes:
A hierarchical model is constructed. The decision scheme is divided into three layers: the target layer (representing the optimal network that a user requesting a particular service desires to access), the criterion layer (representing the requirements that the requested service places on the network parameters), and the scheme layer (representing the candidate networks in the system model).
Judgment comparison matrices are constructed for the different internet-of-things services according to the importance of the different network attributes in network selection; the services comprise the intelligent health service, the VR & AR service and the industrial machinery service, expressed as k \in K = \{1, 2, 3\}. The attributes of the same layer are compared pairwise, and the comparison judgment matrix is constructed with the 1-9 scale assignment method, expressed as:

C^k = (c_{ij}^k)_{l \times l}

where c_{ij}^k represents the comparison scale value of the ith attribute c_i relative to the jth attribute c_j under the requirements of user service k, and l represents the total number of network attribute types; the values of c_{ij}^k are shown in Table 1.
TABLE 1
[1-9 scale assignment values; the table is reproduced as an image in the original publication]
As an emerging medical application, the intelligent health service provides telemedicine, and in particular telesurgery, for patients; precise remote operations need to be performed in a stable environment. The intelligent health service therefore has strict requirements on delay and jitter, and its judgment matrix is constructed as shown in Table 2.
TABLE 2
[Judgment matrix for the intelligent health service; reproduced as an image in the original publication]
Typical application scenarios of the VR & AR service include holographic navigation and intelligent VR gaming, which require high bandwidth to provide a good user experience; the judgment matrix for the VR & AR service is therefore constructed as shown in Table 3.
TABLE 3
[Judgment matrix for the VR & AR service; reproduced as an image in the original publication]
Application scenarios of the industrial machinery service include real-time monitoring of production equipment and remote control of construction machinery, which place high requirements on delay; the judgment matrix for the industrial machinery service is therefore constructed as shown in Table 4.
TABLE 4
[Judgment matrix for the industrial machinery service; reproduced as an image in the original publication]
Each column of C^k is normalized to obtain the normalized matrix, expressed as:

\bar{c}_{ij}^k = \frac{c_{ij}^k}{\sum_{i=1}^{l} c_{ij}^k}
The weight values are then calculated from the normalized decision attributes; these are the subjective weights, expressed as:

\alpha_i^k = \frac{1}{l} \sum_{j=1}^{l} \bar{c}_{ij}^k
A consistency index is defined as

CI = \frac{\lambda_{max} - l}{l - 1}

where \lambda_{max} represents the maximum eigenvalue of the judgment matrix C^k. The consistency ratio is then CR = CI / RI, where RI is the random consistency index, which can be obtained from Table 5. When CR \le 0.1 the weights are considered reasonable; otherwise the judgment matrix must be reconstructed until the condition is met.
TABLE 5
n    1     2     3     4     5     6     7     8     9
RI   0     0     0.52  0.89  1.12  1.26  1.36  1.41  1.46
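The following Python sketch illustrates how the subjective weights and the consistency ratio of this step can be computed. The example judgment matrix is invented for illustration and is not one of the patent's Tables 2-4.

```python
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.52, 4: 0.89, 5: 1.12, 6: 1.26, 7: 1.36, 8: 1.41, 9: 1.46}  # Table 5

def ahp_weights(C: np.ndarray):
    """Column-normalize the judgment matrix, average the rows to get subjective weights, and compute CR."""
    l = C.shape[0]
    col_norm = C / C.sum(axis=0, keepdims=True)
    alpha = col_norm.mean(axis=1)                     # subjective weights alpha_i^k
    lam_max = np.max(np.linalg.eigvals(C).real)       # maximum eigenvalue of C^k
    ci = (lam_max - l) / (l - 1)
    cr = ci / RI[l] if RI[l] > 0 else 0.0
    return alpha, cr

if __name__ == "__main__":
    # Illustrative 5x5 judgment matrix (bandwidth, energy efficiency, delay, jitter, tariff)
    C = np.array([
        [1,   3,   1/5, 1/3, 5],
        [1/3, 1,   1/7, 1/5, 3],
        [5,   7,   1,   3,   9],
        [3,   5,   1/3, 1,   7],
        [1/5, 1/3, 1/9, 1/7, 1],
    ])
    alpha, cr = ahp_weights(C)
    print("subjective weights:", np.round(alpha, 3), "CR =", round(cr, 3))  # CR <= 0.1 means acceptable
```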
In a possible implementation, in S2, solving the objective network attribute weights by the entropy weight method includes:
constructing the attribute decision matrix X = (x_{ij})_{m \times l} and standardizing it, the result being denoted B = (b_{ij})_{m \times l}.
For benefit-type attributes (such as bandwidth and energy efficiency), a larger raw value yields a larger standardized value b_{ij} (the standardization formula is reproduced as an image in the original publication).
for cost-type attributes (such as delay, jitter, etc.), the normalized expression is:
Figure BDA0003493064740000102
the judgment matrix is normalized, and the normalized value of the j attribute of the ith network is
Figure BDA0003493064740000103
From this, the information entropy of network attribute j can be determined as

E_j = -\frac{1}{\ln m} \sum_{i=1}^{m} p_{ij} \ln p_{ij}

where m is the number of candidate networks and p_{ij} is the normalized value defined above.
The entropy weight is calculated from the information entropy to obtain the objective weight \beta_j of network attribute j, expressed as

\beta_j = \frac{1 - E_j}{\sum_{j=1}^{l} (1 - E_j)}
The comprehensive weight of the network attribute is obtained by the product method followed by normalization, expressed as:

w_i^k = \frac{\alpha_i^k \beta_i}{\sum_{j=1}^{l} \alpha_j^k \beta_j}

where \alpha_i^k represents the subjective network attribute weight, i represents the network decision attribute type, k represents the user service type, \beta_i represents the objective weight of the network attribute, and l represents the total number of network decision attribute types.
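A minimal Python sketch of the entropy-weight step and the product-method combination follows. The min-max standardization, the illustrative subjective weights and the benefit/cost orientation of the attributes are assumptions of this sketch, since the patent's standardization formulas are only reproduced as images.

```python
import numpy as np

def entropy_weights(X: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    """Entropy-weight method: standardize (benefit vs cost), normalize columns, compute E_j and beta_j."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)
    B = np.where(benefit, (X - mn) / span, (mx - X) / span)    # min-max standardization (assumed form)
    P = B / np.maximum(B.sum(axis=0, keepdims=True), 1e-12)
    m = X.shape[0]
    E = -(P * np.log(np.maximum(P, 1e-12))).sum(axis=0) / np.log(m)
    return (1.0 - E) / (1.0 - E).sum()                          # objective weights beta_j

def combined_weights(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Product method plus normalization: w_j = alpha_j * beta_j / sum(alpha * beta)."""
    w = alpha * beta
    return w / w.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(0.1, 1.0, size=(6, 5))                 # 6 candidate networks x 5 attributes
    benefit = np.array([True, True, False, False, False])  # bandwidth/energy efficiency benefit; delay/jitter/tariff cost
    alpha = np.array([0.25, 0.10, 0.35, 0.20, 0.10])       # subjective weights from the AHP step (illustrative)
    beta = entropy_weights(X, benefit)
    print("objective weights:", np.round(beta, 3))
    print("combined weights :", np.round(combined_weights(alpha, beta), 3))
```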
In a possible embodiment, the step of S3, obtaining the weighted gray correlation coefficients by combining the gray correlation analysis method with the comprehensive network attribute weights, includes:
s31: constructing a network attribute matrix E, wherein the constructed expression is as follows:
E = (e_{ij})_{m \times l}

where m represents the number of candidate networks, l represents the total number of network decision attribute types, and e_{ij} represents attribute j of network i.
S32: normalizing and normalizing the network attributes by using the method mentioned in step S2 to obtain a matrix E ', the expression of the matrix E' is:
E' = (e'_{ij})_{m \times l}
s33: determining the attribute type of each network, and determining the attributes of the optimal reference network according to the attribute types. When an attribute is a benefit-type attribute, the optimal attribute value is the maximum value in its column; when it is a cost-type attribute, the optimal attribute value is the minimum value in its column. The optimal reference attribute vector is denoted as
x_o = (e'_{o1}, e'_{o2}, …, e'_{ol})
S34: and respectively calculating the correlation between different attributes in the candidate network and the corresponding attributes of the reference network to obtain a grey correlation coefficient. The expression for the grey correlation coefficient is:
\zeta_{ij} = \frac{\min_i \min_j |e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}{|e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}

where \rho represents the resolution coefficient and \rho \in [0, 1].
S35: and multiplying the gray correlation coefficient by the attribute comprehensive weight to obtain a weighted gray correlation coefficient. The expression for the weighted gray correlation coefficient is:
\zeta'_{ij} = w_j^k \zeta_{ij}

where w_j^k is the comprehensive weight of attribute j under user service k.
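The weighted gray correlation computation of steps S31-S35 can be sketched in Python as follows. The resolution coefficient value, the sample data and the use of the row sum as a per-network utility are illustrative assumptions.

```python
import numpy as np

def weighted_grey_coefficients(E_norm: np.ndarray, benefit: np.ndarray,
                               w: np.ndarray, rho: float = 0.5) -> np.ndarray:
    """Gray correlation coefficients against the optimal reference row, scaled by the combined weights."""
    # Optimal reference: column maximum for benefit attributes, column minimum for cost attributes
    ref = np.where(benefit, E_norm.max(axis=0), E_norm.min(axis=0))
    diff = np.abs(ref - E_norm)                               # |e'_oj - e'_ij|
    d_min, d_max = diff.min(), diff.max()
    zeta = (d_min + rho * d_max) / (diff + rho * d_max + 1e-12)
    return zeta * w                                           # weighted coefficients zeta'_ij = w_j * zeta_ij

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    E_norm = rng.uniform(0.0, 1.0, size=(6, 5))               # normalized attribute matrix E'
    benefit = np.array([True, True, False, False, False])
    w = np.array([0.3, 0.1, 0.3, 0.2, 0.1])                   # combined weights (illustrative)
    zw = weighted_grey_coefficients(E_norm, benefit, w)
    print("per-network utility (row sums):", np.round(zw.sum(axis=1), 3))
```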
in a possible implementation, the modeling of the multi-user access problem as the multi-user random game model in S4 includes:
in order to distinguish the previous research on the heterogeneous network selection model, reflect the density and the heterogeneous type of 5G, and construct a heterogeneous network system model consisting of 5G, LTE, WiMAX, UMTS BS and two Wi-Fi 6 APs, i.e. m is 6, as shown in fig. 3. In a heterogeneous network environment, 100 user terminals are randomly generated, the terminals randomly generate K which is 3 service types, the arrival and departure of the terminals are distributed according to Poisson, and the service time is distributed according to a negative index.
The Markov decision process (MDP) is extended to a partially observable Markov decision process for multiple agents. A Markov game of N agents is defined by a set of environment states S, action sets A_1, …, A_N, and the agents' partial observation sets O_1, …, O_N of the environment. Each agent selects an action A_n according to its own policy, the policy function being expressed as \mu_{\theta_n}: O_n \to A_n. The environment transfers to the next state with a certain probability according to the actions selected by the agents, the transition probability being P: S \times A_1 \times … \times A_N \to S. After each agent selects an action, it obtains a reward r_n and a new partial observation o_n from the environment; the reward function is expressed as r_n: S \times A_n \to R and the observation function as o_n: S \to O_n. The initial state of the environment is determined by the distribution function \rho: S \to [0, 1]. The goal of each agent n is to maximize its own total expected return, the return function being

R_n = \sum_{t=0}^{T} \gamma^t r_n^t

where \gamma denotes the discount coefficient and T denotes the time range.
In the multi-user access selection problem, the arrival and departure of users cause changes in the environment state; the environment state S is therefore defined as

S = \{s \mid s = (n_1^1, …, n_i^k, …, n_m^K)\}

where n_i^k represents the number of end users connected to network i that are provided with service k.
In the multi-user access selection problem, the user terminal takes a network selection action in each environment state, so the action of each agent in the model is defined as A_n = \{a \mid a = 1, 2, …, m\}, which represents the candidate network selected by the user terminal in a specific network state.
In the multi-user access selection problem, the goal of each user terminal is to select the network that maximizes the expected cumulative network utility. Therefore, the network utility function Q that characterizes the user's QoS is calculated from the weighted gray correlation coefficients. From this, the reward of each user terminal is derived as r_n(o_n, a_n) = Q(1 - \eta), where \eta represents the network load, and the cumulative return of each user terminal over the time horizon is

R_n = \sum_{t=0}^{T} \gamma^t r_n^t(o_n^t, a_n^t)

where \gamma represents the discount coefficient and T represents the time range. The ultimate goal of each agent is to find the optimal policy that maximizes the expected cumulative return R_n, expressed as

\max_{\theta_n} J(\theta_n) = \mathbb{E}_{s \sim \rho^{\pi}, a_n \sim \mu_{\theta_n}}[R_n]

where \rho^{\pi} represents the state distribution function.
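The reward and return defined above can be sketched in Python as follows. The use of the sum of weighted gray correlation coefficients as the utility Q, and the numeric values, are illustrative assumptions.

```python
import numpy as np

def reward(weighted_zeta_row: np.ndarray, load: float) -> float:
    """Immediate reward r_n(o_n, a_n) = Q * (1 - eta): QoS utility of the chosen network scaled by its free capacity."""
    q = weighted_zeta_row.sum()   # utility from the weighted gray correlation coefficients (assumed sum form)
    return q * (1.0 - load)

def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    """Cumulative discounted return R_n = sum_t gamma^t * r_t over the time horizon T."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

if __name__ == "__main__":
    zeta_row = np.array([0.21, 0.08, 0.27, 0.15, 0.06])   # weighted coefficients of the selected network (illustrative)
    rs = [reward(zeta_row, load) for load in (0.2, 0.35, 0.5)]
    print("immediate rewards:", np.round(rs, 3))
    print("discounted return:", round(discounted_return(rs), 3))
```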
In a possible implementation, as shown in fig. 4, the step S5 of solving the optimal policy by the deep deterministic policy gradient method includes:
The evaluation network and the policy network are constructed as BP neural networks; in this embodiment each neural network consists of an input layer, a hidden layer and an output layer. Each agent has its own evaluation network and policy network.
The user terminal continuously interacts with the environment: in the initial state s_t, the deterministic action output by the policy network is perturbed into a stochastic process to select the action a_t^n. After each agent executes its action, it obtains the corresponding reward r_t^n from the environment and the next network state s_{t+1}, and the next step begins; the loop continues until the moment the connection is terminated, after which the next episode begins and the cycle repeats. The data generated by the interaction process, (s_t, a_t^1, …, a_t^N, r_t^1, …, r_t^N, s_{t+1}), are stored in the space D for repeated learning and training of the neural networks;
when the number of samples stored in the storage space D meets the requirement of neural network training on the data volume, after each interaction between the agent and the environment, a small batch of switching samples are randomly sampled and extracted from the storage space D and used as training data of the state estimation network and the action estimation network.
Step 1: training the state estimation network by the stochastic gradient descent method.
a loss function is constructed using a mean square error model, expressed as:
L(\theta_n) = \mathbb{E}\big[(Q_n(x_t, a_t^1, …, a_t^N) - y_t)^2\big]

where Q_n(x_t, a_t^1, …, a_t^N) is the estimated value output by the state estimation network, x_t = (o_t^1, …, o_t^N) is the local observation of the environment by all agents, and y_t, the true value of the sample regarded as the label, is expressed as:

y_t = r_t^n + \gamma Q'_n(x_{t+1}, a_{t+1}^1, …, a_{t+1}^N)\big|_{a_{t+1}^j = \mu'_j(o_{t+1}^j)}

where \gamma (0 \le \gamma \le 1) is the discount factor, Q'_n is the Q value estimated by the state reality network, and \mu'_j is the deterministic policy output by the action reality network.
Based on the constructed loss function, the minimum of the loss function is solved iteratively by back-propagation with the gradient descent method, so as to update the state estimation network.
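A minimal PyTorch sketch of this state estimation (critic) network update is given below; the network sizes, hyperparameters and random batch are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """State estimation network Q_n(x, a_1..a_N): input layer, one hidden layer, output layer, as described above."""
    def __init__(self, obs_dim: int, act_dim: int, n_agents: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * (obs_dim + act_dim), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_all: torch.Tensor, act_all: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs_all, act_all], dim=-1))

def critic_update(critic, target_critic, optimizer, batch, gamma: float = 0.95) -> float:
    """One gradient-descent step on the MSE loss (Q(x_t, a_t) - y_t)^2 with y_t = r_t + gamma * Q'(x_{t+1}, a_{t+1})."""
    obs, act, rew, next_obs, next_act = batch
    with torch.no_grad():
        y = rew + gamma * target_critic(next_obs, next_act)   # label from the state reality (target) network
    loss = nn.functional.mse_loss(critic(obs, act), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    obs_dim, act_dim, n_agents, m_batch = 6, 6, 3, 32
    critic = Critic(obs_dim, act_dim, n_agents)
    target = Critic(obs_dim, act_dim, n_agents)
    target.load_state_dict(critic.state_dict())
    opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    batch = (torch.randn(m_batch, n_agents * obs_dim), torch.randn(m_batch, n_agents * act_dim),
             torch.randn(m_batch, 1), torch.randn(m_batch, n_agents * obs_dim),
             torch.randn(m_batch, n_agents * act_dim))
    print("critic loss:", critic_update(critic, target, opt, batch))
```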
Step 2: training the action estimation network by the stochastic gradient ascent method.
After the state estimation network is updated, the action estimation network is updated by means of the policy gradient, which is expressed as:

\nabla_{\theta_n} J \approx \mathbb{E}\big[\nabla_{\theta_n} \mu_n(o_t^n)\, \nabla_{a_n} Q_n(x_t, a_t^1, …, a_t^N)\big|_{a_n = \mu_n(o_t^n)}\big]

After the policy gradient is obtained, the action estimation network parameters \theta_n are updated in the direction of increasing the value function by the gradient ascent method, and the Q value of the state-action pair is acquired through the state estimation network.
Step 3: updating the state reality network and the action reality network by the soft update method.
The parameters of the state reality network and the action reality network are soft-updated from the parameters of the state estimation network and the action estimation network, respectively, using a moving average, expressed as:

\theta^{Q'}_n \leftarrow \alpha \theta^{Q}_n + (1 - \alpha) \theta^{Q'}_n
\theta^{\mu'}_n \leftarrow \alpha \theta^{\mu}_n + (1 - \alpha) \theta^{\mu'}_n

where \alpha is the learning rate.
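A short Python sketch of this moving-average soft update is given below, with alpha as the soft-update coefficient described above; the toy layers are illustrative.

```python
import torch

def soft_update(estimate_net: torch.nn.Module, reality_net: torch.nn.Module, alpha: float = 0.01) -> None:
    """Moving-average soft update: theta_reality <- alpha * theta_estimate + (1 - alpha) * theta_reality."""
    with torch.no_grad():
        for p_est, p_real in zip(estimate_net.parameters(), reality_net.parameters()):
            p_real.data.mul_(1.0 - alpha).add_(alpha * p_est.data)

if __name__ == "__main__":
    est = torch.nn.Linear(4, 2)
    real = torch.nn.Linear(4, 2)
    soft_update(est, real, alpha=0.01)
    print("soft update applied; sample weight:", real.weight[0, 0].item())
```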
The above is iterated for L loops, each of which starts from the initial moment with an initialized network state and runs until the termination moment; at each decision moment, the operations of step 1 to step 3 are performed and the network parameters are trained and updated; the optimal access strategy is then determined according to the trained policy network.
In an embodiment of the present invention, the present invention further includes a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the above-mentioned 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
A5G ultra-dense network multi-user access selection device based on deep reinforcement learning comprises a processor and a memory; the memory is used for storing a computer program; the processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the 5G ultra-dense network multi-user access selection device based on deep reinforcement learning to execute any one of the 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
Specifically, the memory includes: various media that can store program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
Preferably, the Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A5G ultra-dense network multi-user access selection method based on deep reinforcement learning is characterized by comprising the following steps:
s1: acquiring each network attribute, and setting a threshold value according to the lowest tolerance of a user to each network attribute; screening the network according to the set threshold value to obtain the network meeting the user requirement;
s2: determining the comprehensive attribute weight of each network by adopting an analytic hierarchy process and the entropy weight method;
s3: calculating a weighted gray correlation coefficient of the network by adopting a gray correlation analysis method according to the attribute comprehensive weight of the network;
s4: modeling the multi-user access problem as a multi-agent stochastic game model according to the weighted gray correlation coefficients;
s5: solving the optimal strategy of the multi-agent stochastic game model by adopting the deep deterministic policy gradient method, and accessing the user to the network according to the optimal strategy.
2. The method for selecting the 5G ultra-dense network multi-user access based on the deep reinforcement learning of claim 1, wherein the process of screening the network comprises:
s11: acquiring attributes of each network, wherein the attributes comprise bandwidth, energy efficiency, time delay, jitter and tariff;
s12: acquiring minimum tolerance of a user to different network attributes, and calculating the average value of the minimum tolerance to obtain a reference network H, wherein the average value of the minimum tolerance of the reference network is a set threshold;
s13: and comparing the candidate network with the reference network H, deleting the network in the candidate network if one attribute in the candidate network is lower than the set threshold, and keeping the network in the candidate network if all the attributes in the candidate network are greater than or equal to the set threshold.
3. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 1, wherein the process of determining the comprehensive weight of the attributes of each network by using an analytic hierarchy process and an entropy weight method comprises:
s21: constructing a hierarchical model, wherein the hierarchical model comprises a target layer, a criterion layer and a scheme layer, the target layer represents an optimal network which a user requesting a specific service desires to access, the criterion layer represents the requirement of service request service on network parameters, and the scheme layer represents a candidate network in a system model;
s22: comparing every two attributes of the same layer network, and constructing a decision attribute comparison matrix by adopting a 1-9 scale assignment method;
s23: normalizing the decision attribute comparison matrix to obtain a network attribute weight;
s24: calculating the consistency ratio of the matrix according to the decision attribute comparison matrix, setting a consistency ratio threshold, if the consistency ratio of the matrix is less than or equal to the set consistency ratio threshold, the network attribute weight is reasonable, otherwise, reconstructing the decision attribute comparison matrix until the consistency ratio threshold is met;
s25: constructing an attribute decision matrix, and standardizing the matrix; normalizing the standardized attribute decision matrix;
s26: calculating the information entropy of the attribute decision matrix after normalization processing;
s27: calculating the entropy weight of the network attribute according to the information entropy to obtain an objective weight of the network attribute;
s28: and processing the network attribute weight and the objective weight by adopting a product method, and normalizing the processed weights to obtain the comprehensive weight of the network attribute.
4. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 3, wherein the expression of the comprehensive weight of the network attribute is as follows:
w_i^k = \frac{\alpha_i^k \beta_i}{\sum_{j=1}^{l} \alpha_j^k \beta_j}

where \alpha_i^k represents the subjective network attribute weight, i represents the network decision attribute type, k represents the user service type, \beta_i represents the objective weight of the network attribute, and l represents the total number of network decision attribute types.
5. The method for selecting the 5G ultra-dense network multi-user access based on the deep reinforcement learning of claim 1, wherein the process of calculating the weighted gray correlation coefficient of the network comprises:
s31: constructing a network attribute matrix E;
s32: standardizing and normalizing the network attribute matrix to obtain a normalized network attribute matrix E';
s33: determining the attribute type of the network, and determining the attribute of the optimal reference network according to the attribute type;
s34: respectively calculating the correlation of different attributes in the candidate network with the corresponding attributes of the reference network to obtain grey correlation coefficients;
s35: and multiplying the gray correlation coefficient by the attribute comprehensive weight to obtain a weighted gray correlation coefficient.
6. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 5, wherein the grey correlation coefficient is calculated by the following formula:
\zeta_{ij} = \frac{\min_i \min_j |e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}{|e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}

where \rho represents the resolution coefficient, e'_{oj} represents the jth attribute of the reference network, and e'_{ij} represents the element in the ith row and jth column of the normalized network attribute matrix.
7. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 1, wherein the process of constructing the multi-agent stochastic game model comprises:
obtaining the environment state set S, the action sets A_1, …, A_N and the agents' partial observation sets O_1, …, O_N of the environment, where S = \{s \mid s = (n_i^k)\}, with n_i^k denoting the number of end users connected to network i that are provided with service k, and A_n = \{a \mid a = 1, 2, …, m\}, with a denoting the candidate network selected by the user terminal in a specific network state;
each agent selects an action A_n according to its own policy, the policy function being expressed as \mu_{\theta_n}: O_n \to A_n; the next state of the environment is determined by the agents' selected actions and the transition probability P: S \times A_1 \times … \times A_N \to S; after each agent selects an action, it obtains a reward r_n from the environment and a new partial observation o_n of the environment; the reward function is expressed as r_n: S \times A_n \to R and the observation function as o_n: S \to O_n;
the network utility function Q that characterizes the user's QoS is calculated from the weighted gray correlation coefficients; the reward of each user terminal is r_n(o_n, a_n) = Q(1 - \eta), where \eta represents the network load, and the cumulative return of each user terminal is expressed as R_n = \sum_{t=0}^{T} \gamma^t r_n^t, where \gamma represents the discount coefficient and T represents the time range; the optimal policy is found according to the return function so as to maximize the expected cumulative return R_n, i.e.

\max_{\theta_n} J(\theta_n) = \mathbb{E}_{s \sim \rho^{\pi}, a_n \sim \mu_{\theta_n}}[R_n]

where \rho^{\pi} represents the state distribution function, J(\theta_n) represents the expectation of the cumulative return R_n, \mathbb{E} denotes the expectation operator, \theta_n represents the neural network parameters, and \mu_{\theta_n} represents the agent policy function.
8. The deep reinforcement learning-based 5G ultra-dense network multi-user access selection method according to claim 1, wherein the process of solving the optimal strategy comprises:
s51: constructing an evaluation network and a strategy network; the evaluation network comprises a state estimation network and a state reality network, and the strategy network comprises an action estimation network and an action reality network;
s52: storing the switching decision data into a space D for repeated learning and training of the evaluation network and the strategy network; the switching decision data are stored as a tuple (s_t, a_t^1, …, a_t^N, r_t^1, …, r_t^N, s_{t+1}), comprising the environment state at decision time t, the actions and rewards of all agents, and the environment state at the next decision time t + 1;
s53: randomly extracting M groups of switching data in the storage space D as training samples, and inputting the training samples into a state estimation network for training to obtain a trained state estimation network;
s54: updating the action estimation network by means of the policy gradient: calculating the policy gradient, updating the action estimation network parameters \theta_n in the direction of increasing the value function by the gradient ascent method, and acquiring the Q value of the state-action pair through the trained state estimation network;
s55: respectively carrying out soft updating on the state reality network and the action reality network by adopting a moving average method according to the parameters of the state estimation network and the action estimation network;
s56: performing loop iteration for L times, wherein each loop iteration starts from the initial moment to obtain an initialized network state till the termination moment; performing operation steps S53-S55 at each decision moment, and training and updating network parameters; and determining an optimal access strategy according to the trained strategy network.
9. A computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method for deep reinforcement learning based 5G ultra dense network multi-user access selection according to any one of claims 1 to 8.
10. A5G ultra-dense network multi-user access selection device based on deep reinforcement learning is characterized by comprising a processor and a memory; the memory is used for storing a computer program; the processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the deep reinforcement learning-based 5G ultra-dense network multi-user access selection device to execute the deep reinforcement learning-based 5G ultra-dense network multi-user access selection method in any one of claims 1 to 8.
CN202210103469.2A 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning Active CN114449536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103469.2A CN114449536B (en) 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210103469.2A CN114449536B (en) 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114449536A true CN114449536A (en) 2022-05-06
CN114449536B CN114449536B (en) 2023-05-23

Family

ID=81370383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103469.2A Active CN114449536B (en) 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114449536B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103108382A (en) * 2012-11-20 2013-05-15 南京邮电大学 Heterogeneous network multi-attribute decision-making method based on network analytic hierarchy process
CN103607756A (en) * 2013-10-29 2014-02-26 南京邮电大学 Heterogeneous-network multiattribute decision-making method based on synergetic principle
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN107734512A (en) * 2017-09-30 2018-02-23 南京南瑞集团公司 A kind of network selecting method based on the analysis of gray scale relevance presenting levelses
CN110225535A (en) * 2019-06-04 2019-09-10 吉林大学 Heterogeneous wireless network vertical handoff method based on depth deterministic policy gradient

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GUANGDONG TIAN: "AHP, Gray Correlation, and TOPSIS Combined Approach to Green Performance Evaluation of Design Alternatives", IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS *
FENG Bao; GAO Xue; GONG Liangliang: "Heterogeneous network selection algorithm based on gray correlation hierarchical analysis in power wireless communication systems", Electronic Design Engineering
TANG Hong: "Research on improvement of the practical Byzantine fault tolerance algorithm", Computer Engineering and Applications
LI Wanghong: "Network selection algorithm based on machine learning in heterogeneous networks", China Master's Theses Full-text Database
QIN Yonggang; DOU Zhumei; HU Gang: "Heterogeneous wireless network selection algorithm based on dynamic weights", Computer and Information Technology

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822329A (en) * 2023-05-11 2023-09-29 Guizhou University Decision method for multi-user power control in wireless network

Also Published As

Publication number Publication date
CN114449536B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN112860350B (en) Task cache-based computation unloading method in edge computation
Elgendy et al. Joint computation offloading and task caching for multi-user and multi-task MEC systems: reinforcement learning-based algorithms
CN110225535B (en) Heterogeneous wireless network vertical switching method based on depth certainty strategy gradient
CN108235390B (en) Vertical switching method based on Bayesian decision in heterogeneous wireless network
CN110968426B (en) Edge cloud collaborative k-means clustering model optimization method based on online learning
CN107948083B (en) SDN data center congestion control method based on reinforcement learning
WO2019184836A1 (en) Data analysis device, and multi-model co-decision system and method
CN113784410B (en) Heterogeneous wireless network vertical switching method based on reinforcement learning TD3 algorithm
CN111026548B (en) Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN113938488A (en) Load balancing method based on dynamic and static weighted polling
Fan et al. Dnn deployment, task offloading, and resource allocation for joint task inference in iiot
CN114490057A (en) MEC unloaded task resource allocation method based on deep reinforcement learning
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
Liu et al. Fedpa: An adaptively partial model aggregation strategy in federated learning
CN110321217A (en) A kind of cloud resource dispatching method, device, equipment and the storage medium of multiple target
CN110162390A (en) A kind of method for allocating tasks and system of mist computing system
Wang et al. DeepNetQoE: Self-adaptive QoE optimization framework of deep networks
CN110996365B (en) Heterogeneous network vertical switching algorithm and system based on multi-objective optimization model
CN114449536B (en) 5G ultra-dense network multi-user access selection method based on deep reinforcement learning
Tang et al. Adaptive inference reinforcement learning for task offloading in vehicular edge computing systems
Zhou et al. Content placement with unknown popularity in fog radio access networks
Banerjee et al. Boosting exploration in actor-critic algorithms by incentivizing plausible novel states
CN112862060B (en) Content caching method based on deep learning
CN113543160A (en) 5G slice resource allocation method and device, computing equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant