CN114449536A - 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Info

Publication number
CN114449536A
Authority
CN
China
Prior art keywords
network
attribute
user
ultra
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210103469.2A
Other languages
Chinese (zh)
Other versions
CN114449536B (en)
Inventor
唐宏
刘小洁
陈榕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210103469.2A
Publication of CN114449536A
Application granted
Publication of CN114449536B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W16/00: Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22: Traffic simulation tools or models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/02: Arrangements for optimising operational condition
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W28/00: Network traffic management; Network resource management
    • H04W28/02: Traffic management, e.g. flow control or congestion control
    • H04W28/0289: Congestion control
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W76/00: Connection management
    • H04W76/10: Connection setup
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention relates to the technical field of machine learning and communications, and in particular to a 5G ultra-dense network multi-user access selection method based on deep reinforcement learning, which comprises the following steps: setting thresholds according to the minimum tolerance of users for each network attribute, and excluding networks that do not fully meet the requirements; determining the comprehensive weights of the network attributes using the analytic hierarchy process and the entropy weight method; obtaining weighted gray correlation coefficients by combining the gray correlation analysis method with the comprehensive network attribute weights; modeling the multi-user access problem as a multi-agent stochastic game model; and solving the optimal strategy with the deep deterministic policy gradient method. On the premise of guaranteeing the individual requirements of users with different services, the method can improve system throughput, reduce user congestion, and reduce latency.

Description

5G ultra-dense network multi-user access selection method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of machine learning and communication, in particular to a 5G ultra-dense network multi-user access selection method based on deep reinforcement learning.
Background
The 5G ultra-dense network architecture integrates different radio access technologies (RATs) and will support large-scale communication connections for massive internet-of-things devices. However, with the rapid growth of internet-of-things devices, increasingly personalized service requirements, and the ever-deepening heterogeneity of the network system, designing an intelligent network access selection scheme for user equipment has become a key challenge for 5G ultra-dense networks.
Researchers in this field have proposed a variety of access selection algorithms, which fall mainly into four categories: 1) single-parameter access algorithms, chiefly the early algorithms based on received signal strength (RSS); 2) access algorithms based on multi-attribute decision making (MADM), the more common of which include simple additive weighting (SAW), the multiplicative exponential weighting method (MEW), the analytic hierarchy process (AHP), the technique for order preference by similarity to an ideal solution (TOPSIS), and the gray correlation analysis method (GRA); 3) access algorithms based on intelligent algorithms, mainly fuzzy logic, artificial neural networks, Q-learning, the artificial bee colony algorithm, and the particle swarm algorithm; 4) prediction-based access algorithms, including mobility-trend prediction based on location information and network-information prediction based on the Markov decision process (MDP).
However, most existing network access selection methods optimize the selection policy only from the user side or only from the network side, which leads to serious network congestion, poor user experience, and degraded system performance. How to design a user access algorithm that comprehensively considers user preference, service demand, quality of experience, and actual network conditions, and that adapts to a dynamic, complex 5G ultra-dense network environment, therefore remains an open problem for current researchers.
Disclosure of Invention
Based on the problems in the prior art, the invention provides a 5G ultra-dense network multi-user access selection method based on deep reinforcement learning, which comprises the following steps:
s1: acquiring each network attribute, and setting a threshold value according to the minimum tolerance of a user to each network attribute; screening the network according to the set threshold value to obtain the network meeting the user requirement;
s2: determining the comprehensive attribute weight of each network by adopting an analytic hierarchy process and the entropy weight method;
s3: calculating a weighted gray correlation coefficient of the network by adopting a gray correlation analysis method according to the attribute comprehensive weight of the network;
s4: modeling the multi-user access problem as a multi-agent stochastic game model according to the weighted gray correlation coefficients;
s5: solving the optimal strategy of the multi-agent stochastic game model by adopting the deep deterministic policy gradient method, and accessing the user to the network according to the optimal strategy.
Preferably, the process of screening the network includes:
s11: acquiring attributes of each network, wherein the attributes comprise bandwidth, energy efficiency, time delay, jitter and tariff;
s12: acquiring minimum tolerance of a user to different network attributes, and calculating the average value of the minimum tolerance to obtain a reference network H, wherein the average value of the minimum tolerance of the reference network is a set threshold;
s13: comparing each candidate network with the reference network H; if any attribute of a candidate network is lower than the corresponding threshold, the network is deleted from the candidate set, and if all of its attributes are greater than or equal to the thresholds, the network is retained in the candidate set.
Preferably, the process of determining the comprehensive attribute weight of each network by using the analytic hierarchy process and the entropy weight method includes:
s21: constructing a hierarchical model, wherein the hierarchical model comprises a target layer, a criterion layer and a scheme layer; the target layer represents the optimal network that a user requesting a specific service desires to access, the criterion layer represents the requirements that the requested service places on the network parameters, and the scheme layer represents the candidate networks in the system model;
s22: comparing every two attributes of the same layer network, and constructing a decision attribute comparison matrix by adopting a 1-9 scale assignment method;
s23: normalizing the decision attribute comparison matrix to obtain a network attribute weight;
s24: calculating the consistency ratio of the matrix according to the decision attribute comparison matrix, setting a consistency ratio threshold, if the consistency ratio of the matrix is less than or equal to the set consistency ratio threshold, the network attribute weight is reasonable, otherwise, reconstructing the decision attribute comparison matrix until the consistency ratio threshold is met;
s25: constructing an attribute decision matrix, and standardizing the matrix; normalizing the standardized attribute decision matrix;
s26: calculating the information entropy of the attribute decision matrix after normalization processing;
s27: calculating the entropy weight of the network attribute according to the information entropy to obtain an objective weight of the network attribute;
s28: and processing the network attribute weight and the objective weight by adopting a product method, and normalizing the processed weights to obtain the comprehensive weight of the network attribute.
Further, the comprehensive weight of the network attribute is expressed as:

w_i^k = \frac{\alpha_i^k \beta_i}{\sum_{j=1}^{l} \alpha_j^k \beta_j}

where \alpha_i^k represents the subjective network attribute weight, i represents the network decision attribute type, k represents the user service type, \beta_i represents the objective weight of the network attribute, and l represents the total number of network decision attribute types.
Preferably, the process of calculating the weighted gray correlation coefficient of the network includes:
s31: constructing a network attribute matrix E;
s32: standardizing and normalizing the network attribute matrix to obtain a normalized network attribute matrix E';
s33: determining the attribute type of the network, and determining the attribute of the optimal reference network according to the attribute type;
s34: respectively calculating the correlation between different attributes in the candidate network and the corresponding attributes of the reference network to obtain grey correlation coefficients;
s35: and multiplying the gray correlation coefficient by the attribute comprehensive weight to obtain a weighted gray correlation coefficient.
Further, the gray correlation coefficient is calculated by the following formula:

\zeta_{ij} = \frac{\min_i \min_j |e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}{|e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}

where \rho represents the resolution coefficient, e'_{oj} represents the jth attribute of the reference network, and e'_{ij} represents the element in the ith row and jth column of the normalized network attribute matrix.
Preferably, the process of constructing the multi-agent stochastic game model comprises the following steps: obtaining the environment state set S, the action sets A_1, …, A_N and the agents' partial observation sets O_1, …, O_N of the environment, where S = \{s \mid s = (n_i^k)\}, with n_i^k denoting the number of end users connected to network i that are provided with service k, and A_n = \{a \mid a = 1, 2, …, m\}, with a denoting the candidate network selected by the user terminal in a specific network state. Each agent selects an action A_n according to its own policy, the policy function being expressed as \mu_{\theta_n}: O_n \to A_n. The next state of the environment is determined by the agents' selected actions and the transition probability P: S \times A_1 \times … \times A_N \to S. After each agent selects an action, it obtains a reward r_n from the environment and a new partial observation o_n of the environment; the reward function is expressed as r_n: S \times A_n \to R and the observation function as o_n: S \to O_n. The network utility function Q that characterizes the user's QoS is calculated from the weighted gray correlation coefficients; the reward of each user terminal is r_n(o_n, a_n) = Q(1 - \eta), where \eta represents the network load, and the cumulative return of each user terminal is expressed as

R_n = \sum_{t=0}^{T} \gamma^t r_n^t

where \gamma represents the discount coefficient and T represents the time range. The ultimate goal of each agent is to find the optimal policy that maximizes the expected cumulative return R_n, expressed as

\max_{\theta_n} J(\theta_n) = \mathbb{E}_{s \sim \rho^{\pi}, a_n \sim \mu_{\theta_n}}[R_n]

where \rho^{\pi} represents the state distribution function.
Preferably, the process of solving the optimal strategy includes:
s51: constructing an evaluation network and a strategy network; the evaluation network comprises a state estimation network and a state reality network, and the strategy network comprises an action estimation network and an action reality network;
s52: storing the switching decision data into a space D for repeated learning and training of the evaluation network and the strategy network; the switching decision data are stored as a tuple (s_t, a_t^1, …, a_t^N, r_t^1, …, r_t^N, s_{t+1}), comprising the environment state at decision time t, the actions and rewards of all agents, and the environment state at the next decision time t + 1;
s53: randomly extracting M groups of switching data in the storage space D as training samples, and inputting the training samples into a state estimation network for training to obtain a trained state estimation network;
s54: updating the action estimation network by means of the policy gradient: calculating the policy gradient, updating the action estimation network parameters \theta_n in the direction of increasing the value function by the gradient ascent method, and acquiring the Q value of the state-action pair through the trained state estimation network;
s55: respectively carrying out soft updating on the state reality network and the action reality network by adopting a moving average method according to the parameters of the state estimation network and the action estimation network;
s56: performing L loop iterations, each of which starts from the initial moment with an initialized network state and runs until the termination moment; performing operation steps S53-S55 at each decision moment to train and update the network parameters; and determining the optimal access strategy according to the trained strategy network.
To achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements any one of the above 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
In order to achieve the above object, the present invention further provides a 5G ultra-dense network multi-user access selection device based on deep reinforcement learning, which includes a processor and a memory; the memory is used for storing a computer program; the processor is connected with the memory and is used for executing the computer program stored in the memory, so that the 5G ultra-dense network multi-user access selection device based on deep reinforcement learning executes any one of the above 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
The invention has the beneficial effects that:
aiming at the problem of multi-user access selection in a 5G ultra-dense network, the invention optimizes network selection from both the user side and the network side. The networks are pre-screened before the network decision, which reduces the number of handovers. To better meet differentiated service requirements and efficiently guarantee the user experience, the network selection problem is modeled with a Markov decision model, and the analytic hierarchy process, the entropy weight method and the gray correlation analysis method are jointly adopted so that, where network conditions allow, each user is matched to its preferred network. For the complex decision problem in a dynamic network, the optimal strategy is solved by the deep deterministic policy gradient method, so that user congestion is reduced and system throughput is improved on the premise of guaranteeing the user experience.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a block diagram of a hierarchy for determining subjective weighting of parameters in accordance with the present invention;
FIG. 3 is a diagram of a system model of the present invention;
FIG. 4 is a diagram of the structure of MADDPG according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A 5G ultra-dense network multi-user access selection method based on deep reinforcement learning, as shown in fig. 1, the method includes:
s1: acquiring each network attribute, and setting a threshold value according to the lowest tolerance of a user to each network attribute; screening the network according to the set threshold value to obtain the network meeting the user requirement;
s2: determining the comprehensive attribute weight of each network by adopting an analytic hierarchy process and the entropy weight method;
s3: calculating a weighted gray correlation coefficient of the network by adopting a gray correlation analysis method according to the attribute comprehensive weight of the network;
s4: modeling the multi-user access problem as a multi-agent stochastic game model according to the weighted gray correlation coefficients;
s5: solving the optimal strategy of the multi-agent stochastic game model by adopting the deep deterministic policy gradient method, and accessing the user to the network according to the optimal strategy.
In a possible implementation, the step of S1, setting a threshold according to the minimum tolerance of the user to each network attribute, and excluding completely unsatisfactory networks, includes:
s11: acquiring the attribute of each network; the network attributes include: bandwidth, energy efficiency, latency, jitter, and tariff.
S12: acquiring the minimum tolerance of a user to different network attributes, and calculating the average value of the minimum tolerance to obtain a reference network H, wherein the average value of the minimum tolerance of the reference network is a set threshold; specifically, the lowest tolerance of the user to different network attributes is collected, and an average value is taken to obtain a reference network H, which is specifically represented as:
H = (h'_1, h'_2, …, h'_l), \quad h'_j = \frac{1}{N} \sum_{n=1}^{N} h_{nj}

where n \in N represents the nth user, N is the total number of users, j represents the jth network attribute, h_{nj} is the minimum tolerance of user n for network attribute j, and h'_j represents the threshold, i.e. the average minimum tolerance of the users for network attribute j.
S13: and comparing the candidate network with the reference network H, deleting the network in the candidate network if one attribute in the candidate network is lower than the set threshold, and keeping the network in the candidate network if all the attributes in the candidate network are greater than or equal to the set threshold.
In one possible embodiment, as shown in fig. 2, the step of determining subjective weights of network attributes by using an analytic hierarchy process in S2 includes:
A hierarchical model is constructed. The decision scheme is divided into three layers: the target layer (representing the optimal network that a user requesting a particular service desires to access), the criterion layer (representing the requirements that the requested service places on the network parameters), and the scheme layer (representing the candidate networks in the system model).
Judgment comparison matrices are constructed for the different internet-of-things services according to the importance of the different network attributes in network selection; the services comprise the intelligent health service, the VR & AR service and the industrial machinery service, expressed as k \in K = \{1, 2, 3\}. The attributes of the same layer are compared pairwise, and the comparison judgment matrix is constructed with the 1-9 scale assignment method, expressed as:

C^k = (c_{ij}^k)_{l \times l}

where c_{ij}^k represents the comparison scale value of the ith attribute c_i relative to the jth attribute c_j under the requirements of user service k, and l represents the total number of network attribute types; the values of c_{ij}^k are shown in Table 1.
TABLE 1
[1-9 scale assignment values; the table is reproduced as an image in the original publication]
As an emerging medical application, the intelligent health service provides telemedicine, and in particular telesurgery, for patients; precise remote operations need to be performed in a stable environment. The intelligent health service therefore has strict requirements on delay and jitter, and its judgment matrix is constructed as shown in Table 2.
TABLE 2
[Judgment matrix for the intelligent health service; reproduced as an image in the original publication]
Typical application scenarios of the VR & AR service include holographic navigation and intelligent VR gaming, which require high bandwidth to provide a good user experience; the judgment matrix for the VR & AR service is therefore constructed as shown in Table 3.
TABLE 3
[Judgment matrix for the VR & AR service; reproduced as an image in the original publication]
Application scenarios of the industrial machinery service include real-time monitoring of production equipment and remote control of construction machinery, which place high requirements on delay; the judgment matrix for the industrial machinery service is therefore constructed as shown in Table 4.
TABLE 4
[Judgment matrix for the industrial machinery service; reproduced as an image in the original publication]
Each column of C^k is normalized to obtain the normalized matrix, expressed as:

\bar{c}_{ij}^k = \frac{c_{ij}^k}{\sum_{i=1}^{l} c_{ij}^k}
The weight values are then calculated from the normalized decision attributes; these are the subjective weights, expressed as:

\alpha_i^k = \frac{1}{l} \sum_{j=1}^{l} \bar{c}_{ij}^k
A consistency index is defined as

CI = \frac{\lambda_{max} - l}{l - 1}

where \lambda_{max} represents the maximum eigenvalue of the judgment matrix C^k. The consistency ratio is then CR = CI / RI, where RI is the random consistency index, which can be obtained from Table 5. When CR \le 0.1 the weights are considered reasonable; otherwise the judgment matrix must be reconstructed until the condition is met.
TABLE 5
n    1     2     3     4     5     6     7     8     9
RI   0     0     0.52  0.89  1.12  1.26  1.36  1.41  1.46
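The following Python sketch illustrates how the subjective weights and the consistency ratio of this step can be computed. The example judgment matrix is invented for illustration and is not one of the patent's Tables 2-4.

```python
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.52, 4: 0.89, 5: 1.12, 6: 1.26, 7: 1.36, 8: 1.41, 9: 1.46}  # Table 5

def ahp_weights(C: np.ndarray):
    """Column-normalize the judgment matrix, average the rows to get subjective weights, and compute CR."""
    l = C.shape[0]
    col_norm = C / C.sum(axis=0, keepdims=True)
    alpha = col_norm.mean(axis=1)                     # subjective weights alpha_i^k
    lam_max = np.max(np.linalg.eigvals(C).real)       # maximum eigenvalue of C^k
    ci = (lam_max - l) / (l - 1)
    cr = ci / RI[l] if RI[l] > 0 else 0.0
    return alpha, cr

if __name__ == "__main__":
    # Illustrative 5x5 judgment matrix (bandwidth, energy efficiency, delay, jitter, tariff)
    C = np.array([
        [1,   3,   1/5, 1/3, 5],
        [1/3, 1,   1/7, 1/5, 3],
        [5,   7,   1,   3,   9],
        [3,   5,   1/3, 1,   7],
        [1/5, 1/3, 1/9, 1/7, 1],
    ])
    alpha, cr = ahp_weights(C)
    print("subjective weights:", np.round(alpha, 3), "CR =", round(cr, 3))  # CR <= 0.1 means acceptable
```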
In a possible implementation, in S2, solving the objective network attribute weights by the entropy weight method includes:
constructing the attribute decision matrix X = (x_{ij})_{m \times l} and standardizing it, the result being denoted B = (b_{ij})_{m \times l}.
For benefit-type attributes (such as bandwidth and energy efficiency), a larger raw value yields a larger standardized value b_{ij} (the standardization formula is reproduced as an image in the original publication).
for cost-type attributes (such as delay, jitter, etc.), the normalized expression is:
Figure BDA0003493064740000102
the judgment matrix is normalized, and the normalized value of the j attribute of the ith network is
Figure BDA0003493064740000103
From this, the information entropy of network attribute j can be determined as

E_j = -\frac{1}{\ln m} \sum_{i=1}^{m} p_{ij} \ln p_{ij}

where m is the number of candidate networks and p_{ij} is the normalized value defined above.
The entropy weight is calculated from the information entropy to obtain the objective weight \beta_j of network attribute j, expressed as

\beta_j = \frac{1 - E_j}{\sum_{j=1}^{l} (1 - E_j)}
The comprehensive weight of the network attribute is obtained by the product method followed by normalization, expressed as:

w_i^k = \frac{\alpha_i^k \beta_i}{\sum_{j=1}^{l} \alpha_j^k \beta_j}

where \alpha_i^k represents the subjective network attribute weight, i represents the network decision attribute type, k represents the user service type, \beta_i represents the objective weight of the network attribute, and l represents the total number of network decision attribute types.
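A minimal Python sketch of the entropy-weight step and the product-method combination follows. The min-max standardization, the illustrative subjective weights and the benefit/cost orientation of the attributes are assumptions of this sketch, since the patent's standardization formulas are only reproduced as images.

```python
import numpy as np

def entropy_weights(X: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    """Entropy-weight method: standardize (benefit vs cost), normalize columns, compute E_j and beta_j."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)
    B = np.where(benefit, (X - mn) / span, (mx - X) / span)    # min-max standardization (assumed form)
    P = B / np.maximum(B.sum(axis=0, keepdims=True), 1e-12)
    m = X.shape[0]
    E = -(P * np.log(np.maximum(P, 1e-12))).sum(axis=0) / np.log(m)
    return (1.0 - E) / (1.0 - E).sum()                          # objective weights beta_j

def combined_weights(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Product method plus normalization: w_j = alpha_j * beta_j / sum(alpha * beta)."""
    w = alpha * beta
    return w / w.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(0.1, 1.0, size=(6, 5))                 # 6 candidate networks x 5 attributes
    benefit = np.array([True, True, False, False, False])  # bandwidth/energy efficiency benefit; delay/jitter/tariff cost
    alpha = np.array([0.25, 0.10, 0.35, 0.20, 0.10])       # subjective weights from the AHP step (illustrative)
    beta = entropy_weights(X, benefit)
    print("objective weights:", np.round(beta, 3))
    print("combined weights :", np.round(combined_weights(alpha, beta), 3))
```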
In a possible embodiment, the step of S3, obtaining the weighted gray correlation coefficients by combining the gray correlation analysis method with the comprehensive network attribute weights, includes:
s31: constructing a network attribute matrix E, wherein the constructed expression is as follows:
E = (e_{ij})_{m \times l}

where m represents the number of candidate networks, l represents the total number of network decision attribute types, and e_{ij} represents attribute j of network i.
S32: normalizing and normalizing the network attributes by using the method mentioned in step S2 to obtain a matrix E ', the expression of the matrix E' is:
E' = (e'_{ij})_{m \times l}
s33: determining the attribute type of each network, and determining the attributes of the optimal reference network according to the attribute types. When an attribute is a benefit-type attribute, the optimal attribute value is the maximum value in its column; when it is a cost-type attribute, the optimal attribute value is the minimum value in its column. The optimal reference attribute vector is denoted as
x_o = (e'_{o1}, e'_{o2}, …, e'_{ol})
S34: and respectively calculating the correlation between different attributes in the candidate network and the corresponding attributes of the reference network to obtain a grey correlation coefficient. The expression for the grey correlation coefficient is:
\zeta_{ij} = \frac{\min_i \min_j |e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}{|e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}

where \rho represents the resolution coefficient and \rho \in [0, 1].
S35: and multiplying the gray correlation coefficient by the attribute comprehensive weight to obtain a weighted gray correlation coefficient. The expression for the weighted gray correlation coefficient is:
\zeta'_{ij} = w_j^k \zeta_{ij}

where w_j^k is the comprehensive weight of attribute j under user service k.
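The weighted gray correlation computation of steps S31-S35 can be sketched in Python as follows. The resolution coefficient value, the sample data and the use of the row sum as a per-network utility are illustrative assumptions.

```python
import numpy as np

def weighted_grey_coefficients(E_norm: np.ndarray, benefit: np.ndarray,
                               w: np.ndarray, rho: float = 0.5) -> np.ndarray:
    """Gray correlation coefficients against the optimal reference row, scaled by the combined weights."""
    # Optimal reference: column maximum for benefit attributes, column minimum for cost attributes
    ref = np.where(benefit, E_norm.max(axis=0), E_norm.min(axis=0))
    diff = np.abs(ref - E_norm)                               # |e'_oj - e'_ij|
    d_min, d_max = diff.min(), diff.max()
    zeta = (d_min + rho * d_max) / (diff + rho * d_max + 1e-12)
    return zeta * w                                           # weighted coefficients zeta'_ij = w_j * zeta_ij

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    E_norm = rng.uniform(0.0, 1.0, size=(6, 5))               # normalized attribute matrix E'
    benefit = np.array([True, True, False, False, False])
    w = np.array([0.3, 0.1, 0.3, 0.2, 0.1])                   # combined weights (illustrative)
    zw = weighted_grey_coefficients(E_norm, benefit, w)
    print("per-network utility (row sums):", np.round(zw.sum(axis=1), 3))
```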
in a possible implementation, the modeling of the multi-user access problem as the multi-user random game model in S4 includes:
in order to distinguish the previous research on the heterogeneous network selection model, reflect the density and the heterogeneous type of 5G, and construct a heterogeneous network system model consisting of 5G, LTE, WiMAX, UMTS BS and two Wi-Fi 6 APs, i.e. m is 6, as shown in fig. 3. In a heterogeneous network environment, 100 user terminals are randomly generated, the terminals randomly generate K which is 3 service types, the arrival and departure of the terminals are distributed according to Poisson, and the service time is distributed according to a negative index.
The Markov decision process (MDP) is extended to a partially observable Markov decision process for multiple agents. A Markov game of N agents is defined by a set of environment states S, action sets A_1, …, A_N, and the agents' partial observation sets O_1, …, O_N of the environment. Each agent selects an action A_n according to its own policy, the policy function being expressed as \mu_{\theta_n}: O_n \to A_n. The environment transfers to the next state with a certain probability according to the actions selected by the agents, the transition probability being P: S \times A_1 \times … \times A_N \to S. After each agent selects an action, it obtains a reward r_n and a new partial observation o_n from the environment; the reward function is expressed as r_n: S \times A_n \to R and the observation function as o_n: S \to O_n. The initial state of the environment is determined by the distribution function \rho: S \to [0, 1]. The goal of each agent n is to maximize its own total expected return, the return function being

R_n = \sum_{t=0}^{T} \gamma^t r_n^t

where \gamma denotes the discount coefficient and T denotes the time range.
In the multi-user access selection problem, the arrival and departure of users cause changes in the environment state; the environment state S is therefore defined as

S = \{s \mid s = (n_1^1, …, n_i^k, …, n_m^K)\}

where n_i^k represents the number of end users connected to network i that are provided with service k.
In the multi-user access selection problem, the user terminal takes a network selection action in each environment state, so the action of each agent in the model is defined as A_n = \{a \mid a = 1, 2, …, m\}, which represents the candidate network selected by the user terminal in a specific network state.
In the multi-user access selection problem, the goal of each user terminal is to select the network that maximizes the expected cumulative network utility. Therefore, the network utility function Q that characterizes the user's QoS is calculated from the weighted gray correlation coefficients. From this, the reward of each user terminal is derived as r_n(o_n, a_n) = Q(1 - \eta), where \eta represents the network load, and the cumulative return of each user terminal over the time horizon is

R_n = \sum_{t=0}^{T} \gamma^t r_n^t(o_n^t, a_n^t)

where \gamma represents the discount coefficient and T represents the time range. The ultimate goal of each agent is to find the optimal policy that maximizes the expected cumulative return R_n, expressed as

\max_{\theta_n} J(\theta_n) = \mathbb{E}_{s \sim \rho^{\pi}, a_n \sim \mu_{\theta_n}}[R_n]

where \rho^{\pi} represents the state distribution function.
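The reward and return defined above can be sketched in Python as follows. The use of the sum of weighted gray correlation coefficients as the utility Q, and the numeric values, are illustrative assumptions.

```python
import numpy as np

def reward(weighted_zeta_row: np.ndarray, load: float) -> float:
    """Immediate reward r_n(o_n, a_n) = Q * (1 - eta): QoS utility of the chosen network scaled by its free capacity."""
    q = weighted_zeta_row.sum()   # utility from the weighted gray correlation coefficients (assumed sum form)
    return q * (1.0 - load)

def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    """Cumulative discounted return R_n = sum_t gamma^t * r_t over the time horizon T."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

if __name__ == "__main__":
    zeta_row = np.array([0.21, 0.08, 0.27, 0.15, 0.06])   # weighted coefficients of the selected network (illustrative)
    rs = [reward(zeta_row, load) for load in (0.2, 0.35, 0.5)]
    print("immediate rewards:", np.round(rs, 3))
    print("discounted return:", round(discounted_return(rs), 3))
```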
In a possible implementation, as shown in fig. 4, the step S5 of solving the optimal policy by the deep deterministic policy gradient method includes:
The evaluation network and the policy network are constructed as BP neural networks; in this embodiment each neural network consists of an input layer, a hidden layer and an output layer. Each agent has its own evaluation network and policy network.
The user terminal continuously interacts with the environment: in the initial state s_t, the deterministic action output by the policy network is perturbed into a stochastic process to select the action a_t^n. After each agent executes its action, it obtains the corresponding reward r_t^n from the environment and the next network state s_{t+1}, and the next step begins; the loop continues until the moment the connection is terminated, after which the next episode begins and the cycle repeats. The data generated by the interaction process, (s_t, a_t^1, …, a_t^N, r_t^1, …, r_t^N, s_{t+1}), are stored in the space D for repeated learning and training of the neural networks;
when the number of samples stored in the storage space D meets the requirement of neural network training on the data volume, after each interaction between the agent and the environment, a small batch of switching samples are randomly sampled and extracted from the storage space D and used as training data of the state estimation network and the action estimation network.
Step 1: training the state estimation network by the stochastic gradient descent method.
a loss function is constructed using a mean square error model, expressed as:
L(\theta_n) = \mathbb{E}\big[(Q_n(x_t, a_t^1, …, a_t^N) - y_t)^2\big]

where Q_n(x_t, a_t^1, …, a_t^N) is the estimated value output by the state estimation network, x_t = (o_t^1, …, o_t^N) is the local observation of the environment by all agents, and y_t, the true value of the sample regarded as the label, is expressed as:

y_t = r_t^n + \gamma Q'_n(x_{t+1}, a_{t+1}^1, …, a_{t+1}^N)\big|_{a_{t+1}^j = \mu'_j(o_{t+1}^j)}

where \gamma (0 \le \gamma \le 1) is the discount factor, Q'_n is the Q value estimated by the state reality network, and \mu'_j is the deterministic policy output by the action reality network.
Based on the constructed loss function, the minimum of the loss function is solved iteratively by back-propagation with the gradient descent method, so as to update the state estimation network.
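A minimal PyTorch sketch of this state estimation (critic) network update is given below; the network sizes, hyperparameters and random batch are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """State estimation network Q_n(x, a_1..a_N): input layer, one hidden layer, output layer, as described above."""
    def __init__(self, obs_dim: int, act_dim: int, n_agents: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * (obs_dim + act_dim), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_all: torch.Tensor, act_all: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs_all, act_all], dim=-1))

def critic_update(critic, target_critic, optimizer, batch, gamma: float = 0.95) -> float:
    """One gradient-descent step on the MSE loss (Q(x_t, a_t) - y_t)^2 with y_t = r_t + gamma * Q'(x_{t+1}, a_{t+1})."""
    obs, act, rew, next_obs, next_act = batch
    with torch.no_grad():
        y = rew + gamma * target_critic(next_obs, next_act)   # label from the state reality (target) network
    loss = nn.functional.mse_loss(critic(obs, act), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    obs_dim, act_dim, n_agents, m_batch = 6, 6, 3, 32
    critic = Critic(obs_dim, act_dim, n_agents)
    target = Critic(obs_dim, act_dim, n_agents)
    target.load_state_dict(critic.state_dict())
    opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    batch = (torch.randn(m_batch, n_agents * obs_dim), torch.randn(m_batch, n_agents * act_dim),
             torch.randn(m_batch, 1), torch.randn(m_batch, n_agents * obs_dim),
             torch.randn(m_batch, n_agents * act_dim))
    print("critic loss:", critic_update(critic, target, opt, batch))
```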
Step 2: training the action estimation network by the stochastic gradient ascent method.
After the state estimation network is updated, the action estimation network is updated by means of the policy gradient, which is expressed as:

\nabla_{\theta_n} J \approx \mathbb{E}\big[\nabla_{\theta_n} \mu_n(o_t^n)\, \nabla_{a_n} Q_n(x_t, a_t^1, …, a_t^N)\big|_{a_n = \mu_n(o_t^n)}\big]

After the policy gradient is obtained, the action estimation network parameters \theta_n are updated in the direction of increasing the value function by the gradient ascent method, and the Q value of the state-action pair is acquired through the state estimation network.
Step 3: updating the state reality network and the action reality network by the soft update method.
The parameters of the state reality network and the action reality network are soft-updated from the parameters of the state estimation network and the action estimation network, respectively, using a moving average, expressed as:

\theta^{Q'}_n \leftarrow \alpha \theta^{Q}_n + (1 - \alpha) \theta^{Q'}_n
\theta^{\mu'}_n \leftarrow \alpha \theta^{\mu}_n + (1 - \alpha) \theta^{\mu'}_n

where \alpha is the learning rate.
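A short Python sketch of this moving-average soft update is given below, with alpha as the soft-update coefficient described above; the toy layers are illustrative.

```python
import torch

def soft_update(estimate_net: torch.nn.Module, reality_net: torch.nn.Module, alpha: float = 0.01) -> None:
    """Moving-average soft update: theta_reality <- alpha * theta_estimate + (1 - alpha) * theta_reality."""
    with torch.no_grad():
        for p_est, p_real in zip(estimate_net.parameters(), reality_net.parameters()):
            p_real.data.mul_(1.0 - alpha).add_(alpha * p_est.data)

if __name__ == "__main__":
    est = torch.nn.Linear(4, 2)
    real = torch.nn.Linear(4, 2)
    soft_update(est, real, alpha=0.01)
    print("soft update applied; sample weight:", real.weight[0, 0].item())
```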
The above is iterated for L loops, each of which starts from the initial moment with an initialized network state and runs until the termination moment; at each decision moment, the operations of step 1 to step 3 are performed and the network parameters are trained and updated; the optimal access strategy is then determined according to the trained policy network.
In an embodiment of the present invention, the present invention further includes a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the above-mentioned 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
A5G ultra-dense network multi-user access selection device based on deep reinforcement learning comprises a processor and a memory; the memory is used for storing a computer program; the processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the 5G ultra-dense network multi-user access selection device based on deep reinforcement learning to execute any one of the 5G ultra-dense network multi-user access selection methods based on deep reinforcement learning.
Specifically, the memory includes: various media that can store program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
Preferably, the Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A5G ultra-dense network multi-user access selection method based on deep reinforcement learning is characterized by comprising the following steps:
s1: acquiring each network attribute, and setting a threshold value according to the lowest tolerance of a user to each network attribute; screening the network according to the set threshold value to obtain the network meeting the user requirement;
s2: determining the comprehensive attribute weight of each network by adopting an analytic hierarchy process and the entropy weight method;
s3: calculating a weighted gray correlation coefficient of the network by adopting a gray correlation analysis method according to the attribute comprehensive weight of the network;
s4: modeling the multi-user access problem as a multi-agent stochastic game model according to the weighted gray correlation coefficients;
s5: solving the optimal strategy of the multi-agent stochastic game model by adopting the deep deterministic policy gradient method, and accessing the user to the network according to the optimal strategy.
2. The method for selecting the 5G ultra-dense network multi-user access based on the deep reinforcement learning of claim 1, wherein the process of screening the network comprises:
s11: acquiring attributes of each network, wherein the attributes comprise bandwidth, energy efficiency, time delay, jitter and tariff;
s12: acquiring minimum tolerance of a user to different network attributes, and calculating the average value of the minimum tolerance to obtain a reference network H, wherein the average value of the minimum tolerance of the reference network is a set threshold;
s13: and comparing the candidate network with the reference network H, deleting the network in the candidate network if one attribute in the candidate network is lower than the set threshold, and keeping the network in the candidate network if all the attributes in the candidate network are greater than or equal to the set threshold.
3. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 1, wherein the process of determining the comprehensive weight of the attributes of each network by using an analytic hierarchy process and an entropy weight method comprises:
s21: constructing a hierarchical model, wherein the hierarchical model comprises a target layer, a criterion layer and a scheme layer, the target layer represents an optimal network which a user requesting a specific service desires to access, the criterion layer represents the requirement of service request service on network parameters, and the scheme layer represents a candidate network in a system model;
s22: comparing every two attributes of the same layer network, and constructing a decision attribute comparison matrix by adopting a 1-9 scale assignment method;
s23: normalizing the decision attribute comparison matrix to obtain a network attribute weight;
s24: calculating the consistency ratio of the matrix according to the decision attribute comparison matrix, setting a consistency ratio threshold, if the consistency ratio of the matrix is less than or equal to the set consistency ratio threshold, the network attribute weight is reasonable, otherwise, reconstructing the decision attribute comparison matrix until the consistency ratio threshold is met;
s25: constructing an attribute decision matrix, and standardizing the matrix; normalizing the standardized attribute decision matrix;
s26: calculating the information entropy of the attribute decision matrix after normalization processing;
s27: calculating the entropy weight of the network attribute according to the information entropy to obtain an objective weight of the network attribute;
s28: and processing the network attribute weight and the objective weight by adopting a product method, and normalizing the processed weights to obtain the comprehensive weight of the network attribute.
4. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 3, wherein the expression of the comprehensive weight of the network attribute is as follows:
w_i^k = \frac{\alpha_i^k \beta_i}{\sum_{j=1}^{l} \alpha_j^k \beta_j}

where \alpha_i^k represents the subjective network attribute weight, i represents the network decision attribute type, k represents the user service type, \beta_i represents the objective weight of the network attribute, and l represents the total number of network decision attribute types.
5. The method for selecting the 5G ultra-dense network multi-user access based on the deep reinforcement learning of claim 1, wherein the process of calculating the weighted gray correlation coefficient of the network comprises:
s31: constructing a network attribute matrix E;
s32: standardizing and normalizing the network attribute matrix to obtain a normalized network attribute matrix E';
s33: determining the attribute type of the network, and determining the attribute of the optimal reference network according to the attribute type;
s34: respectively calculating the correlation of different attributes in the candidate network with the corresponding attributes of the reference network to obtain grey correlation coefficients;
s35: and multiplying the gray correlation coefficient by the attribute comprehensive weight to obtain a weighted gray correlation coefficient.
6. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 5, wherein the grey correlation coefficient is calculated by the following formula:
\zeta_{ij} = \frac{\min_i \min_j |e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}{|e'_{oj} - e'_{ij}| + \rho \max_i \max_j |e'_{oj} - e'_{ij}|}

where \rho represents the resolution coefficient, e'_{oj} represents the jth attribute of the reference network, and e'_{ij} represents the element in the ith row and jth column of the normalized network attribute matrix.
7. The 5G ultra-dense network multi-user access selection method based on deep reinforcement learning of claim 1, wherein the process of constructing the multi-agent stochastic game model comprises:
obtaining the environment state set S, the action sets A_1, …, A_N and the agents' partial observation sets O_1, …, O_N of the environment, where S = \{s \mid s = (n_i^k)\}, with n_i^k denoting the number of end users connected to network i that are provided with service k, and A_n = \{a \mid a = 1, 2, …, m\}, with a denoting the candidate network selected by the user terminal in a specific network state;
each agent selects an action A_n according to its own policy, the policy function being expressed as \mu_{\theta_n}: O_n \to A_n; the next state of the environment is determined by the agents' selected actions and the transition probability P: S \times A_1 \times … \times A_N \to S; after each agent selects an action, it obtains a reward r_n from the environment and a new partial observation o_n of the environment; the reward function is expressed as r_n: S \times A_n \to R and the observation function as o_n: S \to O_n;
the network utility function Q that characterizes the user's QoS is calculated from the weighted gray correlation coefficients; the reward of each user terminal is r_n(o_n, a_n) = Q(1 - \eta), where \eta represents the network load, and the cumulative return of each user terminal is expressed as R_n = \sum_{t=0}^{T} \gamma^t r_n^t, where \gamma represents the discount coefficient and T represents the time range; the optimal policy is found according to the return function so as to maximize the expected cumulative return R_n, i.e.

\max_{\theta_n} J(\theta_n) = \mathbb{E}_{s \sim \rho^{\pi}, a_n \sim \mu_{\theta_n}}[R_n]

where \rho^{\pi} represents the state distribution function, J(\theta_n) represents the expectation of the cumulative return R_n, \mathbb{E} denotes the expectation operator, \theta_n represents the neural network parameters, and \mu_{\theta_n} represents the agent policy function.
8. The deep reinforcement learning-based 5G ultra-dense network multi-user access selection method according to claim 1, wherein the process of solving the optimal strategy comprises:
s51: constructing an evaluation network and a strategy network; the evaluation network comprises a state estimation network and a state reality network, and the strategy network comprises an action estimation network and an action reality network;
s52: storing the switching decision data into a space D for repeated learning and training of the evaluation network and the strategy network; the switching decision data are stored as a tuple (s_t, a_t^1, …, a_t^N, r_t^1, …, r_t^N, s_{t+1}), comprising the environment state at decision time t, the actions and rewards of all agents, and the environment state at the next decision time t + 1;
s53: randomly extracting M groups of switching data in the storage space D as training samples, and inputting the training samples into a state estimation network for training to obtain a trained state estimation network;
s54: updating the action estimation network by means of the policy gradient: calculating the policy gradient, updating the action estimation network parameters \theta_n in the direction of increasing the value function by the gradient ascent method, and acquiring the Q value of the state-action pair through the trained state estimation network;
s55: respectively carrying out soft updating on the state reality network and the action reality network by adopting a moving average method according to the parameters of the state estimation network and the action estimation network;
s56: performing loop iteration for L times, wherein each loop iteration starts from the initial moment to obtain an initialized network state till the termination moment; performing operation steps S53-S55 at each decision moment, and training and updating network parameters; and determining an optimal access strategy according to the trained strategy network.
9. A computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method for deep reinforcement learning based 5G ultra dense network multi-user access selection according to any one of claims 1 to 8.
10. A5G ultra-dense network multi-user access selection device based on deep reinforcement learning is characterized by comprising a processor and a memory; the memory is used for storing a computer program; the processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the deep reinforcement learning-based 5G ultra-dense network multi-user access selection device to execute the deep reinforcement learning-based 5G ultra-dense network multi-user access selection method in any one of claims 1 to 8.
CN202210103469.2A 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning Active CN114449536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103469.2A CN114449536B (en) 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210103469.2A CN114449536B (en) 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114449536A true CN114449536A (en) 2022-05-06
CN114449536B CN114449536B (en) 2023-05-23

Family

ID=81370383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103469.2A Active CN114449536B (en) 2022-01-27 2022-01-27 5G ultra-dense network multi-user access selection method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114449536B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103108382A (en) * 2012-11-20 2013-05-15 南京邮电大学 Heterogeneous network multi-attribute decision-making method based on network analytic hierarchy process
CN103607756A (en) * 2013-10-29 2014-02-26 南京邮电大学 Heterogeneous-network multiattribute decision-making method based on synergetic principle
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN107734512A (en) * 2017-09-30 2018-02-23 南京南瑞集团公司 A kind of network selecting method based on the analysis of gray scale relevance presenting levelses
CN110225535A (en) * 2019-06-04 2019-09-10 吉林大学 Heterogeneous wireless network vertical handoff method based on depth deterministic policy gradient

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GUANGDONG TIAN: "AHP, Gray Correlation, and TOPSIS Combined Approach to Green Performance Evaluation of Design Alternatives", IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS *
FENG Bao; GAO Xue; GONG Liangliang: "Heterogeneous network selection algorithm based on gray correlation hierarchical analysis in power wireless communication systems", Electronic Design Engineering
TANG Hong: "Research on improvement of the practical Byzantine fault tolerance algorithm", Computer Engineering and Applications
LI Wanghong: "Network selection algorithm based on machine learning in heterogeneous networks", China Master's Theses Full-text Database
QIN Yonggang; DOU Zhumei; HU Gang: "Heterogeneous wireless network selection algorithm based on dynamic weights", Computer and Information Technology

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822329A (en) * 2023-05-11 2023-09-29 Guizhou University Decision method for multi-user power control in wireless network

Also Published As

Publication number Publication date
CN114449536B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN112860350B (en) Task cache-based computation unloading method in edge computation
Elgendy et al. Joint computation offloading and task caching for multi-user and multi-task MEC systems: reinforcement learning-based algorithms
CN110225535B (en) Heterogeneous wireless network vertical switching method based on depth certainty strategy gradient
CN108235390B (en) Vertical switching method based on Bayesian decision in heterogeneous wireless network
CN110968426B (en) Edge cloud collaborative k-means clustering model optimization method based on online learning
CN107948083B (en) SDN data center congestion control method based on reinforcement learning
WO2019184836A1 (en) Data analysis device, and multi-model co-decision system and method
CN113784410B (en) Heterogeneous wireless network vertical switching method based on reinforcement learning TD3 algorithm
CN111026548B (en) Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN113938488A (en) Load balancing method based on dynamic and static weighted polling
Fan et al. Dnn deployment, task offloading, and resource allocation for joint task inference in iiot
CN114490057A (en) MEC unloaded task resource allocation method based on deep reinforcement learning
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
Liu et al. Fedpa: An adaptively partial model aggregation strategy in federated learning
CN110321217A (en) A kind of cloud resource dispatching method, device, equipment and the storage medium of multiple target
CN110162390A (en) A kind of method for allocating tasks and system of mist computing system
Wang et al. DeepNetQoE: Self-adaptive QoE optimization framework of deep networks
CN110996365B (en) Heterogeneous network vertical switching algorithm and system based on multi-objective optimization model
CN114449536B (en) 5G ultra-dense network multi-user access selection method based on deep reinforcement learning
Tang et al. Adaptive inference reinforcement learning for task offloading in vehicular edge computing systems
Zhou et al. Content placement with unknown popularity in fog radio access networks
Banerjee et al. Boosting exploration in actor-critic algorithms by incentivizing plausible novel states
CN112862060B (en) Content caching method based on deep learning
CN113543160A (en) 5G slice resource allocation method and device, computing equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant