CN116456493A - D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm - Google Patents

D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm Download PDF

Info

Publication number
CN116456493A
CN116456493A
Authority
CN
China
Prior art keywords
user
network
representing
strategy
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310426343.3A
Other languages
Chinese (zh)
Inventor
Li Jun
Liu Xingxin
Liu Ziyi
Shen Guoli
Zhang Qianqian
Li Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi University
Original Assignee
Wuxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi University filed Critical Wuxi University
Priority to CN202310426343.3A priority Critical patent/CN116456493A/en
Publication of CN116456493A publication Critical patent/CN116456493A/en
Pending legal-status Critical Current

Links

Classifications

    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04W  WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00  Local resource management
    • H04W 72/40  Resource management for direct mode communication, e.g. D2D or sidelink
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00  Computing arrangements based on biological models
    • G06N 3/02  Neural networks
    • G06N 3/04  Architecture, e.g. interconnection topology
    • G06N 3/045  Combinations of networks
    • G06N 3/08  Learning methods
    • G06N 3/092  Reinforcement learning
    • G06N 7/00  Computing arrangements based on specific mathematical models
    • G06N 7/01  Probabilistic graphical models, e.g. probabilistic networks
    • H04W 24/00  Supervisory, monitoring or testing arrangements
    • H04W 24/02  Arrangements for optimising operational condition
    • H04W 52/00  Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W 52/02  Power saving arrangements
    • H04W 52/0203  Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H04W 52/0206  Power saving arrangements in the radio access network or backbone network of wireless communication networks in access points, e.g. base stations
    • H04W 72/50  Allocation or scheduling criteria for wireless resources
    • H04W 72/535  Allocation or scheduling criteria for wireless resources based on resource usage policies
    • H04W 72/54  Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W 72/543  Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
    • Y  GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02  TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D  CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00  Reducing energy consumption in communication networks
    • Y02D 30/70  Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a D2D user resource allocation method and a storage medium based on a deep reinforcement learning algorithm, relating to the technical field of wireless communication. The method comprises the following steps: constructing a wireless network model and discretizing the D2D transmit power; constructing a user signal-to-noise ratio calculation model and taking the maximum throughput of the communication system as the optimization target; setting a prediction strategy network pi, a predicted value network Q, a target strategy network pi' and a target value network Q'; modeling the D2D communication environment as a Markov decision process, regarding each D2D transmitter as an agent that cyclically loads the parameters of the target strategy network and then generates a strategy to interact with the environment, and determining the state space, action space and reward function; performing strategy optimization for each D2D user with the MAAC algorithm; cyclically updating the parameters of the target strategy network and the target value network in a soft-update manner until learning and training are completed; the D2D users then download the parameters of the trained target strategy network and carry out strategy improvement.

Description

D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
Technical Field
The invention relates to the technical field of wireless communication, in particular to a D2D user resource allocation method based on a deep reinforcement learning algorithm and a storage medium.
Background
In today's era of rapid technological development, wireless communication technology permeates daily life. People's demands on mobile communication have grown rapidly and keep rising: from devices that only needed to support simple voice calls, to basic internet browsing, to today's video streaming and music playback with ever higher requirements on video definition and sound quality. However, the shortage of spectrum resources is particularly prominent in environments with dense users and strong mutual interference, and many methods have been proposed to address this problem.
One of these technologies, device-to-device (D2D) communication, allows neighboring devices in a communication network to exchange information directly. Compared with traditional cellular communication, D2D communication does not require the base station as a relay, so communication remains possible far from the base station or even without base station coverage, which effectively relieves the transmission pressure on the base station. D2D communication can also share the spectrum resources of cellular users, which greatly improves spectrum utilization, increases system throughput, and improves the performance of the whole communication system.
In D2D communication, reasonable power allocation and resource block allocation for D2D users (D2D User Equipment, DUE) are essential. Since DUEs mainly reuse the spectrum resources occupied by cellular users (Cellular User Equipment, CUE), interference arises among the DUEs, the CUEs and the base station (BS). Many solutions have been proposed to avoid this interference and improve the quality of service (QoS) of D2D users. For example, in recent years channel allocation and power control have been handled with popular machine learning techniques, but most of these assume an idealized model in which the information of all users is known. In a real environment, however, both DUEs and CUEs are dynamic: their location information, channel gains and other quantities change constantly, the amount of information is huge, and rapid scene changes lead to high computational complexity, so conventional optimization methods cannot be applied.
Disclosure of Invention
The invention provides a D2D user resource allocation method and a storage medium based on a deep reinforcement learning algorithm, which are used for overcoming the defect that the prior art cannot adapt to a dynamic environment.
In order to solve the technical problems, the technical scheme of the invention is as follows:
in a first aspect, a D2D user resource allocation method based on a deep reinforcement learning algorithm includes:
constructing a wireless network model, and discretizing D2D transmitting power to generate K power levels; the wireless network model comprises a macro base station, L cellular users, N pairs of D2D user pairs and M orthogonal frequency spectrum resource blocks in the network coverage area of the macro base station, wherein parameters configured by the wireless network model comprise user positions;
constructing a user signal-to-noise ratio calculation model, which is used for calculating signal-to-noise ratio information of a D2D user and a cellular user, setting QoS requirements of the D2D user for communication with the cellular user, and optimizing a wireless network model by taking the maximum throughput of a communication system formed by the D2D user and the cellular user as an optimization target; the user signal-to-noise ratio comprises the signal-to-noise ratio of a D2D user receiving end and the signal-to-noise ratio of a cellular user;
the macro base station sets a prediction strategy network pi, a prediction value network Q, a target strategy network pi 'and a target value network Q' for each intelligent agent;
modeling the D2D communication environment as a Markov decision process, regarding each D2D transmitter as an agent that cyclically loads the parameters of the target strategy network and then generates a strategy to interact with the environment, and determining the state space, action space and reward function; on the premise of meeting the QoS requirements, each agent selects the communication mode to be adopted at time t, executes an action according to the currently observed state, obtains a reward and transitions to the next state, and uploads the experience group to the experience pool for centralized training; the communication modes include a dedicated mode, a multiplexing mode and a waiting mode, the state includes the location information and signal-to-noise ratio information of the D2D user and the cellular user, and the action includes selecting a power value and a resource block for communication;
performing strategy optimization on each D2D user by adopting a MAAC algorithm, performing centralized training by randomly sampling small batches from an experience pool, updating a predicted value network by adopting a TD algorithm, updating parameters of the predicted value network by adopting a gradient descent method, calculating accumulated rewards based on rewards obtained by executing actions by an agent, setting a strategy gradient according to the accumulated rewards, and circularly updating parameters of the predicted strategy network by adopting a gradient ascent method based on the strategy gradient; the learning goal of the MAAC algorithm is to learn a strategy for each agent to obtain the maximum accumulated benefit;
based on the parameters of the prediction strategy network and the prediction value network, circularly updating the parameters of the target strategy network and the target value network in a soft updating mode until learning training is completed;
and D2D users download parameters of the target strategy network after training, perform strategy improvement and select a communication mode, a resource block and/or communication power according to the observed current environment.
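For illustration only, the following is a minimal Python sketch of the distributed-execution phase just described: each D2D user downloads the trained target strategy network parameters and selects its action locally from its own observed state. The class and function names (TargetPolicy, distributed_execution) and the linear scorer are hypothetical stand-ins, not the patented networks.

```python
import numpy as np

class TargetPolicy:
    """Stand-in for the trained target strategy network pi' (hypothetical)."""
    def __init__(self, weights):
        self.w = weights                       # parameters downloaded from the base station

    def act(self, state):
        scores = self.w @ state                # linear scorer as a placeholder for the neural network
        return int(np.argmax(scores))          # flat index encoding a (resource block, power level) choice

def distributed_execution(observations, downloaded_weights):
    """Each D2D user decides locally from its own observation; no further base-station help is needed."""
    return [TargetPolicy(w).act(s) for w, s in zip(downloaded_weights, observations)]

# toy run: 3 D2D users, 6-dimensional observations, 8 discrete actions each
obs = [np.random.rand(6) for _ in range(3)]
weights = [np.random.rand(8, 6) for _ in range(3)]
print(distributed_execution(obs, weights))
```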
In a second aspect, a computer storage medium has instructions stored therein, which when executed on a computer, cause the computer to perform the method of the first aspect.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
(1) For communication scenarios with dense users and rapidly changing conditions, which traditional algorithms handle poorly, the model-free reinforcement learning algorithm adopted by the invention can effectively solve decision problems in an uncertain environment.
(2) The MAAC algorithm framework adopted by the invention coordinates the strategies of multiple agents, effectively overcomes the non-stationarity of the multi-agent environment, achieves optimal energy efficiency of the communication system, and is suitable for complex and changeable communication scenarios.
(3) The invention adopts centralized training with distributed execution: the D2D users upload the useful information from their interaction with the environment to the experience pool, so the complex training process is offloaded to the base station, which makes the training of the agents more efficient.
(4) A D2D user pair of the present invention can operate in two modes: dedicated mode and multiplexing mode. A D2D user preferentially selects an idle channel for communication; before selecting the multiplexing mode, it checks in advance whether both the cellular user and the D2D user still meet their QoS requirements after the D2D user reuses the cellular user's spectrum resources, and multiplexing is performed only when the QoS requirements are met. This improves band utilization, greatly reduces the failure rate of cellular data transmission, and guarantees the reliability of data transmission.
(5) With the proposed algorithm, each D2D user can independently select its transmission power while guaranteeing QoS, which avoids D2D users always transmitting data at the highest power and reduces the power consumption of the system.
Drawings
FIG. 1 is a flow chart of a method for D2D user resource allocation based on a deep reinforcement learning algorithm;
fig. 2 is a schematic structural diagram of a wireless network model in embodiment 1;
FIG. 3 is a schematic diagram showing the interaction process of the agent and the environment in example 1;
FIG. 4 is a diagram illustrating a network update procedure in embodiment 1;
FIG. 5 is a schematic diagram illustrating information sharing between neighboring agents in embodiment 1;
fig. 6 is a schematic diagram of the training process in example 2.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a D2D user resource allocation method based on a deep reinforcement learning algorithm, referring to fig. 1, including:
constructing a wireless network model as shown in fig. 2, and discretizing the D2D transmitting power to generate K power levels; the wireless network model comprises a macro base station, L cellular users, N pairs of D2D user pairs and M orthogonal frequency spectrum resource blocks in the network coverage area of the macro base station, wherein parameters configured by the wireless network model comprise user positions;
constructing a user signal-to-noise ratio calculation model, which is used for calculating signal-to-noise ratio information of a D2D user and a cellular user, setting QoS requirements of the D2D user for communication with the cellular user, and optimizing a wireless network model by taking the maximum throughput of a communication system formed by the D2D user and the cellular user as an optimization target; the user signal-to-noise ratio comprises the signal-to-noise ratio of a D2D user receiving end and the signal-to-noise ratio of a cellular user;
the macro base station sets a prediction strategy network pi, a prediction value network Q, a target strategy network pi 'and a target value network Q' for each intelligent agent;
modeling the D2D communication environment as a Markov decision process, regarding each D2D transmitter as an agent that cyclically loads the parameters of the target strategy network pi' and then generates a strategy to interact with the environment, and determining the state space, action space and reward function; referring to fig. 3, on the premise of meeting the QoS requirements, each agent selects the communication mode to be adopted at time t, executes action a according to the currently observed state s, obtains reward r and transitions to the next state s', and uploads the experience group (s, a, s', r) to the experience pool for centralized training; the communication modes include a dedicated mode, a multiplexing mode and a waiting mode, the state includes the location information and signal-to-noise ratio information of the D2D user and the cellular user, and the action includes selecting a power value and a resource block for communication;
performing strategy optimization on each D2D user by adopting a MAAC algorithm, performing centralized training by randomly sampling small batches from an experience pool, updating a predicted value network by adopting a TD algorithm, updating parameters of the predicted value network by adopting a gradient descent method, calculating accumulated rewards based on rewards obtained by executing actions by an agent, setting a strategy gradient according to the accumulated rewards, and circularly updating parameters of the predicted strategy network by adopting a gradient ascent method based on the strategy gradient; the learning goal of the MAAC algorithm is to learn a strategy for each agent to obtain the maximum accumulated benefit;
based on the parameters of the prediction strategy network and the prediction value network, circularly updating the parameters of the target strategy network and the target value network in a soft updating mode until learning training is completed;
and D2D users download parameters of the target strategy network after training, perform strategy improvement and select a communication mode, a resource block and/or communication power according to the observed current environment.
In this embodiment, the Markov decision process is solved with model-free reinforcement learning: each D2D transmitter is regarded as an agent, and in an unknown environment the agents (i.e. the D2D users) improve the performance of the whole system through their own decisions. Adaptive learning is achieved through continuous trial and error as the multiple agents interact with the environment, which is particularly suitable for the wireless communication setting, where channel state information is huge and rapid scene changes cause high computational complexity.
Meanwhile, the embodiment also adopts a MAAC (Multi-Agent Actor-Critic) algorithm, and the MAAC algorithm can be divided into two parts of centralized training and distributed execution; the centralized training is to transfer the complex multi-agent training process to the base station for carrying out, and the base station can easily deploy hardware equipment such as GPU (graphic processing unit) and the like, so that the calculation is accelerated; the distributed execution process refers to taking each D2D transmitter as an agent to interact with the environment for sampling, so that the signaling overhead and the calculation load of the base station are reduced. According to the embodiment, the optimal strategy is found for each D2D user through the MAAC algorithm, so that a strategy is found to maximize the energy efficiency of the whole system, and the problem of instability in the training and learning process is solved.
Exemplary parameters configured for the wireless network model include, but are not limited to, user locations, the network coverage radius, the base station location, channel gains and/or the number of resource blocks.
The QoS requirements are illustratively set based on a user minimum signal-to-noise ratio.
It can be understood that the experience pool stores historical experiences generated by interaction of the intelligent agent and the environment, and data in the experience pool is randomly sampled in small batches for training, so that correlation among samples can be reduced, and experience waste is avoided.
Illustratively, the experience pool is a region of limited size, and when the experience pool is full, the oldest experience will be discarded.
In one implementation, the experience pool size is set to 4026 and the batch size of each sample is 128.
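As an illustration of this experience-pool behaviour, the following minimal Python sketch uses a fixed-capacity deque so that the oldest experience is discarded when the pool is full; the class name is hypothetical, and only the sizes (4026, 128) are taken from the values above.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-size experience pool: the oldest experience is discarded when full."""
    def __init__(self, capacity=4026):
        self.buffer = deque(maxlen=capacity)    # deque drops the oldest item automatically

    def push(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))   # store one experience group (s, a, s', r)

    def sample(self, batch_size=128):
        return random.sample(self.buffer, batch_size)   # random minibatch reduces sample correlation

    def __len__(self):
        return len(self.buffer)
```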
In a preferred embodiment, the user signal-to-noise ratio calculation model includes the SINR of the m-th D2D receiving end and the SINR of the l-th cellular user.
The SINR of the m-th D2D receiving end is expressed as:
$$\mathrm{SINR}_m^D=\frac{p_m^D h_m^D}{\alpha_m p^C h_{C,m}+\sum_{n\ne m}\beta_{n,m}\,p_n^D h_{n,m}+\sigma^2}$$
where $p_m^D$ denotes the transmit power of the D2D transmitter; $h_m^D$ denotes the channel gain between the D2D transmitter and the D2D receiver; $\alpha_m$ denotes the cellular resource sharing coefficient used to distinguish D2D communication modes: when the m-th D2D user communicates on an idle channel, i.e. does not multiplex a cellular user's spectrum resource block and suffers no cellular interference, $\alpha_m=0$; when it multiplexes a cellular user's spectrum resource block, $\alpha_m=1$; $p^C$ denotes the transmit power of the cellular user; $h_{C,m}$ denotes the channel gain from the cellular user to the D2D receiver; $\beta_{n,m}$ denotes the D2D resource sharing coefficient: if another, n-th, D2D user multiplexes the same resource block as the m-th D2D user, $\beta_{n,m}=1$, otherwise $\beta_{n,m}=0$; $p_n^D$ denotes the transmit power of the other D2D users; $h_{n,m}$ denotes the channel gain from the other D2D transmitters to the D2D receiver; $\sigma^2$ denotes the Gaussian white noise power.
The SINR of the l-th cellular user is expressed as:
$$\mathrm{SINR}_l^C=\frac{p_l^C h_{l,B}}{\sum_{n=1}^{N}\rho_{n,l}\,p_n^D h_{n,l}+\sigma^2}$$
where $p_l^C$ denotes the transmit power of the cellular user; $h_{l,B}$ denotes the channel gain between the cellular user and the macro base station; $\rho_{n,l}$ denotes the resource block multiplexing coefficient: $\rho_{n,l}=1$ indicates that the n-th D2D user multiplexes the cellular user's resource block, otherwise $\rho_{n,l}=0$; $p_n^D$ denotes the transmit power of the n-th D2D transmitter; $h_{n,l}$ denotes the channel gain from D2D user n to cellular user l; $\sigma^2$ denotes the Gaussian white noise power.
The system throughput Tp is expressed as:
$$Tp=Tp^C+Tp^D=\sum_{l=1}^{L}B^C\log_2\!\left(1+\mathrm{SINR}_l^C\right)+\sum_{m=1}^{N}B^D\log_2\!\left(1+\mathrm{SINR}_m^D\right)$$
where $B^C$ denotes the bandwidth between a cellular user and the macro base station, and $B^D$ denotes the bandwidth between a D2D transmitter and a D2D receiver; $Tp^C$ denotes the throughput on the cellular user side; $Tp^D$ denotes the throughput on the D2D user side.
Setting the QoS requirements for communication between the D2D user pairs and the cellular users, and taking the maximum throughput of the communication system composed of D2D users and cellular users as the optimization target, the wireless network model is optimized according to the following problem:
$$\max\ Tp \qquad (3a)$$
$$\mathrm{SINR}_m^D\ge\gamma^{D*} \qquad (3b)$$
$$\mathrm{SINR}_l^C\ge\gamma^{C*} \qquad (3c)$$
$$p_{\min}^D\le p_n^D\le p_{\max}^D \qquad (3d)$$
$$p^C=C \qquad (3e)$$
where equation (3a) expresses the throughput-maximization objective; equations (3b) and (3c) express the SINR requirements of the D2D receiver and the cellular user; equations (3d) and (3e) express the constraints on the transmit power of the D2D transmitters and the cellular users; $\gamma^{D*}$ denotes the D2D minimum signal-to-noise requirement; $\gamma^{C*}$ denotes the minimum signal-to-noise ratio requirement of the cellular users; $p_{\min}^D$ denotes the D2D minimum transmission power; $p_{\max}^D$ denotes the D2D highest transmission power; $p_n^D$ denotes the transmit power of the n-th D2D pair; $p^C$ denotes the transmit power of the cellular users; C is a constant indicating that the transmit power of all cellular users in the environment is fixed.
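To make the SINR and throughput expressions above concrete, here is a small numerical sketch in Python; the function names, array shapes and toy channel gains are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def d2d_sinr(p_d, h_d, alpha, p_c, h_c2d, beta, p_other, h_other, noise=1e-11):
    """SINR at a D2D receiver: desired power over cellular + co-channel D2D interference + noise."""
    interference = alpha * p_c * h_c2d + np.sum(beta * p_other * h_other)
    return p_d * h_d / (interference + noise)

def cellular_sinr(p_c, h_cb, rho, p_d_all, h_d2b, noise=1e-11):
    """SINR of a cellular user at the base station with interference from multiplexing D2D pairs."""
    return p_c * h_cb / (np.sum(rho * p_d_all * h_d2b) + noise)

def link_throughput(bw, sinr):
    """Shannon capacity of one link, as in the throughput expression above."""
    return bw * np.log2(1.0 + sinr)

# toy numbers (illustrative only)
sinr_d = d2d_sinr(0.1, 1e-6, 1, 0.5, 1e-8,
                  np.array([1, 0]), np.array([0.1, 0.1]), np.array([1e-8, 1e-7]))
print(link_throughput(180e3, sinr_d))
```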
In an alternative embodiment, the D2D communication environment is modeled as a Markov decision process and each D2D transmitter is regarded as an agent; the agent cyclically loads the parameters of the target strategy network pi' and then generates a strategy to interact with the environment; the state space, action space and reward function are determined; on the premise of meeting the QoS requirements, each agent selects the communication mode to be adopted at time t, executes action a according to the currently observed state s, obtains reward r and transitions to the next state s', and uploads the experience group (s, a, s', r) to the experience pool for centralized training. Specifically:
modeling the D2D communication environment as a Markov decision process and treating each D2D transmitter as an agent;
the agent cyclically loads the parameters of the target strategy network pi' and then generates a strategy to interact with the environment; it selects the communication mode to be adopted at time t, executes action a according to the state s observed at time t, obtains reward r, and transitions to the next state s'; all actions executed by the agent are performed under the constraint of the QoS requirements;
the state space of the m-th D2D user at time t is defined as
$$s_t^m=\left\{I_t^D,\,I_t^C\right\}$$
where $I_t^D$ denotes the basic information of the D2D user itself at time t, including the D2D user's location information $L_t^D$ and its signal-to-noise ratio information $\gamma_t^D$, i.e. $I_t^D=\{L_t^D,\gamma_t^D\}$; $I_t^C$ denotes the cellular user's basic information, including the cellular user's location information $L_t^C$ and its signal-to-noise ratio information $\gamma_t^C$, i.e. $I_t^C=\{L_t^C,\gamma_t^C\}$;
the action space of the m-th D2D user at time t is defined as
$$a_t^m=\left\{x_t^m,\,z_t^m\right\}$$
where $x_t^m$ indicates that the D2D user selects the x-th resource block, with M dimensions in total; $z_t^m$ indicates that the z-th power level is selected for communication, with K choices;
the reward obtained by the m-th user for executing its action at time t is defined as
$$r_t^m=\begin{cases}B^D\log_2\!\left(1+\gamma_t^m\right), & \text{QoS requirements satisfied}\\ \lambda_0, & \text{otherwise}\end{cases}$$
where $\lambda_0$ is a constant less than 0; $\gamma_t^m$ denotes the signal-to-noise ratio of the m-th D2D user at time t; $B^D$ denotes the D2D user bandwidth;
the pre-transition state s, the executed action a, the post-transition state s' and the reward r are uploaded to the experience pool in the form of the experience group (s, a, s', r).
Illustratively, the corresponding experience group is uploaded whenever the agent of the m-th user obtains a reward.
Illustratively, the corresponding experience group is uploaded only when the agent of the m-th user obtains a non-negative reward; otherwise, the upload is not performed.
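A small sketch of how the state, action and reward just defined might be encoded in code; the flat action indexing, array shapes and the penalty value are illustrative assumptions.

```python
import numpy as np

M_RESOURCE_BLOCKS, K_POWER_LEVELS = 8, 10
POWER_LEVELS_DBM = np.linspace(0, 30, K_POWER_LEVELS)   # discretized D2D transmit power levels

def build_state(d2d_pos, d2d_sinr, cue_pos, cue_sinr):
    """State = own position/SINR plus cellular-user position/SINR, flattened into one vector."""
    return np.concatenate([d2d_pos, [d2d_sinr], cue_pos, [cue_sinr]])

def decode_action(a):
    """Flat action index -> (resource block index, power level); M*K discrete actions in total."""
    return a // K_POWER_LEVELS, POWER_LEVELS_DBM[a % K_POWER_LEVELS]

def reward(bw_d2d, sinr, qos_ok, penalty=-1.0):
    """Throughput-shaped reward when QoS holds, a fixed negative constant otherwise (assumed form)."""
    return bw_d2d * np.log2(1.0 + sinr) if qos_ok else penalty

print(decode_action(37), reward(180e3, 15.0, qos_ok=True))
```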
In a preferred embodiment, each agent selects the communication mode to be adopted at time t as follows:
judge whether an idle channel exists in the system: if so, the dedicated mode is adopted for communication;
otherwise, judge whether the QoS requirements of the D2D user and the cellular user are still met after the resource block is multiplexed: if so, the D2D user enters the multiplexing mode and communicates on the shared cellular user resource; otherwise, it enters the waiting mode and does not communicate until an idle channel appears in the system, at which point it initiates a communication request again.
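The mode-selection rule can be sketched as the following Python snippet; the function signature is hypothetical, and qos_ok_after_reuse stands for the QoS pre-check described above.

```python
def select_mode(idle_channels, qos_ok_after_reuse):
    """Mode-selection rule sketched from the description: dedicated > multiplexing > waiting."""
    if idle_channels:                       # a free resource block exists
        return "dedicated", idle_channels[0]
    if qos_ok_after_reuse:                  # pre-check: both CUE and DUE keep their QoS after reuse
        return "multiplexing", None         # reuse a cellular user's resource block
    return "waiting", None                  # defer and re-request when a channel frees up

print(select_mode([], qos_ok_after_reuse=True))   # -> ('multiplexing', None)
```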
In a preferred embodiment, the accumulated reward is expressed as:
$$R_t=\mathbb{E}\left[\sum_{n=0}^{\infty}\gamma^{\,n}\,r_{t+n}\right]$$
where $\gamma$ denotes the discount factor, whose value lies in the interval [0,1]; $\mathbb{E}[\cdot]$ denotes the reward expectation; $r_{t+n}$ denotes the instant reward; n denotes the number of future steps over which the reward is discounted by the n-th power of $\gamma$.
In an alternative embodiment, referring to fig. 4, strategy optimization is performed for each D2D user with the MAAC algorithm: small batches of data are randomly sampled from the experience pool for centralized training, the predicted value network is updated with the TD algorithm and its parameters are updated by the gradient descent method, the accumulated reward is computed from the rewards obtained by the agents' actions, a strategy gradient is set according to the accumulated reward, and the parameters of the prediction strategy network are cyclically updated by the gradient ascent method based on the strategy gradient. This includes:
in the multi-agent environment, the parameters of the prediction strategy networks $\pi=\{\pi_1,\pi_2,\dots,\pi_N\}$ and the predicted value networks $Q=\{Q_1,Q_2,\dots,Q_N\}$ of all agents are denoted $\theta^{\pi}=\{\theta^{\pi}_1,\dots,\theta^{\pi}_N\}$ and $\theta^{Q}=\{\theta^{Q}_1,\dots,\theta^{Q}_N\}$ respectively; the parameters of the target strategy networks $\pi'=(\pi'_1,\pi'_2,\dots,\pi'_N)$ and the target value networks $Q'=(Q'_1,Q'_2,\dots,Q'_N)$ of all agents are denoted $\theta^{\pi'}$ and $\theta^{Q'}$;
whether the number of experience groups stored in the experience pool reaches a preset threshold is judged: if so, centralized training is executed; otherwise, no operation is performed;
wherein the centralized training comprises:
randomly sampling a small batch from the experience pool and establishing the training data set of the current round;
the prediction strategy network of the i-th agent takes the state $s_t$ as input and generates the action-selection strategy A with the ε-greedy strategy; the agent executes action $a_t$, the state transitions to s', and reward $r_t$ is obtained; the ε-greedy strategy is expressed as:
$$A=\begin{cases}\text{a random action}, & \text{with probability } \varepsilon\\ \arg\max_{a} Q\!\left(s_t,a\right), & \text{with probability } 1-\varepsilon\end{cases}$$
where A denotes the action strategy of the agent; the value of ε decays continuously as learning proceeds;
the action-value function is approximated with the predicted value network, the predicted value network is updated with the TD algorithm, and the Q function, i.e. the action-value function, is learned with the Bellman equation; the predicted value network of the i-th agent takes the agent's state $s_t$ and action $a_t$ as input and outputs the action-value function $Q_i\!\left(s_t,a_t\,|\,\theta^{Q}\right)$; the target value network takes the post-transition state s' and the next action a' as input and outputs the next action-value function $Q'_i\!\left(s',a'\,|\,\theta^{Q'}\right)$;
according to the outputs of the predicted value network and the target value network, the predicted value network is updated by minimizing a loss function with a function-approximation method; the loss function is expressed as:
$$L\!\left(\theta^{Q}\right)=\mathbb{E}\left[\left(y_i-Q_i\!\left(s,a\,|\,\theta^{Q}\right)\right)^2\right]$$
where $y_i$ is the target value generated by the target value network, $y_i=r_i+\gamma\,Q'_i\!\left(s',a'\,|\,\theta^{Q'}\right)$; γ denotes the discount factor, whose value lies in the interval [0,1]: the smaller γ is, the less attention is paid to future benefits; when γ equals 0 only the immediate benefit is considered, and as γ approaches 1 future benefits are weighted more and more heavily; $Q_i\!\left(s,a\,|\,\theta^{Q}\right)$ is the predicted value output by the predicted value network;
the TD-error is defined as $\delta=y_i-Q_i\!\left(s,a\,|\,\theta^{Q}\right)$, and the parameters $\theta^{Q}$ of the predicted value network are updated by the gradient descent method so that the TD-error, and hence the prediction error, is reduced;
according to the accumulated reward of the i-th agent, the strategy gradient is defined as:
$$\nabla_{\theta^{\pi}_i} J\!\left(\theta^{\pi}_i\right)=\mathbb{E}_{s,a\sim D}\left[\nabla_{\theta^{\pi}_i}\pi_i\!\left(a_i\,|\,s_i\right)\,\nabla_{a_i} Q_i\!\left(s,a\,|\,\theta^{Q}\right)\right]$$
where $\nabla_{a_i} Q_i\!\left(s,a\,|\,\theta^{Q}\right)$ denotes the gradient of the Q function obtained from the predicted value network; $\nabla_{\theta^{\pi}_i}\pi_i\!\left(a_i\,|\,s_i\right)$ denotes the deterministic strategy gradient of the prediction strategy network; D denotes the experience pool;
based on the strategy gradient, the parameters $\theta^{\pi}_i$ of the prediction strategy network are updated with the gradient ascent method.
In this alternative embodiment, the ε-greedy strategy that generates the action-selection strategy A chooses a random action with probability ε and, with probability 1-ε, chooses the action whose action value at the next moment is largest. The value of ε decays continuously as learning proceeds, so that at the beginning of learning the agent uses a more exploratory strategy to fully explore the whole state space and find all possible states, while later, as its strategy matures, it uses a greedier strategy and selects the action with the largest current value.
Illustratively, the next-moment action a' is generated by the prediction strategy network.
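Below is a minimal single-agent PyTorch sketch of one critic and one actor update of the kind described above: the TD target uses the target value network, the next action a' comes from the prediction strategy network as stated in this embodiment, the critic is updated by gradient descent on the squared TD-error, and the actor by gradient ascent on the Q value (implemented as descent on its negative). Network sizes, learning rates and the dummy minibatch are assumptions, and the neighbor-state inputs of the full MAAC critic are omitted.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 4, 0.95
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_tgt = copy.deepcopy(critic)                      # target value network Q'
opt_a = torch.optim.Adam(actor.parameters(),  lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

# one dummy minibatch of experience groups (s, a, s', r) sampled from the pool
s, a  = torch.randn(128, obs_dim), torch.randn(128, act_dim)
s2, r = torch.randn(128, obs_dim), torch.randn(128, 1)

# critic (predicted value network): TD target y = r + gamma * Q'(s', a'), MSE loss, gradient descent
with torch.no_grad():
    a2 = actor(s2)                                      # next action a' from the prediction strategy network
    y  = r + gamma * critic_tgt(torch.cat([s2, a2], dim=1))
q = critic(torch.cat([s, a], dim=1))
critic_loss = nn.functional.mse_loss(q, y)              # mean squared TD-error over the batch
opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

# actor (prediction strategy network): deterministic policy gradient, ascend Q by descending -Q
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```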
Further, a neighbor-user mechanism is introduced into the input of the predicted value network. Specifically:
a distance constraint value $Z_o$ is set;
any j-th agent whose distance $Z_{i\sim j}$ to the i-th agent is not larger than the constraint value $Z_o$ is placed in the neighbor set $O_i=\{\mathrm{D2D}_j \mid Z_{i\sim j}\le Z_o\}$, $j\in N$, and the i-th and j-th agents are neighbor users; the distance between different agents is the distance between their D2D transmitters and is computed with the Euclidean distance formula; for the i-th agent at position $Z_i=(X_i,Y_i)$ and the j-th agent at position $Z_j=(X_j,Y_j)$, the distance is expressed as:
$$Z_{i\sim j}=\sqrt{\left(X_i-X_j\right)^2+\left(Y_i-Y_j\right)^2}$$
the input of the predicted value network of the i-th agent then includes not only the state and action of the i-th agent but also the states and actions of the agents in the set $O_i$, and it outputs the action-value function $Q_i$ of the i-th agent.
Since D2D users that are far apart experience large signal fading, interference mainly comes from nearby D2D users sharing the same spectrum; even if distant D2D users share the same spectrum resource, the interference between them is almost negligible. With the neighbor-user mechanism, the base station decides whether information needs to be shared according to the distance between D2D pairs, which prevents closely spaced D2D pairs from colliding by selecting the same spectrum during communication, and the predicted value network can evaluate the quality of an action using the state-action information of the other nearby agents in the set $O_i$. Fig. 5 illustrates the information-sharing process between neighboring agents. Compared with a scheme in which the base station must collect global D2D user information for coordination, sharing information among only part of the D2D users greatly reduces the computational overhead of the base station and improves system performance.
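A short sketch of how the neighbor sets $O_i$ could be built from transmitter coordinates; the function name and the example coordinates are illustrative.

```python
import numpy as np

def neighbor_sets(tx_positions, z_o=50.0):
    """O_i = { j : ||Z_i - Z_j|| <= Z_o, j != i }, computed between D2D transmitter positions."""
    pos = np.asarray(tx_positions, dtype=float)
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)   # pairwise Euclidean distances
    n = len(pos)
    return [set(np.where((dists[i] <= z_o) & (np.arange(n) != i))[0]) for i in range(n)]

print(neighbor_sets([(0, 0), (30, 40), (200, 0)]))   # -> [{1}, {0}, set()]
```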
In a preferred embodiment, an eligibility-trace mechanism is introduced into the update process of the parameters $\theta^{\pi}$ and $\theta^{Q}$ of the prediction strategy network and the predicted value network. Specifically:
$$\theta^{\pi}\leftarrow\theta^{\pi}+\alpha^{\pi}\,\delta\,z^{\pi}$$
$$\theta^{Q}\leftarrow\theta^{Q}+\alpha^{Q}\,\delta\,z^{Q}$$
where δ denotes the TD-error, $\delta=G_t^{\lambda}-Q\!\left(s_t,a_t\,|\,\theta^{Q}\right)$, with $Q\!\left(s_t,a_t\,|\,\theta^{Q}\right)$ the action-value function output by the predicted value network and $G_t^{\lambda}$ the λ-return of the n-step temporal-difference error, expressed as:
$$G_t^{\lambda}=(1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{\,n-1}G_{t:t+n}+\lambda^{\,T-t-1}G_t$$
where T denotes the final time step; λ is the decay-rate parameter, with value in the interval [0,1]: when λ=0 the λ-return is $G_{t:t+1}$, i.e. the single-step return, and the λ-return update algorithm reduces to the single-step temporal-difference-error algorithm; when λ=1 the λ-return is $G_t$, and the λ-return update algorithm is the Monte Carlo algorithm;
$z^{\pi}$ denotes the eligibility trace of the prediction strategy network and $z^{Q}$ the eligibility trace of the predicted value network; they are updated as:
$$z_t^{\pi}=\gamma\lambda\, z_{t-1}^{\pi}+\nabla_{\theta^{\pi}}\pi\!\left(a_t\,|\,s_t\right)$$
$$z_t^{Q}=\gamma\lambda\, z_{t-1}^{Q}+\nabla_{\theta^{Q}}Q\!\left(s_t,a_t\,|\,\theta^{Q}\right)$$
where λ is the decay-rate parameter, λ∈[0,1], and γ is the discount coefficient; $\nabla_{\theta^{\pi}}\pi$ denotes the gradient of the prediction strategy network and $\nabla_{\theta^{Q}}Q$ the gradient of the predicted value network; the eligibility trace accumulates a gradient value at each step and decays with γλ, tracking the components of the weight vector that have contributed, positively or negatively, to the most recent state estimates.
The eligibility trace is a vector with the same dimension as the weight vector and acts as a short-term memory. In this preferred embodiment it assists the learning process by influencing the weight vector, which in turn determines the estimated values; introducing the eligibility-trace mechanism makes the training process of the agents more efficient. In addition, compared with the conventional approach that uses only the single-step temporal-difference error, the λ-return built from n-step temporal-difference errors adopted in this preferred embodiment can significantly improve prediction accuracy.
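For illustration, a numpy sketch of the accumulating-trace update and the trace-weighted parameter step described above; the learning rate, discount and decay values, and the toy gradients, are assumptions.

```python
import numpy as np

def trace_update(z, grad, gamma=0.95, lam=0.9):
    """Accumulating trace: z_t = gamma * lambda * z_{t-1} + grad_t (decays with gamma*lambda each step)."""
    return gamma * lam * z + grad

def param_update(theta, z, delta, alpha=1e-3):
    """theta <- theta + alpha * delta * z, with delta the TD-error."""
    return theta + alpha * delta * z

theta, z = np.zeros(4), np.zeros(4)
for delta, grad in [(0.5, np.array([1.0, 0, 0, 0])), (-0.2, np.array([0, 1.0, 0, 0]))]:
    z = trace_update(z, grad)               # the trace accumulates the new gradient while decaying
    theta = param_update(theta, z, delta)
print(theta)
```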
In a preferred embodiment, the soft-update procedure of the parameters of the target strategy network and the target value network is:
$$\theta^{\pi'}\leftarrow\tau\,\theta^{\pi}+(1-\tau)\,\theta^{\pi'}$$
$$\theta^{Q'}\leftarrow\tau\,\theta^{Q}+(1-\tau)\,\theta^{Q'}$$
where $\theta^{\pi'}$ denotes the parameters of the target strategy network; τ denotes the parameter-update coefficient, whose value lies in the interval [0,1]; $\theta^{\pi}$ denotes the parameters of the prediction strategy network; $\theta^{Q'}$ denotes the parameters of the target value network; $\theta^{Q}$ denotes the parameters of the predicted value network.
In a specific implementation, τ=0.01 is preset, so that the parameters of the target networks are updated slowly, which improves the stability of learning.
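The soft update itself is a one-line element-wise blend; below is a minimal sketch with a toy parameter list and the τ=0.01 value from this implementation.

```python
def soft_update(target_params, online_params, tau=0.01):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, element-wise."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]

target, online = [0.0, 0.0], [1.0, 2.0]
for _ in range(3):                     # the target parameters drift slowly toward the online ones
    target = soft_update(target, online)
print(target)                          # roughly [0.0297, 0.0594]
```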
Example 2
In this embodiment, a simulation experiment is performed on the method proposed in embodiment 1, and the following simulation environment is set in consideration of the uplink of the cellular network in a single cell:
initializing a communication environment, setting the radius of a coverage area of a base station to be 500m, setting the base station to be positioned at the central position of a cell, setting the height of the base station to be 25m, randomly distributing 8 cellular users and 16D 2D users in the coverage area of the base station, enabling the users to move at a speed of 4-8km/h, and setting a distance constraint value Z of sharable information of the D2D users o =50m, each cellular user is allocated a resource block with a bandwidth of 180khz, and d2d users communicate by multiplexing the resource blocks of the cellular users in the area.
The transmission power of the initialized cellular user is 46dBm, the transmission power of the D2D user is set to be [0,30], the minimum signal to noise ratio of the cellular user is required to be 1dB, gaussian white noise is-114 dBm, the path loss is 128.1+37.6log (R (km)), the bandwidth is 4MHz, and the carrier frequency is 2GHz.
Initializing a prediction strategy network pi, a prediction value network Q, a target strategy network pi 'and a target value network Q'; wherein, the prediction strategy network pi and the target strategy network pi' are all fully connected neural networks with two hidden layers, the number of the hidden layer neurons is set to 256 and 128 respectively, and the network learning rate is 0.0001; the predictive value network Q and the target value network Q' are all fully connected neural networks with three hidden layers, the number of neurons is 256, 128 and 64, respectively, and the network learning rate is 0.001. Optimizing the network by using an Adma optimizer, and setting the initial qualification trace of the network to be z -1 =0。
Setting a negative prize valueFor-1, a discount factor γ=0.95 is set.
The empirical pool D is 4026 in size and the batch size per sample is 128.
A wireless network model is built comprising 1 base station, 8 cellular users and 8 pairs of D2D users.
And constructing a user signal-to-noise ratio calculation model, and optimizing the wireless network model by taking the maximum throughput of a communication system consisting of the D2D user and the cellular user as an optimization target.
Model the D2D communication environment as a Markov decision process, regard each D2D transmitter as an agent that cyclically loads the parameters of the target strategy network and then generates a strategy to interact with the environment, and determine the state space, action space and reward function; on the premise of meeting the QoS requirements, each agent selects the communication mode to be adopted at time t, executes action a according to the currently observed state s, obtains reward r and transitions to the next state s', and uploads the experience group to the experience pool for centralized training.
Perform strategy optimization for each D2D user with the MAAC algorithm: update the predicted value network with the TD algorithm and its parameters by the gradient descent method, compute the accumulated reward from the rewards obtained by the agents' actions, set the strategy gradient according to the accumulated reward, introduce the eligibility-trace mechanism, and cyclically update the parameters of the prediction strategy network with the gradient ascent method based on the strategy gradient.
Cyclically update the parameters of the target strategy network and the target value network in a soft-update manner based on the parameters of the prediction strategy network and the predicted value network.
Referring to fig. 6, the agents are trained in the environment for 10000 episodes of 100 steps each, with the network parameters updated every 50 steps. The reward is recorded throughout the training process, and the strategies of the agents are continuously optimized in the direction of maximizing the reward, finally yielding the resource allocation scheme that maximizes energy efficiency.
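A sketch of this training schedule, with hypothetical env, agents and pool objects standing in for the simulated cell and the MAAC components described above.

```python
EPISODES, STEPS_PER_EPISODE, UPDATE_EVERY = 10_000, 100, 50

def run_training(env, agents, pool):
    """Episode loop: interact, store experience groups, train centrally every UPDATE_EVERY steps."""
    step_count, episode_rewards = 0, []
    for _ in range(EPISODES):
        states, total = env.reset(), 0.0
        for _ in range(STEPS_PER_EPISODE):
            actions = [ag.act(s) for ag, s in zip(agents, states)]
            next_states, rewards = env.step(actions)
            pool.push_all(states, actions, next_states, rewards)   # (s, a, s', r) groups
            total += sum(rewards)
            states = next_states
            step_count += 1
            if step_count % UPDATE_EVERY == 0 and pool.ready():
                for ag in agents:
                    ag.update(pool.sample())       # centralized training at the base station
                    ag.soft_update_targets()       # soft update with tau = 0.01
        episode_rewards.append(total)              # reward curve tracked across training
    return episode_rewards
```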
Example 3
The present example provides a computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of embodiment 1.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The D2D user resource allocation method based on the deep reinforcement learning algorithm is characterized by comprising the following steps of:
constructing a wireless network model, and discretizing D2D transmitting power to generate K power levels; the wireless network model comprises a macro base station, L cellular users, N pairs of D2D user pairs and M orthogonal frequency spectrum resource blocks in the network coverage area of the macro base station, wherein parameters configured by the wireless network model comprise user positions;
constructing a user signal-to-noise ratio calculation model, which is used for calculating signal-to-noise ratio information of a D2D user and a cellular user, setting QoS requirements of the D2D user for communication with the cellular user, and optimizing a wireless network model by taking the maximum throughput of a communication system formed by the D2D user and the cellular user as an optimization target; the user signal-to-noise ratio comprises the signal-to-noise ratio of a D2D user receiving end and the signal-to-noise ratio of a cellular user;
the macro base station sets a prediction strategy network pi, a prediction value network Q, a target strategy network pi 'and a target value network Q' for each intelligent agent;
modeling the D2D communication environment as a Markov decision process, regarding each D2D transmitter as an agent that cyclically loads the parameters of the target strategy network pi' and then generates a strategy to interact with the environment, and determining the state space, action space and reward function; on the premise of meeting the QoS requirements, each agent selects the communication mode to be adopted at time t, executes action a according to the currently observed state s, obtains reward r and transitions to the next state s', and uploads the experience group (s, a, s', r) to the experience pool for centralized training; the communication modes include a dedicated mode, a multiplexing mode and a waiting mode, the state includes the location information and signal-to-noise ratio information of the D2D user and the cellular user, and the action includes selecting a power value and a resource block for communication;
performing strategy optimization on each D2D user by adopting a MAAC algorithm, performing centralized training by randomly sampling small batches from an experience pool, updating a predicted value network by adopting a TD algorithm, updating parameters of the predicted value network by adopting a gradient descent method, calculating accumulated rewards based on rewards obtained by executing actions by an agent, setting a strategy gradient according to the accumulated rewards, and circularly updating parameters of the predicted strategy network by adopting a gradient ascent method based on the strategy gradient; the learning goal of the MAAC algorithm is to learn a strategy for each agent to obtain the maximum accumulated benefit;
based on the parameters of the prediction strategy network and the prediction value network, circularly updating the parameters of the target strategy network and the target value network in a soft updating mode until learning training is completed;
and D2D users download parameters of the target strategy network after training, perform strategy improvement and select a communication mode, a resource block and/or communication power according to the observed current environment.
2. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein the user signal-to-noise ratio calculation model includes the SINR of the m-th D2D receiving end and the SINR of the l-th cellular user;
the SINR of the m-th D2D receiving end is expressed as:
$$\mathrm{SINR}_m^D=\frac{p_m^D h_m^D}{\alpha_m p^C h_{C,m}+\sum_{n\ne m}\beta_{n,m}\,p_n^D h_{n,m}+\sigma^2}$$
where $p_m^D$ denotes the transmit power of the D2D transmitter; $h_m^D$ denotes the channel gain between the D2D transmitter and the D2D receiver; $\alpha_m$ denotes the cellular resource sharing coefficient used to distinguish D2D communication modes: when the m-th D2D user communicates on an idle channel, i.e. does not multiplex a cellular user's spectrum resource block and suffers no cellular interference, $\alpha_m=0$; when it multiplexes a cellular user's spectrum resource block, $\alpha_m=1$; $p^C$ denotes the transmit power of the cellular user; $h_{C,m}$ denotes the channel gain from the cellular user to the D2D receiver; $\beta_{n,m}$ denotes the D2D resource sharing coefficient: if another, n-th, D2D user multiplexes the same resource block as the m-th D2D user, $\beta_{n,m}=1$, otherwise $\beta_{n,m}=0$; $p_n^D$ denotes the transmit power of the other D2D users; $h_{n,m}$ denotes the channel gain from the other D2D transmitters to the D2D receiver; $\sigma^2$ denotes the Gaussian white noise power;
the SINR of the l-th cellular user is expressed as:
$$\mathrm{SINR}_l^C=\frac{p_l^C h_{l,B}}{\sum_{n=1}^{N}\rho_{n,l}\,p_n^D h_{n,l}+\sigma^2}$$
where $p_l^C$ denotes the transmit power of the cellular user; $h_{l,B}$ denotes the channel gain between the cellular user and the macro base station; $\rho_{n,l}$ denotes the resource block multiplexing coefficient: $\rho_{n,l}=1$ indicates that the n-th D2D user multiplexes the cellular user's resource block, otherwise $\rho_{n,l}=0$; $p_n^D$ denotes the transmit power of the n-th D2D transmitter; $h_{n,l}$ denotes the channel gain from D2D user n to cellular user l; $\sigma^2$ denotes the Gaussian white noise power;
the system throughput Tp is expressed as:
$$Tp=Tp^C+Tp^D=\sum_{l=1}^{L}B^C\log_2\!\left(1+\mathrm{SINR}_l^C\right)+\sum_{m=1}^{N}B^D\log_2\!\left(1+\mathrm{SINR}_m^D\right)$$
where $B^C$ denotes the bandwidth between a cellular user and the macro base station, and $B^D$ denotes the bandwidth between a D2D transmitter and a D2D receiver; $Tp^C$ denotes the throughput on the cellular user side; $Tp^D$ denotes the throughput on the D2D user side;
setting the QoS requirements for communication between the D2D user pairs and the cellular users, and taking the maximum throughput of the communication system composed of D2D users and cellular users as the optimization target, the wireless network model is optimized according to the following problem:
$$\max\ Tp \qquad (3a)$$
$$\mathrm{SINR}_m^D\ge\gamma^{D*} \qquad (3b)$$
$$\mathrm{SINR}_l^C\ge\gamma^{C*} \qquad (3c)$$
$$p_{\min}^D\le p_n^D\le p_{\max}^D \qquad (3d)$$
$$p^C=C \qquad (3e)$$
where equation (3a) expresses the throughput-maximization objective; equations (3b) and (3c) express the SINR requirements of the D2D receiver and the cellular user; equations (3d) and (3e) express the constraints on the transmit power of the D2D transmitters and the cellular users; $\gamma^{D*}$ denotes the D2D minimum signal-to-noise requirement; $\gamma^{C*}$ denotes the minimum signal-to-noise ratio requirement of the cellular users; $p_{\min}^D$ denotes the D2D minimum transmission power; $p_{\max}^D$ denotes the D2D maximum transmission power; $p_n^D$ denotes the transmit power of the n-th D2D pair; $p^C$ denotes the transmit power of the cellular users; C is a constant indicating that the transmit power of all cellular users in the environment is fixed.
3. The D2D user resource allocation method of claim 2, wherein the D2D communication environment is modeled as a Markov decision process, the D2D transmitter is regarded as an agent, the agent cyclically loads the parameters of the target strategy network pi' and then generates a strategy to interact with the environment, the state space, action space and reward function are determined, each agent selects the communication mode to be adopted at time t on the premise of meeting the QoS requirements, executes action a according to the currently observed state s, obtains reward r and transitions to the next state s', and uploads the experience group (s, a, s', r) to the experience pool for centralized training, specifically as follows:
modeling the D2D communication environment as a Markov decision process and treating each D2D transmitter as an agent;
the agent cyclically loads the parameters of the target strategy network pi' and then generates a strategy to interact with the environment; it selects the communication mode to be adopted at time t, executes action a according to the state s observed at time t, obtains reward r, and transitions to the next state s'; all actions executed by the agent are performed under the constraint of the QoS requirements;
the state space of the m-th D2D user at time t is defined as
$$s_t^m=\left\{I_t^D,\,I_t^C\right\}$$
where $I_t^D$ denotes the basic information of the D2D user itself at time t, including the D2D user's location information $L_t^D$ and its signal-to-noise ratio information $\gamma_t^D$, i.e. $I_t^D=\{L_t^D,\gamma_t^D\}$; $I_t^C$ denotes the cellular user's basic information, including the cellular user's location information $L_t^C$ and its signal-to-noise ratio information $\gamma_t^C$, i.e. $I_t^C=\{L_t^C,\gamma_t^C\}$;
the action space of the m-th D2D user at time t is defined as
$$a_t^m=\left\{x_t^m,\,z_t^m\right\}$$
where $x_t^m$ indicates that the D2D user selects the x-th resource block, with M dimensions in total; $z_t^m$ indicates that the z-th power level is selected for communication, with K choices;
the reward obtained by the m-th user for executing its action at time t is defined as
$$r_t^m=\begin{cases}B^D\log_2\!\left(1+\gamma_t^m\right), & \text{QoS requirements satisfied}\\ \lambda_0, & \text{otherwise}\end{cases}$$
where $\lambda_0$ is a constant less than 0; $\gamma_t^m$ denotes the signal-to-noise ratio of the m-th D2D user at time t; $B^D$ denotes the D2D user bandwidth;
the pre-transition state s, the executed action a, the post-transition state s' and the reward r are uploaded to the experience pool in the form of the experience group (s, a, s', r).
4. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1,
wherein each agent selects the communication mode to be adopted at time t as follows:
judge whether an idle channel exists in the system: if so, the dedicated mode is adopted for communication;
otherwise, judge whether the QoS requirements of the D2D user and the cellular user are still met after the resource block is multiplexed: if so, the D2D user enters the multiplexing mode and communicates on the shared cellular user resource; otherwise, it enters the waiting mode and does not communicate until an idle channel appears in the system, at which point it initiates a communication request again.
5. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 1, wherein the accumulated reward is expressed as:
$$R_t=\mathbb{E}\left[\sum_{n=0}^{\infty}\gamma^{\,n}\,r_{t+n}\right]$$
where $\gamma$ denotes the discount factor, whose value lies in the interval [0,1]; $\mathbb{E}[\cdot]$ denotes the reward expectation; $r_{t+n}$ denotes the instant reward.
6. The D2D user resource allocation method based on the deep reinforcement learning algorithm according to claim 5, wherein the policy optimization is performed on each D2D user by using the MAAC algorithm, the centralized training is performed by randomly sampling small amounts of data from the experience pool, the predicted value network is updated by using the TD algorithm, the parameters of the predicted value network are updated by using the gradient descent method, the cumulative rewards are calculated based on rewards obtained by performing actions by the agent, the policy gradient is set according to the cumulative rewards, and the parameters of the predicted policy network are cyclically updated by using the gradient ascent method based on the policy gradient, comprising:
in a multi-agent environment, a prediction strategy network pi= { pi of all agents is adopted 12 …π N A predictive value network q= { Q 1 ,Q 2 …Q N The parameters of } are respectively defined asAnd->Target policy network pi ' = (pi ' of all agents ' 1 ,π′ 2 ......π′ N ) Target value network Q ' = (Q ' ' 1 ,Q′ 2 ......Q′ N ) The parameters of (2) are defined as +.>And->
Judging whether the number of experience groups stored in the experience pool meets a preset threshold value or not: if yes, executing centralized training, otherwise, not performing operation;
wherein the centralized training comprises:
randomly sampling small batches from an experience pool, and establishing a current round training data set;
the predictive strategy network of the ith agent is in state s t Generating a selection action a for input by adopting epsilon-greedy strategy t Policy a of (a), agent performs action a t The state transitions to s' and gets rewards r t The method comprises the steps of carrying out a first treatment on the surface of the Wherein, epsilon-greed policy expression is:
wherein A represents an action strategy of an agent; the value of epsilon is continuously attenuated along with the learning process;
approximating the action cost function with the predictive value network, updating the predictive value network with the TD algorithm, and learning the Q function, i.e., the action cost function, with the Belman equation Predictive value network of the ith agent to intelligentState s of body t And action a t For input, output action cost function->The target value network takes the converted state s 'and the next action a' as input and outputs the next action value function +.>
According to the outputs of the predictive value network and the target value network, the predictive value network is updated by minimizing a loss function using a function approximation method; wherein the loss function expression is:
L(θ^Q) = E[ (y_i - Q_i(s, a | θ^Q))^2 ]
wherein y_i is the target value, generated by the target value network, y_i = r_i + γ Q'_i(s', a' | θ^{Q'}); γ represents the discount factor, whose value lies in the interval [0,1]; the smaller γ is, the less attention is paid to future returns; when γ equals 0, only the immediate return is considered, and as γ approaches 1, future returns are emphasized more and more; Q_i(s, a | θ^Q) is the predicted value output by the predictive value network;
The TD-error is defined as δ_i = y_i - Q_i(s, a | θ^Q); the parameter θ^Q of the predictive value network is updated by the gradient descent method so that the TD-error, and thus the prediction error, is reduced;
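A minimal PyTorch-style sketch of this critic update (minimizing the squared TD-error by gradient descent); the network interfaces, the optimizer, and the batch layout are assumptions for illustration:

import torch
import torch.nn.functional as F

def update_critic(critic, target_critic, target_actor, optimizer, batch, gamma=0.95):
    s, a, r, s_next = batch                              # tensors sampled from the experience pool
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * target_critic(s_next, a_next)    # target value y_i
    q = critic(s, a)                                     # predicted Q_i(s, a | theta_Q)
    loss = F.mse_loss(q, y)                              # mean squared TD-error
    optimizer.zero_grad()
    loss.backward()                                      # gradient descent on theta_Q
    optimizer.step()
    return loss.item()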
According to the cumulative reward of the i-th agent, the strategy gradient is defined, with the expression:
∇_{θ^π} J(π_i) = E_{s,a ~ D}[ ∇_a Q_i(s, a | θ^Q) |_{a = π_i(s)} ∇_{θ^π} π_i(s | θ^π) ]
wherein ∇_a Q_i(s, a | θ^Q) represents the gradient of the Q function obtained from the predictive value network; ∇_{θ^π} π_i(s | θ^π) represents the deterministic policy gradient of the prediction strategy network; D represents the experience pool;
Based on the strategy gradient, the parameters θ^π of the prediction strategy network are updated by the gradient ascent method.
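A matching sketch of the actor update: ascending along the strategy gradient is implemented by descending on the negated critic output, so the gradient flows through ∇_a Q and ∇_{θ^π} π as above (names and interfaces are again illustrative assumptions):

def update_actor(actor, critic, optimizer, states):
    actions = actor(states)                          # a = pi(s | theta_pi)
    actor_loss = -critic(states, actions).mean()     # maximizing Q == minimizing -Q
    optimizer.zero_grad()
    actor_loss.backward()                            # grad_a Q * grad_theta_pi pi via autograd
    optimizer.step()
    return actor_loss.item()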
7. The D2D user resource allocation method based on the deep reinforcement learning algorithm of claim 6, wherein the input of the predictive value network introduces a neighbor user mechanism, specifically: setting a distance constraint value Z_o;
A j-th agent whose distance Z_{i~j} to the i-th agent is less than the constraint value Z_o is placed in the neighbor set O_i = {D2D_j | Z_{i~j} ≤ Z_o}, j ∈ N, and the i-th agent and the j-th agent are neighbor users; the distance between different agents is the distance between their D2D transmitters, calculated with the Euclidean distance formula; for the i-th agent at position Z_i = (X_i, Y_i) and the j-th agent at position Z_j = (X_j, Y_j), the distance expression is:
Z_{i~j} = sqrt( (X_i - X_j)^2 + (Y_i - Y_j)^2 )
The input of the predictive value network of the i-th agent includes the state and action of the i-th agent as well as the states and actions of the agents in set O_i, and the action value function Q_i of the i-th agent is output.
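A short sketch of the neighbor-set construction from transmitter coordinates using the Euclidean distance; the coordinate list and constraint value in the example are made up for illustration:

import math

def neighbor_set(i, positions, z_o):
    """Indices j whose D2D transmitter lies within distance z_o of agent i."""
    xi, yi = positions[i]
    return [
        j for j, (xj, yj) in enumerate(positions)
        if j != i and math.hypot(xi - xj, yi - yj) <= z_o
    ]

# example: positions = [(0, 0), (3, 4), (30, 40)], z_o = 10  ->  neighbor_set(0, positions, 10) == [1]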
8. The D2D user resource allocation method based on the deep reinforcement learning algorithm according to claim 1, wherein an eligibility trace mechanism is introduced into the update process of the parameters θ^π and θ^Q of the prediction strategy network and the predictive value network, specifically:
θ^π ← θ^π + α^π δ z^π
θ^Q ← θ^Q + α^Q δ z^Q
wherein δ represents the TD-error; Q(s, a | θ^Q) represents the action value function output by the predictive value network; α^π and α^Q represent the learning rates; G_t^λ represents the λ-return of the n-step temporal-difference error, with expression:
G_t^λ = (1 - λ) Σ_{n=1}^{T-t-1} λ^{n-1} G_{t:t+n} + λ^{T-t-1} G_t
wherein T represents the final time step; λ is the attenuation factor parameter, whose value lies in the interval [0,1]; when λ = 0, the λ-return is G_{t:t+1}, i.e., the single-step return, and the λ-return update algorithm is the single-step temporal-difference error algorithm; when λ = 1, the λ-return is G_t, and the λ-return update algorithm is the Monte Carlo algorithm;
z^π represents the eligibility trace of the prediction strategy network and z^Q represents the eligibility trace of the predictive value network; they are updated as follows:
z^π ← γλ z^π + ∇_{θ^π} π(s | θ^π)
z^Q ← γλ z^Q + ∇_{θ^Q} Q(s, a | θ^Q)
wherein λ is the attenuation rate parameter, with value in [0,1]; γ is the discount coefficient; ∇_{θ^π} π(s | θ^π) represents the gradient of the prediction strategy network; ∇_{θ^Q} Q(s, a | θ^Q) represents the gradient of the predictive value network; the eligibility trace accumulates a gradient value at each step and decays with γλ, tracking the components of the weight vector that contributed, positively or negatively, to the most recent state estimate.
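A simplified NumPy sketch of one accumulating-trace step for a single parameter vector; the grad argument stands in for the actual network gradient, and the learning-rate and decay values are illustrative:

import numpy as np

def trace_step(theta, z, grad, delta, alpha=0.01, gamma=0.95, lam=0.9):
    """Decay the trace, accumulate the current gradient, then apply theta <- theta + alpha*delta*z."""
    z = gamma * lam * z + grad           # z <- gamma * lambda * z + gradient
    theta = theta + alpha * delta * z    # parameter update driven by the TD-error delta
    return theta, z

# example: theta = np.zeros(4); z = np.zeros(4); grad = np.ones(4)
# theta, z = trace_step(theta, z, grad, delta=0.5)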
9. The D2D user resource allocation method based on the deep reinforcement learning algorithm according to claim 1, wherein the parameter soft update process of the target policy network and the target value network is as follows:
θ^{π'} ← τ θ^π + (1 - τ) θ^{π'}
θ^{Q'} ← τ θ^Q + (1 - τ) θ^{Q'}
wherein θ^{π'} represents the parameters of the target policy network; τ represents the parameter update coefficient, whose value lies in the interval [0,1]; θ^π represents the parameters of the prediction strategy network; θ^{Q'} represents the parameters of the target value network; θ^Q represents the parameters of the predictive value network.
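A sketch of this soft (Polyak) update written for PyTorch parameter lists; treating the networks as torch.nn.Module instances is an assumption for illustration:

import torch

def soft_update(target_net, source_net, tau=0.01):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), source_net.parameters()):
            tgt.data.copy_(tau * src.data + (1.0 - tau) * tgt.data)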
10. A computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of any of claims 1-9.
CN202310426343.3A 2023-04-20 2023-04-20 D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm Pending CN116456493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310426343.3A CN116456493A (en) 2023-04-20 2023-04-20 D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310426343.3A CN116456493A (en) 2023-04-20 2023-04-20 D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm

Publications (1)

Publication Number Publication Date
CN116456493A true CN116456493A (en) 2023-07-18

Family

ID=87129913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310426343.3A Pending CN116456493A (en) 2023-04-20 2023-04-20 D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN116456493A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114937A (en) * 2023-09-07 2023-11-24 深圳市真实智元科技有限公司 Method and device for generating exercise song based on artificial intelligence
CN116939668A (en) * 2023-09-15 2023-10-24 清华大学 Method and device for distributing communication resources of vehicle-mounted WiFi-cellular heterogeneous network
CN116939668B (en) * 2023-09-15 2023-12-12 清华大学 Method and device for distributing communication resources of vehicle-mounted WiFi-cellular heterogeneous network
CN117455795A (en) * 2023-10-27 2024-01-26 南京航空航天大学 Multi-mode image denoising method based on reinforcement learning
CN117176213A (en) * 2023-11-03 2023-12-05 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network
CN117176213B (en) * 2023-11-03 2024-01-30 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network

Similar Documents

Publication Publication Date Title
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
CN111666149B (en) Ultra-dense edge computing network mobility management method based on deep reinforcement learning
CN110267338B (en) Joint resource allocation and power control method in D2D communication
CN109862610B (en) D2D user resource allocation method based on deep reinforcement learning DDPG algorithm
CN111629380B (en) Dynamic resource allocation method for high concurrency multi-service industrial 5G network
CN110167176B (en) Wireless network resource allocation method based on distributed machine learning
CN110809306A (en) Terminal access selection method based on deep reinforcement learning
US20210326695A1 (en) Method and apparatus employing distributed sensing and deep learning for dynamic spectrum access and spectrum sharing
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN105379412A (en) System and method for controlling multiple wireless access nodes
CN114885340B (en) Ultra-dense wireless network power distribution method based on deep migration learning
CN114051252B (en) Multi-user intelligent transmitting power control method in radio access network
CN113423110A (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN114828018A (en) Multi-user mobile edge computing unloading method based on depth certainty strategy gradient
Gao et al. Reinforcement learning based resource allocation in cache-enabled small cell networks with mobile users
CN117119486B (en) Deep unsupervised learning resource allocation method for guaranteeing long-term user rate of multi-cell cellular network
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
CN113038583A (en) Inter-cell downlink interference control method, device and system suitable for ultra-dense network
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN116963034A (en) Emergency scene-oriented air-ground network distributed resource scheduling method
CN113890653B (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN112953666B (en) Spectrum prediction switching method based on channel quality in cognitive wireless network
CN113691334B (en) Cognitive radio dynamic power distribution method based on secondary user group cooperation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination