CN109862610A - A D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm - Google Patents
A D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm
- Publication number
- CN109862610A (application CN201910013868.8A / CN201910013868A)
- Authority
- CN
- China
- Prior art keywords
- user
- channel
- time
- cellular user
- data rate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm. Using information about the cellular users and D2D users, the method applies deep reinforcement learning to obtain a jointly optimized channel allocation and transmit power strategy for the D2D users. By selecting a suitable transmit power and shared channel, each D2D user reduces its interference to the cellular users while maximizing its own information rate, so that efficient resource allocation is achieved without degrading cellular-user QoS, the throughput of the cellular network is improved, and the requirements of green communication are met. The DDPG algorithm effectively solves the joint optimization of D2D channel allocation and power control: it behaves stably when optimizing over continuous action spaces, and it needs far fewer time steps to reach the optimal solution than DQN. Compared with value-function-based DRL methods, the deep policy gradient method built on the Actor-Critic (AC) framework is more efficient and converges faster.
Description
Technical field
The invention belongs to the field of wireless communication technology, and more particularly relates to a D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm.
Background art
With the rapid growth of local wireless communication services, the traffic load on cellular networks keeps increasing. Device-to-Device (D2D) communication, one of the key 5G communication technologies, allows adjacent terminals to exchange data directly with each other under the control of the base station, forming a data-sharing network. By reusing the channel resources of the cellular network, it relieves the burden on the base station, improves spectrum utilization, and increases system throughput.
D2D communication is a new technique that lets terminals communicate directly by sharing local resources. It can relieve the load on cellular base stations, increase the spectral efficiency of the cellular system, reduce terminal transmit power, raise the overall system throughput, and to some extent alleviate the shortage of spectrum resources in wireless communication systems. D2D users can communicate in three modes: (1) cellular mode, which works like conventional cellular communication, i.e., the information between the two users is relayed through the base station; this mode is usually selected when the two users are far apart; (2) dedicated-channel mode, in which the two users communicate directly without relaying through the base station, using a dedicated channel; (3) shared-channel mode, in which the two users also communicate directly but, unlike the dedicated-channel mode, share a channel with a cellular user (CU).
In a D2D communication system, applying D2D technology in a cellular network can effectively offload base-station traffic and improve spectrum utilization. However, when D2D users share the channels of cellular users, they interfere with the users already accessing those channels, degrading user performance and overall system performance. Therefore, how D2D users autonomously select suitable communication channels and transmit powers directly affects the quality of service of the whole communication system.
Summary of the invention
In view of the drawbacks of the prior art, the object of the present invention is to solve the technical problem in the prior art that D2D users, when sharing the channels of cellular users, interfere with the users already accessing those channels and degrade their performance.
To achieve the above object, in a first aspect, an embodiment of the invention provides a D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm, in which the D2D users and the cellular users communicate in shared-channel mode. The method comprises the following steps:
Step S1. Collect the achievable data rates and transmit powers of the D2D users, the achievable data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users, and set the target data rate of the cellular users;
Step S2. Establish a deep reinforcement learning model from the achievable data rates and transmit powers of the D2D users, the achievable and target data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users;
Step S3. Optimize the deep reinforcement learning model with the DDPG algorithm;
Step S4. Obtain the optimal D2D transmit power and channel allocation strategy from the optimized deep reinforcement learning model.
Specifically, the achievable data rate R_m(t) of the m-th D2D pair at time t is computed as:
R_m(t) = B log2(1 + Γ_m(t)), with Γ_m(t) = P_m^d(t) h_m(t) / (P_c h_c(t) + σ1²)
where B is the channel bandwidth, Γ_m(t) is the received SINR of the m-th D2D pair at time t, P_m^d(t) is the transmit power of the m-th D2D pair at time t, P_c is the transmit power of the cellular user, h_m(t) is the channel coefficient between the two D2D users forming the pair, h_c(t) is the channel coefficient between the cellular user and the D2D user sharing its channel, and σ1² is the additive white Gaussian noise power on the communication link between the cellular user and the D2D user sharing its channel;
The achievable data rate R_c(t) at time t of the cellular user sharing its channel with the m-th D2D pair is computed as:
R_c(t) = B log2(1 + Γ_c(t)), with Γ_c(t) = P_c h_c′(t) / (P_m^d(t) h_m′(t) + σ2²)
where B is the channel bandwidth, Γ_c(t) is the received SINR at time t of the cellular user sharing its channel with the m-th D2D pair, P_m^d(t) is the transmit power of the m-th D2D pair at time t, P_c is the transmit power of the cellular user, h_c′(t) is the channel coefficient between the cellular user and the base station, h_m′(t) is the channel coefficient between the D2D user and the base station, σ2² is the additive white Gaussian noise power on the communication link between the D2D user and the base station, and 1 ≤ m ≤ M, where M is the total number of D2D pairs within the coverage of the base station.
Specifically, the channel-sharing information of the m-th D2D pair at time t is: if β_{m,n}(t) = 1, then the n-th channel is shared by its cellular user and the m-th D2D pair, and β_{m,i}(t) = 0 for all i ≠ n, i.e., each D2D pair shares exactly one channel at a time, with 1 ≤ m ≤ M and 1 ≤ n ≤ N, where M is the total number of D2D pairs within the coverage of the base station and N is the total number of channels available at the base station.
Specifically, the established deep reinforcement learning model includes:
A state space, being the cellular users' satisfaction with the quality of service; the state at time t is defined as s_m^n(t). If the m-th D2D pair shares the n-th channel, then s_m^n(t) = 1 when R_c(t) ≥ R_th, and s_m^n(t) = 0 otherwise, where R_th is the target data rate of the cellular user, R_c(t) is the achievable data rate of the cellular user, and s_m^n(t) is the state of the m-th D2D pair sharing the n-th channel at time t;
An action space of the D2D users, containing the two variables transmit power and shared channel, expressed as a_m(t) = {P_m^d(t), β_{m,n}(t)}, where P_m^d(t) is the transmit power of the m-th D2D pair at time t and β_{m,n}(t) indicates whether the n-th channel is shared by its cellular user and the m-th D2D pair;
A reward function of the D2D users: r_m(t) = R_m(t) when R_c(t) ≥ R_th, and r_m(t) = Ψ otherwise, where R_c(t) is the achievable data rate of the cellular user, R_th is the target data rate of the cellular user, R_m(t) is the achievable data rate of the D2D pair, and Ψ is a negative constant;
A value function Q(s_m^n(t), a_m(t)) denoting the discounted reward obtained by starting from state s_m^n(t) and executing action a_m(t), with the Q-value update:
Q(s_m^n(t), a_m(t)) = r_m(t) + γ max_{a_m(t+1)∈A} Q(s_m^n(t+1), a_m(t+1))
where r_m(t) is the immediate reward, γ is the discount factor, s_m^n(t+1) is the state of the m-th D2D pair sharing the n-th channel at time (t+1), a_m(t+1) is the action of the m-th D2D pair at time (t+1), A is the action space formed by the actions a_m(t), and N is the total number of channels available at the base station.
Specifically, optimizing the deep reinforcement learning model with the DDPG algorithm comprises the following steps:
S301. Initialize the training episode index p to 1;
S302. Initialize the time step t within episode p to 1;
S303. The online Actor (policy) network outputs an action a_t for the input state s_t, obtains the immediate reward r_t, and transitions to the next state s_{t+1}, yielding the training sample (s_t, a_t, r_t, s_{t+1});
S304. Store the training sample (s_t, a_t, r_t, s_{t+1}) in the experience replay pool;
S305. Randomly sample T training samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool to form a data set, and feed it to the online Actor network, the online Critic (evaluation) network, the target Actor network, and the target Critic network;
S306. From the sampled data set, the target Actor network outputs the action a′_{i+1} for state s_{i+1}; the target Critic network outputs the value Q′(s_{i+1}, a′_{i+1}|θ′) for state s_{i+1} and action a′_{i+1}, which enters the loss-function gradient; the online Critic network outputs the value Q(s_i, a_i|θ) for state s_i, action a_i, and immediate reward r_i, which enters both the sampled policy gradient and the loss-function gradient; the online Critic parameters θ are updated along the loss-function gradient; the online Actor network outputs the action a_i into the sampled policy gradient and updates its parameters δ accordingly, with 1 ≤ i ≤ T;
S307. Update the target network parameters δ′ and θ′ from the online network parameters δ and θ, respectively:
δ′ ← τδ + (1 − τ)δ′;
θ′ ← τθ + (1 − τ)θ′;
where τ is the weight of the online network parameters;
S308. Check whether t < K, where K is the total number of time steps in episode p; if so, set t = t + 1 and go to step S303; otherwise go to step S309;
S309. Check whether p < I, where I is the preset number of training episodes; if so, set p = p + 1 and go to step S302; otherwise, the optimization ends and the optimized deep reinforcement learning model is obtained.
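The control flow of steps S301-S309 can be sketched as follows; the network updates themselves are stubbed out, and all names and default values are illustrative:

```python
import random
from collections import deque

def train_ddpg(env_step, actor, num_episodes=2, steps_per_episode=5,
               batch_size=4, buffer_size=100):
    """Skeleton of steps S301-S309: collect transitions with the online
    actor, store them in the replay pool, and sample minibatches for the
    network updates (the updates themselves are left as a stub)."""
    replay = deque(maxlen=buffer_size)          # experience replay pool
    for p in range(num_episodes):               # S301/S309: episode loop
        s = 0                                   # initial state
        for t in range(steps_per_episode):      # S302/S308: time-step loop
            a = actor(s)                        # S303: online actor picks action
            r, s_next = env_step(s, a)          # S303: reward and next state
            replay.append((s, a, r, s_next))    # S304: store transition
            if len(replay) >= batch_size:       # S305: sample a minibatch
                batch = random.sample(replay, batch_size)
                # S306/S307: critic/actor updates and soft target updates
                # would consume `batch` here
            s = s_next
    return replay
```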
Specifically, the parameter update gradients are:
∇_δ J ≈ (1/T) Σ_{i=1..T} ∇_a Q(s_i, a|θ)|_{a=π(s_i|δ)} ∇_δ π(s_i|δ)
and θ is updated along the gradient ∇_θ Loss of the loss function.
Specifically, step S4 is: input the current system state s_m(t) and output the optimal action a_m(t) = {P_m^d(t), β_{m,n}(t)}, obtaining the optimal D2D transmit power P_m^d(t) and allocated channel β_{m,n}(t).
In a second aspect, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the D2D user resource allocation method described in the first aspect above.
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:
1. Using information about the cellular users and D2D users, the invention proposes a deep reinforcement learning optimization strategy that obtains the jointly optimized D2D channel allocation and transmit power. By selecting a suitable transmit power and shared channel, each D2D user reduces its interference to the cellular users while maximizing its own information rate, achieving efficient resource allocation without degrading cellular-user QoS, improving the throughput of the cellular network, and meeting the requirements of green communication.
2. With the DDPG algorithm, the invention effectively solves the joint optimization of D2D channel allocation and power control. The algorithm is stable when optimizing over continuous action spaces, and it needs far fewer time steps than DQN to reach the optimal solution. Compared with value-function-based DRL methods, the deep policy gradient method built on the AC framework is more efficient and converges faster.
Brief description of the drawings
Fig. 1 is a flowchart of a D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the D2D user resource allocation model according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the deep reinforcement learning framework based on the Actor-Critic model according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the DDPG algorithm framework according to an embodiment of the invention.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.
The object of the present invention is to jointly optimize the transmit power and channel allocation of the D2D users so as to maximize the information rate of the D2D users and improve spectrum utilization without degrading the QoS of the cellular users. Applying the AC-based DDPG algorithm framework to the system model with a deep learning method yields the optimal D2D power control and channel allocation strategy in the cellular network; that is, any D2D pair obtains an optimal transmit power and shared-channel assignment that maximizes the network capacity while guaranteeing cellular-user QoS.
As shown in Fig. 1, in a D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm, the D2D users and the cellular users communicate in shared-channel mode, and the method comprises the following steps:
Step S1. Collect the achievable data rates and transmit powers of the D2D users, the achievable data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users, and set the target data rate of the cellular users;
Step S2. Establish a deep reinforcement learning model from the achievable data rates and transmit powers of the D2D users, the achievable and target data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users;
Step S3. Optimize the deep reinforcement learning model with the DDPG algorithm;
Step S4. Obtain the optimal D2D transmit power and channel allocation strategy from the optimized deep reinforcement learning model.
Step S1. Collect the achievable data rates and transmit powers of the D2D users, the achievable data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users, and set the target data rate of the cellular users.
As shown in Fig. 2, the D2D user resource allocation model contains multiple cellular users and D2D users within the coverage of a base station (BS). D2D users can transmit only by sharing the channels of the cellular users; each channel is allocated to exactly one cellular user, and each cellular user shares its channel with at most one D2D pair at any time. Because the channel is shared, the cellular user and the D2D pair interfere with each other.
Assume there are M D2D pairs within the coverage of the base station, and that the base station serves N cellular users and allocates N available channels, each of which can be assigned to only one cellular user.
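Under these assumptions, the channel-sharing setup can be illustrated with a small Python sketch; the indicator name `beta` and all parameters are illustrative:

```python
import random

def assign_channels(num_d2d_pairs, num_channels, seed=0):
    """Toy setup of the model: each of the N channels is held by exactly
    one cellular user, and each D2D pair picks one channel to share,
    giving the binary sharing indicator beta[m][n]."""
    rng = random.Random(seed)
    beta = [[0] * num_channels for _ in range(num_d2d_pairs)]
    for m in range(num_d2d_pairs):
        n = rng.randrange(num_channels)  # each pair shares exactly one channel
        beta[m][n] = 1
    return beta
```

Each row of `beta` sums to 1, matching the constraint that a D2D pair shares exactly one channel at a time.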
For the m-th D2D pair, the channel-sharing information at time t is: if β_{m,n}(t) = 1, then the n-th channel is shared by its cellular user and the m-th D2D pair, and β_{m,i}(t) = 0 for all i ≠ n, i.e., each D2D pair shares exactly one channel at a time.
Assuming there is no interference between users operating on different channels, the instantaneous received signal-to-interference-plus-noise ratios (SINR) of the cellular user and the D2D user at time t are computed separately.
The received SINR of the m-th D2D pair at time t is:
Γ_m(t) = P_m^d(t) h_m(t) / (P_c h_c(t) + σ1²)
where P_m^d(t) is the transmit power of the m-th D2D pair at time t, P_c is the transmit power of the cellular user, h_m(t) is the channel coefficient between the two D2D users forming the pair, h_c(t) is the channel coefficient between the cellular user and the D2D user sharing its channel, and σ1² is the additive white Gaussian noise power on the communication link between the cellular user and the D2D user sharing its channel.
The corresponding achievable data rate of the D2D pair at time t is:
R_m(t) = B log2(1 + Γ_m(t))
where B is the channel bandwidth and Γ_m(t) is the received SINR of the m-th D2D pair at time t.
The received SINR at time t of the cellular user sharing its channel with the m-th D2D pair is:
Γ_c(t) = P_c h_c′(t) / (P_m^d(t) h_m′(t) + σ2²)
where P_m^d(t) is the transmit power of the m-th D2D pair at time t, P_c is the transmit power of the cellular user, h_c′(t) is the channel coefficient between the cellular user and the base station, h_m′(t) is the channel coefficient between the D2D user and the base station, and σ2² is the additive white Gaussian noise power on the communication link between the D2D user and the base station.
The corresponding achievable data rate of the cellular user at time t is:
R_c(t) = B log2(1 + Γ_c(t))
where B is the channel bandwidth and Γ_c(t) is the received SINR at time t of the cellular user sharing its channel with the m-th D2D pair.
When the achievable data rate of the cellular user is greater than or equal to its target data rate, the cellular user is satisfied with the quality of service; otherwise it is not. Setting the target data rate of the cellular users therefore controls the quality of service of the communication system.
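This QoS criterion amounts to a simple threshold test; a minimal Python illustration (function and parameter names are illustrative):

```python
def cellular_qos_state(r_c, r_th):
    """QoS satisfaction of the cellular user: 1 when its achievable rate
    meets the target rate R_th, 0 otherwise. This is also the state
    variable of the reinforcement learning model described next."""
    return 1 if r_c >= r_th else 0
```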
Step S2. Establish a deep reinforcement learning model from the achievable data rates and transmit powers of the D2D users, the achievable and target data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users.
To solve the joint optimization problem in a high-dimensional continuous space efficiently, a deep reinforcement learning model is established with the D2D users as agents, and a deep reinforcement learning optimization strategy is proposed that uses the information about the cellular users and D2D users. Under the premise of guaranteeing cellular-user QoS, the transmit power and channel allocation of the D2D users are jointly optimized to achieve efficient resource allocation and improve system capacity.
Given the cellular users' available channels and transmit powers, a deep reinforcement learning model with the D2D users as agents is built on the cellular-and-D2D communication system model. Reinforcement learning has four main elements: policy, reward, action, and environment. Its goal is to learn an optimal policy such that the actions selected by the agent obtain the maximum reward from the environment. The reward is computed by a function, called the reward function. To measure the long-term effect of reinforcement learning, a value function is commonly used in place of the reward function: it measures not only the immediate reward of an action but also the rewards accumulated over the sequence of states that may follow. The environment is the state space; an action is an element of the action space allowed in each state; and the reward is the positive or negative value obtained by selecting an action and entering a state.
State space: the state space is defined as the cellular users' satisfaction with the quality of service, and the state at time t is defined as s_m^n(t). If the m-th D2D pair shares the n-th channel, then s_m^n(t) = 1 when R_c(t) ≥ R_th, and s_m^n(t) = 0 otherwise, where R_th is the target data rate of the cellular user, R_c(t) is the achievable data rate of the cellular user, and s_m^n(t) is the state of the m-th D2D pair sharing the n-th channel at time t. When R_c(t) ≥ R_th, the QoS of the cellular user on the n-th channel is satisfied and s_m^n(t) = 1; when R_c(t) < R_th, the QoS of the cellular user on the n-th channel is not satisfied and s_m^n(t) = 0.
Action space: the interference to the cellular users can be reduced, and the achievable data rate of the D2D users maximized, by adjusting the shared channel or the transmit power of the D2D users. At time t the m-th D2D pair can select only one power level and one shared channel, so the action space of the D2D users contains two variables and is expressed as a_m(t) = {P_m^d(t), β_{m,n}(t)}, where P_m^d(t) is the transmit power of the m-th D2D pair at time t, β_{m,n}(t) indicates whether the n-th channel is shared by its cellular user and the m-th D2D pair, and A is the action space formed by the actions a_m(t).
Reward function: a D2D user obtains a corresponding reward for each action it takes. The reward function of the D2D users is defined as r_m(t) = R_m(t) when R_c(t) ≥ R_th, and r_m(t) = Ψ otherwise, where R_c(t) is the achievable data rate of the cellular user, R_th is the target data rate of the cellular user, R_m(t) is the achievable data rate of the D2D pair, and Ψ is a negative constant representing the cost of selecting a given action, i.e., the action cost. When the QoS of the cellular user is met, the achievable data rate of the D2D pair is its reward; otherwise the pair is penalized by the cost of the selected action.
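The reward definition above reduces to a two-branch function; a minimal Python sketch (the default penalty value for Ψ is illustrative):

```python
def d2d_reward(r_c, r_th, r_m, psi=-1.0):
    """Reward of the D2D pair: its own achievable rate R_m when the
    cellular user's QoS is met (R_c >= R_th), otherwise the negative
    action cost psi."""
    return r_m if r_c >= r_th else psi
```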
In the present invention, the deep reinforcement learning algorithm is built on Q-learning. Q-learning is a model-free reinforcement learning algorithm whose value function Q(s_m^n(t), a_m(t)) denotes the maximum discounted reward obtained by starting from state s_m^n(t) and executing action a_m(t). The Q-value update is:
Q(s_m^n(t), a_m(t)) = r_m(t) + γ max_{a_m(t+1)∈A} Q(s_m^n(t+1), a_m(t+1))
where r_m(t) is the reward function and γ is the discount factor representing the importance of future rewards: if γ is close to 0, the D2D user mainly considers the immediate reward; if γ is close to 1, it mainly considers future rewards. s_m^n(t+1) is the state of the m-th D2D pair sharing the n-th channel at time (t+1), and a_m(t+1) is the action of the m-th D2D pair at time (t+1).
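For illustration, a tabular version of this Q-value update can be written as follows; the learning-rate parameter `alpha` is an illustrative addition used in the standard incremental form of Q-learning and is not stated in the text above:

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the target
    r + gamma * max_a' Q(s', a'). The table q maps (state, action)
    pairs to values, defaulting to 0 for unseen pairs."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```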
Step S3. Optimize the deep reinforcement learning model with the DDPG algorithm.
The action space of the deep reinforcement learning model contains two variables, transmit power and shared channel, and the transmit power varies continuously within a certain range. To handle this high-dimensional action space, and in particular the joint optimization problem over a continuous action space, a Deep Deterministic Policy Gradient (DDPG) algorithm based on the Actor-Critic (AC) framework is introduced, combining Q-learning with neural networks. The DDPG algorithm contains both an Actor (policy) network and a Critic (evaluation) network, and the parameters of both are optimized by training. DDPG uses the Actor-Critic architecture of reinforcement learning and consists of four neural networks: two structurally identical Actor networks, namely the online Actor network and the target Actor network, and two structurally identical Critic networks, namely the online Critic network and the target Critic network. The target Actor and target Critic networks are mainly used to generate the training data set, while the online Actor and online Critic networks are mainly used to optimize the network parameters during training. As shown in Fig. 3, in the AC framework the Actor learns the policy through the policy gradient, while the Critic estimates the value function through policy evaluation. On the one hand, the Actor learns the policy, and the policy improvement relies on the value function estimated by the Critic; on the other hand, the Critic estimates the value function, which is itself a function of the policy. The policy and the value function depend on and influence each other, and therefore must be optimized iteratively during training.
The input of the Actor network is s_t and its output is an action a_t. The policy network approximates the policy function, π(s_t|δ) ≈ π*(s_t), where δ is the Actor network parameter. In general, the parameter δ of π(s_t|δ) should be updated in the direction that increases the Q-value. Define J(δ) = E_s[Q(a_t, s_t|θ)], where E_s[·] denotes the expectation and a_t = π(s_t|δ); finding the optimal behavior policy of the D2D user is then the process of maximizing J(δ).
The input of the Critic network is the state of the D2D user at time t and the action taken, (s_t, a_t); its outputs are the corresponding Q(s_t, a_t|θ) and the state s_{t+1}. The Critic network approximates the value function, Q(s_t, a_t|θ) ≈ Q*(s_t, a_t), where θ is the Critic network parameter, updated to reduce the loss function between the target network and the online network:
Loss = E[(Q′(s_t, a′_t|θ′) − Q(s_t, a_t|θ))²]
where Q′(s_t, a′_t|θ′) is the value function of the target network and Q(s_t, a_t|θ) is the value function of the online network.
The DDPG optimization algorithm uses experience replay. As a supervised learning model, a deep neural network requires mutually independent training samples, but the samples produced by the Q-learning process are highly correlated in time; training directly on these sequences leads to overfitting of the neural network and poor convergence. The DDPG algorithm therefore stores every transition sample (s_t, a_t, r_t, s_{t+1}) obtained from the agent's interaction with the environment at each time step in the experience replay pool, and then randomly draws T samples (s_i, a_i, r_i, s_{i+1}), 1 ≤ i ≤ T, from the pool to train the neural networks; the data sampled in this way can be regarded as mutually uncorrelated.
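The experience replay pool described here can be sketched with Python's standard library (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool: stores (s, a, r, s_next) transitions and
    returns uniformly random minibatches, breaking the temporal
    correlation between consecutive samples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest samples drop out first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling without replacement within one minibatch
        return random.sample(self.buffer, batch_size)
```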
From the sampled data (s_i, a_i, r_i, s_{i+1}), J(δ) = E_s[Q(a_i, s_i|θ)] and the loss function Loss = E[(Q′(s_i, a′_i|θ′) − Q(s_i, a_i|θ))²] are obtained, and the network parameters are then optimized by gradient descent with the parameter update gradient:
∇_δ J ≈ (1/T) Σ_{i=1..T} ∇_a Q(s_i, a|θ)|_{a=π(s_i|δ)} ∇_δ π(s_i|δ)
The DDPG algorithm improves the learning efficiency of the system and enhances the stability of the learning process. The online networks update their parameters by gradient methods such as Stochastic Gradient Descent, while the target networks update their parameters by soft updates. The target network parameters change slowly and provide the information needed for the online network updates during training; the online network parameters are updated in real time, and after a specified number of steps the online parameters are copied to the target networks. Introducing target networks makes the learning process more stable and training easier to converge; after a certain number of training iterations the system becomes the optimized system.
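The soft target update θ′ ← τθ + (1 − τ)θ′ can be illustrated on flat parameter lists (a simplification of real network weights):

```python
def soft_update(online_params, target_params, tau=0.01):
    """Soft target update: theta' <- tau*theta + (1-tau)*theta',
    applied element-wise. Small tau makes the target network track the
    online network slowly, which stabilizes training."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```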
As shown in Fig. 4, optimizing the deep reinforcement learning model with the DDPG algorithm comprises the following steps:
S301. Initialize the training episode index p to 1;
S302. Initialize the time step t within episode p to 1;
S303. The online Actor (policy) network outputs an action a_t for the input state s_t, obtains the immediate reward r_t, and transitions to the next state s_{t+1}, yielding the training sample (s_t, a_t, r_t, s_{t+1});
S304. Store the training sample (s_t, a_t, r_t, s_{t+1}) in the experience replay pool;
S305. Randomly sample T training samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool to form a data set, and feed it to the online Actor network, the online Critic (evaluation) network, the target Actor network, and the target Critic network;
S306. From the sampled data set, the target Actor network outputs the action a′_{i+1} for state s_{i+1}; the target Critic network outputs the value Q′(s_{i+1}, a′_{i+1}|θ′) for state s_{i+1} and action a′_{i+1}, which enters the loss-function gradient; the online Critic network outputs the value Q(s_i, a_i|θ) for state s_i, action a_i, and immediate reward r_i, which enters both the sampled policy gradient and the loss-function gradient; the parameters θ are updated along the loss-function gradient; the online Actor network outputs the action a_i into the sampled policy gradient and updates its parameters δ accordingly, with 1 ≤ i ≤ T;
S307. Update the target network parameters δ′ and θ′ from the online network parameters δ and θ, respectively:
δ′ ← τδ + (1 − τ)δ′;
θ′ ← τθ + (1 − τ)θ′.
S308. Check whether t < K, where K is the total number of time steps in episode p; if so, set t = t + 1 and go to step S303; otherwise go to step S309;
S309. Check whether p < I, where I is the preset number of training episodes; if so, set p = p + 1 and go to step S302; otherwise, the optimization ends and the optimized deep reinforcement learning model is obtained.
Step S4. Obtain the optimal D2D transmit power and channel allocation strategy from the optimized deep reinforcement learning model.
With the deep reinforcement learning model trained by the DDPG algorithm, the optimal channel allocation and power control strategy of the D2D users is obtained: input the current system state s_m(t) and output the optimal action a_m(t) = {P_m^d(t), β_{m,n}(t)}, yielding the optimal D2D transmit power P_m^d(t) and allocated channel β_{m,n}(t), thereby improving the capacity of the communication system without degrading cellular-user QoS.
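For illustration only, the following sketch extracts a greedy action from a learned value table; note that in the actual DDPG method of step S4 the trained Actor network outputs the action directly, so this tabular stand-in is a simplification:

```python
def best_action(q, state, actions):
    """Greedy policy extraction from a learned value table: return the
    (power, channel) action with the highest Q-value in `state`.
    Unseen (state, action) pairs score -infinity."""
    return max(actions, key=lambda a: q.get((state, a), float("-inf")))
```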
The above are merely preferred embodiments of the present application, and the protection scope of the present application is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. A D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm, characterized in that the D2D users and the cellular users communicate in shared-channel mode, and the method comprises the following steps:
Step S1. collecting the achievable data rates and transmit powers of the D2D users, the achievable data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users, and setting the target data rate of the cellular users;
Step S2. establishing a deep reinforcement learning model from the achievable data rates and transmit powers of the D2D users, the achievable and target data rates of the cellular users, and the channel-sharing information between the D2D users and the cellular users;
Step S3. optimizing the deep reinforcement learning model with the DDPG algorithm;
Step S4. obtaining the optimal D2D transmit power and channel allocation strategy from the optimized deep reinforcement learning model.
2. The D2D user resource allocation method according to claim 1, wherein the achievable data rate R_m(t) of the m-th D2D user at time t is calculated as follows:
R_m(t) = B·log2(1 + Γ_m(t))
where B is the channel bandwidth and Γ_m(t) is the received SINR of the m-th D2D user at time t, determined by the transmission power of the m-th D2D user at time t, the transmission power P_c of the cellular user, the channel coefficient h_m(t) between the D2D users forming the D2D pair, the channel coefficient h_c(t) between the cellular user and the D2D user sharing its channel, and the additive white Gaussian noise power σ1² in the communication link between the cellular user and the D2D user sharing its channel;
the achievable data rate R_c(t) at time t of the cellular user sharing a channel with the m-th D2D user is calculated as follows:
R_c(t) = B·log2(1 + Γ_c(t))
where B is the channel bandwidth and Γ_c(t) is the received SINR at time t of the cellular user sharing a channel with the m-th D2D user, determined by the transmission power of the m-th D2D user at time t, the transmission power P_c of the cellular user, the channel coefficient h′_c(t) between the cellular user and the base station, the channel coefficient h′_m(t) between the D2D user and the base station, and the additive white Gaussian noise power σ2² in the communication link between the D2D user and the base station, with 1 ≤ m ≤ M, where M is the total number of D2D user pairs within the base station signal coverage area.
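The rate formula of claim 2 is the Shannon capacity of the shared channel. A minimal sketch follows, assuming a standard SINR form (desired received power over cross-interference plus noise); the patent's own SINR formula images are not reproduced in this text, so the exact expression is an assumption:

```python
import math

def achievable_rate(bandwidth_hz, sinr):
    """Shannon rate R = B * log2(1 + SINR), as used for both R_m(t) and R_c(t)."""
    return bandwidth_hz * math.log2(1.0 + sinr)

def d2d_sinr(p_d2d, h_d2d, p_cell, h_cross, noise_power):
    """Assumed received SINR of a D2D pair reusing a cellular user's channel:
    own received power divided by (cellular interference + AWGN power)."""
    return (p_d2d * h_d2d) / (p_cell * h_cross + noise_power)
```

For example, with B = 1 MHz and SINR = 3, the achievable rate is 2 Mbit/s, since log2(4) = 2.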
3. The D2D user resource allocation method according to claim 1, wherein for the m-th D2D user pair, the channel-sharing information at time t is as follows: if the n-th channel is shared by the cellular user and the m-th D2D user pair, the sharing indicator of the n-th channel equals 1, and the sharing indicator of every other channel i, i ≠ n, equals 0, where 1 ≤ m ≤ M, 1 ≤ n ≤ N, M is the total number of D2D user pairs within the base station signal coverage area, and N is the total number of channels available at the base station.
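The channel-sharing information of claim 3 amounts to a one-hot vector per D2D pair, which can be sketched as:

```python
def sharing_indicator(n_shared, num_channels):
    """One-hot channel-sharing vector for one D2D pair: entry n is 1 iff the
    pair shares channel n with the cellular user, 0 for every i != n."""
    return [1 if i == n_shared else 0 for i in range(num_channels)]
```

Exactly one entry of the vector is 1, reflecting that each D2D pair reuses a single cellular channel at a time.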
4. The D2D user resource allocation method according to claim 1, wherein the established deep reinforcement learning model comprises:
a state space, which is the satisfaction of the cellular user with its quality of service: if the m-th D2D user shares the n-th channel, the state at time t is defined by whether the achievable data rate R_c(t) of the cellular user reaches its target data rate R_th;
an action space of the D2D user, comprising two variables, the transmission power and the shared channel: the transmission power of the m-th D2D user at time t, and the indicator of whether the n-th channel is shared by the cellular user and the m-th D2D user;
a reward function of the D2D user, defined in terms of R_c(t), the achievable data rate of the cellular user, R_th, the target data rate of the cellular user, R_m(t), the achievable data rate of the D2D user, and Ψ, a negative constant;
a value function representing the discounted reward generated after starting from a given state and executing a given action, with a Q-value update function defined in terms of the immediate reward, the discount factor γ, the state at time (t+1) when the m-th D2D user shares the n-th channel, the action of the m-th D2D user at time (t+1), the action space A formed by the actions, and N, the total number of channels available at the base station.
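A minimal sketch of the state and reward of claim 4, assuming the natural reading of the surrounding text (binary QoS-satisfaction state, reward equal to the D2D rate when the cellular target rate is met and the negative constant Ψ otherwise); the claim's own formula images are not reproduced in this text, so these exact forms are assumptions:

```python
PSI = -1.0  # the negative constant Ψ of claim 4 (value chosen arbitrarily here)

def qos_state(r_cell, r_target):
    """Binary QoS-satisfaction state: 1 if the cellular user's achievable
    rate R_c(t) reaches its target rate R_th, 0 otherwise."""
    return 1 if r_cell >= r_target else 0

def reward(r_cell, r_target, r_d2d):
    """Assumed reward shape: pay the D2D rate R_m(t) while the cellular QoS
    holds; otherwise the negative constant Ψ penalizes the interfering action."""
    return r_d2d if r_cell >= r_target else PSI
```

This shape pushes the agent to maximize its own rate only within the region where the cellular user's QoS constraint is respected.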
5. The D2D user resource allocation method according to claim 1, wherein optimizing the deep reinforcement learning model using the DDPG algorithm specifically comprises the following steps:
S301. initializing the training round counter p to 1;
S302. initializing the time step t within round p to 1;
S303. the online Actor policy network outputting an action a_t according to the input state s_t, obtaining the immediate reward r_t and transitioning to the next state s_{t+1}, thereby obtaining the training sample (s_t, a_t, r_t, s_{t+1});
S304. storing the training sample (s_t, a_t, r_t, s_{t+1}) in the experience replay pool;
S305. randomly sampling T training samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool to form a data set, and feeding it to the online Actor policy network, the online Critic evaluation network, the target Actor policy network and the target Critic evaluation network;
S306. according to the sampled data set: the target Actor policy network outputting the action a′_{i+1} according to the state s_{i+1}; the target Critic evaluation network outputting the value function Q′(s_{i+1}, a′_{i+1} | θ′) to the loss gradient function according to the state s_{i+1} and the action a′_{i+1} output by the target Actor policy network; the online Critic evaluation network outputting the value function Q(s_i, a_i | θ) to the sampled policy gradient and the loss function gradient according to the state s_i, the action a_i and the immediate reward r_i, and updating the online Critic evaluation network parameters θ according to the loss function gradient; the online Actor policy network outputting the action a_i to the sampled policy gradient and updating the online Actor policy network parameters δ accordingly, with 1 ≤ i ≤ T;
S307. updating the target network parameters δ′ and θ′ according to the online network parameters δ and θ, respectively:
δ′ ← τδ + (1 − τ)δ′;
θ′ ← τθ + (1 − τ)θ′;
where τ is the weight of the online network parameters;
S308. judging whether t < K is satisfied, where K is the total number of time steps in round p; if so, setting t = t + 1 and returning to step S303; otherwise, proceeding to step S309;
S309. judging whether p < I is satisfied, where I is the preset threshold of the number of training rounds; if so, setting p = p + 1 and returning to step S302; otherwise, terminating the optimization to obtain the optimized deep reinforcement learning model.
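Steps S304-S305 and S307 can be sketched directly. The replay pool and the soft target update below are standard DDPG components matching the claimed operations; network parameters are shown as plain float lists purely for illustration:

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool of S304-S305: store (s, a, r, s') transitions
    and sample T of them uniformly at random."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # oldest samples are evicted first

    def store(self, transition):
        self.buf.append(transition)        # S304

    def sample(self, t):
        return random.sample(self.buf, t)  # S305: T samples without replacement

def soft_update(online, target, tau):
    """S307: theta' <- tau*theta + (1 - tau)*theta', applied element-wise;
    tau is the weight of the online network parameters."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]
```

With a small τ (e.g. 0.001), the target networks track the online networks slowly, which is what stabilizes the Critic's bootstrapped targets.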
6. The D2D user resource allocation method according to claim 5, wherein the parameter-update gradient formulas are as follows:
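The gradient formula images of claim 6 do not survive in this text. For reference only, the standard DDPG update gradients, written in the claim's notation (online Critic parameters θ, online Actor parameters δ, target networks primed, batch size T; the Actor policy symbol μ is an assumption), take the form:

```latex
% Critic loss over the sampled batch of T transitions,
% with targets computed by the target networks:
L(\theta) = \frac{1}{T}\sum_{i=1}^{T}
  \Bigl(r_i + \gamma\, Q'\bigl(s_{i+1}, a'_{i+1} \mid \theta'\bigr)
        - Q\bigl(s_i, a_i \mid \theta\bigr)\Bigr)^{2}

% Deterministic policy gradient for the online Actor parameters \delta:
\nabla_{\delta} J \approx \frac{1}{T}\sum_{i=1}^{T}
  \nabla_{a} Q\bigl(s_i, a \mid \theta\bigr)\Big|_{a=\mu(s_i \mid \delta)}\,
  \nabla_{\delta}\, \mu\bigl(s_i \mid \delta\bigr)
```

These match the operations of step S306: the Critic is updated by the loss gradient ∇L(θ), and the Actor by the sampled policy gradient.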
7. The D2D user resource allocation method according to claim 1, wherein step S4 specifically comprises: inputting the state information s_m(t) of the system at the current moment, and outputting the optimal action policy, thereby obtaining the optimal D2D user transmission power and the allocated channel.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the D2D user resource allocation method according to any one of claims 1 to 7 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910013868.8A CN109862610B (en) | 2019-01-08 | 2019-01-08 | D2D user resource allocation method based on deep reinforcement learning DDPG algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109862610A true CN109862610A (en) | 2019-06-07 |
CN109862610B CN109862610B (en) | 2020-07-10 |
Family
ID=66894095
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910013868.8A Active CN109862610B (en) | 2019-01-08 | 2019-01-08 | D2D user resource allocation method based on deep reinforcement learning DDPG algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109862610B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108038545A (en) * | 2017-12-06 | 2018-05-15 | 湖北工业大学 | Fast learning algorithm based on Actor-Critic neural-network continuous control |
CN108848561A (en) * | 2018-04-11 | 2018-11-20 | 湖北工业大学 | Heterogeneous cellular network joint optimization method based on deep reinforcement learning |
CN108924935A (en) * | 2018-07-06 | 2018-11-30 | 西北工业大学 | Power allocation method in power-domain NOMA based on a reinforcement learning algorithm |
Non-Patent Citations (3)
Title |
---|
ACHRAF MOUSSAID等: "Deep Reinforcement Learning-based Data Transmission for D2D Communications", 《2018 14TH INTERNATIONAL CONFERENCE ON WIRELESS AND MOBILE COMPUTING, NETWORKING AND COMMUNICATIONS (WIMOB)》 * |
EDUARDO BEJAR等: "Deep Reinforcement Learning Based Neuro-Control for a Two-Dimensional Magnetic Positioning System", 《2018 4TH INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND ROBOTICS》 * |
JIAYING YIN等: "JOINT CONTENT POPULARITY PREDICTION AND CONTENT DELIVERY POLICY FOR CACHE-ENABLED D2D NETWORKS: A DEEP REINFORCEMENT LEARNING APPROACH", 《2018 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP)》 * |
Cited By (64)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN110518580A (en) * | 2019-08-15 | 2019-11-29 | 上海电力大学 | A kind of active distribution network running optimizatin method for considering microgrid and actively optimizing |
CN110518580B (en) * | 2019-08-15 | 2023-04-28 | 上海电力大学 | Active power distribution network operation optimization method considering micro-grid active optimization |
CN110505604A (en) * | 2019-08-22 | 2019-11-26 | 电子科技大学 | A kind of method of D2D communication system access frequency spectrum |
CN110505604B (en) * | 2019-08-22 | 2021-07-09 | 电子科技大学 | Method for accessing frequency spectrum of D2D communication system |
CN110493826B (en) * | 2019-08-28 | 2022-04-12 | 重庆邮电大学 | Heterogeneous cloud wireless access network resource allocation method based on deep reinforcement learning |
CN110493826A (en) * | 2019-08-28 | 2019-11-22 | 重庆邮电大学 | A kind of isomery cloud radio access network resources distribution method based on deeply study |
CN110784882B (en) * | 2019-10-28 | 2022-06-28 | 南京邮电大学 | Energy acquisition D2D communication resource allocation method based on reinforcement learning |
CN110784882A (en) * | 2019-10-28 | 2020-02-11 | 南京邮电大学 | Energy acquisition D2D communication resource allocation method based on reinforcement learning |
CN110769514A (en) * | 2019-11-08 | 2020-02-07 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN110769514B (en) * | 2019-11-08 | 2023-05-12 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN112953601B (en) * | 2019-12-10 | 2023-03-24 | 中国科学院深圳先进技术研究院 | Application of optimization-driven hierarchical deep reinforcement learning in hybrid relay communication |
CN112953601A (en) * | 2019-12-10 | 2021-06-11 | 中国科学院深圳先进技术研究院 | Application of optimization-driven hierarchical deep reinforcement learning in hybrid relay communication |
CN111083767B (en) * | 2019-12-23 | 2021-07-27 | 哈尔滨工业大学 | Heterogeneous network selection method based on deep reinforcement learning |
CN111083767A (en) * | 2019-12-23 | 2020-04-28 | 哈尔滨工业大学 | Heterogeneous network selection method based on deep reinforcement learning |
CN111181618A (en) * | 2020-01-03 | 2020-05-19 | 东南大学 | Intelligent reflection surface phase optimization method based on deep reinforcement learning |
CN111211831A (en) * | 2020-01-13 | 2020-05-29 | 东方红卫星移动通信有限公司 | Multi-beam low-orbit satellite intelligent dynamic channel resource allocation method |
CN111313996A (en) * | 2020-03-31 | 2020-06-19 | 四川九强通信科技有限公司 | AP channel allocation and power control joint optimization method based on reinforcement learning |
CN111726811B (en) * | 2020-05-26 | 2023-11-14 | 国网浙江省电力有限公司嘉兴供电公司 | Slice resource allocation method and system for cognitive wireless network |
CN111726811A (en) * | 2020-05-26 | 2020-09-29 | 国网浙江省电力有限公司嘉兴供电公司 | Slice resource allocation method and system for cognitive wireless network |
CN112187074A (en) * | 2020-09-15 | 2021-01-05 | 电子科技大学 | Inverter controller based on deep reinforcement learning |
CN112202672A (en) * | 2020-09-17 | 2021-01-08 | 华中科技大学 | Network route forwarding method and system based on service quality requirement |
CN112202672B (en) * | 2020-09-17 | 2021-07-02 | 华中科技大学 | Network route forwarding method and system based on service quality requirement |
CN112019249A (en) * | 2020-10-22 | 2020-12-01 | 中山大学 | Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning |
CN112383965B (en) * | 2020-11-02 | 2023-04-07 | 哈尔滨工业大学 | Cognitive radio power distribution method based on DRQN and multi-sensor model |
CN112383965A (en) * | 2020-11-02 | 2021-02-19 | 哈尔滨工业大学 | Cognitive radio power distribution method based on DRQN and multi-sensor model |
CN112492686A (en) * | 2020-11-13 | 2021-03-12 | 辽宁工程技术大学 | Cellular network power distribution method based on deep double-Q network |
CN112492686B (en) * | 2020-11-13 | 2023-10-13 | 辽宁工程技术大学 | Cellular network power distribution method based on deep double Q network |
CN112533237A (en) * | 2020-11-16 | 2021-03-19 | 北京科技大学 | Network capacity optimization method for supporting large-scale equipment communication in industrial internet |
CN112492691A (en) * | 2020-11-26 | 2021-03-12 | 辽宁工程技术大学 | Downlink NOMA power distribution method of deep certainty strategy gradient |
CN112492691B (en) * | 2020-11-26 | 2024-03-26 | 辽宁工程技术大学 | Downlink NOMA power distribution method of depth deterministic strategy gradient |
CN112511197A (en) * | 2020-12-01 | 2021-03-16 | 南京工业大学 | Unmanned aerial vehicle auxiliary elastic video multicast method based on deep reinforcement learning |
CN112601284A (en) * | 2020-12-07 | 2021-04-02 | 南京邮电大学 | Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning |
CN112601284B (en) * | 2020-12-07 | 2023-02-28 | 南京邮电大学 | Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning |
CN112991384A (en) * | 2021-01-27 | 2021-06-18 | 西安电子科技大学 | DDPG-based intelligent cognitive management method for emission resources |
CN112991384B (en) * | 2021-01-27 | 2023-04-18 | 西安电子科技大学 | DDPG-based intelligent cognitive management method for emission resources |
CN113093124A (en) * | 2021-04-07 | 2021-07-09 | 哈尔滨工程大学 | DQN algorithm-based real-time allocation method for radar interference resources |
CN113115344B (en) * | 2021-04-19 | 2021-12-14 | 中国人民解放***箭军工程大学 | Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization |
CN113115344A (en) * | 2021-04-19 | 2021-07-13 | 中国人民解放***箭军工程大学 | Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization |
CN113163426A (en) * | 2021-04-25 | 2021-07-23 | 东南大学 | High-density AP distribution scene GCN-DDPG wireless local area network parameter optimization method and system |
CN113115355A (en) * | 2021-04-29 | 2021-07-13 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113473419B (en) * | 2021-05-20 | 2023-07-07 | 南京邮电大学 | Method for accessing machine type communication device into cellular data network based on reinforcement learning |
CN113473419A (en) * | 2021-05-20 | 2021-10-01 | 南京邮电大学 | Method for accessing machine type communication equipment to cellular data network based on reinforcement learning |
CN113453358A (en) * | 2021-06-11 | 2021-09-28 | 南京信息工程大学滨江学院 | Joint resource allocation method of wireless energy-carrying D2D network |
CN113342537A (en) * | 2021-07-05 | 2021-09-03 | 中国传媒大学 | Satellite virtual resource allocation method, device, storage medium and equipment |
CN113342537B (en) * | 2021-07-05 | 2023-11-14 | 中国传媒大学 | Satellite virtual resource allocation method, device, storage medium and equipment |
CN113766661B (en) * | 2021-08-30 | 2023-12-26 | 北京邮电大学 | Interference control method and system for wireless network environment |
CN113766661A (en) * | 2021-08-30 | 2021-12-07 | 北京邮电大学 | Interference control method and system for wireless network environment |
CN113795049B (en) * | 2021-09-15 | 2024-02-02 | 马鞍山学院 | Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning |
CN113795049A (en) * | 2021-09-15 | 2021-12-14 | 马鞍山学院 | Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning |
CN113923605A (en) * | 2021-10-25 | 2022-01-11 | 浙江大学 | Distributed edge learning system and method for industrial internet |
CN113991654B (en) * | 2021-10-28 | 2024-01-23 | 东华大学 | Energy internet hybrid energy system and scheduling method thereof |
CN113991654A (en) * | 2021-10-28 | 2022-01-28 | 东华大学 | Energy internet hybrid energy system and scheduling method thereof |
CN114423070A (en) * | 2022-02-10 | 2022-04-29 | 吉林大学 | D2D-based heterogeneous wireless network power distribution method and system |
CN114423070B (en) * | 2022-02-10 | 2024-03-19 | 吉林大学 | Heterogeneous wireless network power distribution method and system based on D2D |
CN114630299A (en) * | 2022-03-08 | 2022-06-14 | 南京理工大学 | Information age-perceptible resource allocation method based on deep reinforcement learning |
CN114630299B (en) * | 2022-03-08 | 2024-04-23 | 南京理工大学 | Information age perceivable resource allocation method based on deep reinforcement learning |
CN114727316B (en) * | 2022-03-29 | 2023-01-06 | 江南大学 | Internet of things transmission method and device based on depth certainty strategy |
CN114727316A (en) * | 2022-03-29 | 2022-07-08 | 江南大学 | Internet of things transmission method and device based on depth certainty strategy |
CN115002720A (en) * | 2022-06-02 | 2022-09-02 | 中山大学 | Internet of vehicles channel resource optimization method and system based on deep reinforcement learning |
CN116367223A (en) * | 2023-03-30 | 2023-06-30 | 广州爱浦路网络技术有限公司 | XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium |
CN116367223B (en) * | 2023-03-30 | 2024-01-02 | 广州爱浦路网络技术有限公司 | XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium |
CN116739323B (en) * | 2023-08-16 | 2023-11-10 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN116739323A (en) * | 2023-08-16 | 2023-09-12 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
Also Published As
Publication number | Publication date |
---|---|
CN109862610B (en) | 2020-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109862610A (en) | D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm | |
CN109803344B (en) | Unmanned aerial vehicle network topology and routing joint construction method | |
Wilhelmi et al. | Collaborative spatial reuse in wireless networks via selfish multi-armed bandits | |
Li et al. | Incentive mechanisms for device-to-device communications | |
CN110493826A (en) | Heterogeneous cloud radio access network resource allocation method based on deep reinforcement learning | |
CN109474980A (en) | Wireless network resource allocation method based on deep reinforcement learning | |
Zhou et al. | The partial computation offloading strategy based on game theory for multi-user in mobile edge computing environment | |
CN102006658B (en) | Chain game based synergetic transmission method in wireless sensor network | |
CN109729528A (en) | D2D resource allocation method based on multi-agent deep reinforcement learning | |
CN102833759B (en) | Cognitive radio spectrum allocation method enabling OFDM (orthogonal frequency division multiplexing) master user to realize maximum revenue | |
CN102438313B (en) | Communication alliance dispatching method based on CR (cognitive radio) | |
Ji et al. | Power optimization in device-to-device communications: A deep reinforcement learning approach with dynamic reward | |
CN107105455A (en) | User access load balancing method based on self-backhaul awareness | |
CN109819422B (en) | Stackelberg game-based heterogeneous Internet of vehicles multi-mode communication method | |
CN113316154A (en) | Authorized and unauthorized D2D communication resource joint intelligent distribution method | |
Han et al. | Joint resource allocation in underwater acoustic communication networks: A game-based hierarchical adversarial multiplayer multiarmed bandit algorithm | |
CN113795049A (en) | Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning | |
Yan et al. | Self-imitation learning-based inter-cell interference coordination in autonomous HetNets | |
Benamor et al. | Mean field game-theoretic framework for distributed power control in hybrid noma | |
Mohanavel et al. | Deep Reinforcement Learning for Energy Efficient Routing and Throughput Maximization in Various Networks | |
CN114051252A (en) | Multi-user intelligent transmitting power control method in wireless access network | |
CN103957565B (en) | Resource allocation methods based on target SINR in distributed wireless networks | |
Mukherjee et al. | Scalable and fair resource sharing among 5G D2D users and legacy 4G users: A game theoretic approach | |
Balcı et al. | Fairness aware deep reinforcement learning for grant-free NOMA-IoT networks | |
Chen et al. | Enhanced hybrid hierarchical federated edge learning over heterogeneous networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20190607 Assignee: WUHAN JINGLI ELECTRONIC TECHNOLOGY Co.,Ltd. Assignor: HUAZHONG University OF SCIENCE AND TECHNOLOGY Contract record no.: X2022420000134 Denomination of invention: A D2D User Resource Allocation Method Based on Deep Reinforcement Learning DDPG Algorithm Granted publication date: 20200710 License type: Common License Record date: 20221125 |
|
EE01 | Entry into force of recordation of patent licensing contract |