CN109729528A - D2D resource allocation method based on multi-agent deep reinforcement learning - Google Patents
D2D resource allocation method based on multi-agent deep reinforcement learning
- Publication number
- CN109729528A (application CN201910161391.8A)
- Authority
- CN
- China
- Prior art keywords
- communication
- user
- link
- resource allocation
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a D2D resource allocation method based on multi-agent deep reinforcement learning, belonging to the field of wireless communication. First, a heterogeneous network model in which a cellular network and D2D communication share spectrum is built. Based on the interference present in it, the signal-to-interference-plus-noise ratio (SINR) of each D2D receiving user and of each cellular user is established; after the unit-bandwidth transmission rates of the cellular links and D2D links are computed, a D2D resource allocation optimization model for the heterogeneous network is constructed with maximizing system capacity as the objective. For time slot t, a deep reinforcement learning model for each D2D pair is built on the basis of this optimization model. In subsequent time slots, each D2D pair extracts its state feature vector and inputs it into its trained deep reinforcement learning model to obtain its resource allocation scheme. The invention optimizes spectrum allocation and transmit power, maximizes system capacity, and provides a low-complexity resource allocation algorithm.
Description
Technical field
The invention belongs to the field of wireless communication and relates to heterogeneous cellular network systems, in particular to a D2D resource allocation method based on multi-agent deep reinforcement learning.
Background art
The proliferation of intelligent terminals and the explosive growth of mobile Internet traffic place ever stricter demands on the data transmission capability of wireless communication networks. Under current trends, existing cellular networks suffer from problems such as spectrum scarcity and base station overload, and cannot satisfy the transmission demands of future wireless networks.
Device-to-device (D2D) communication allows adjacent users to establish direct links and communicate. Because it improves spectrum efficiency, saves energy, and offloads base station load, it has become a very promising technology for future wireless communication networks. Introducing D2D communication into cellular networks can, on the one hand, save energy and improve the performance of edge users; on the other hand, D2D pairs sharing the spectrum of cellular users can greatly improve spectrum utilization.
However, when D2D communication reuses the spectrum of the cellular network, it causes cross-layer interference to cellular communication links, and the communication quality of cellular users, as primary users of the cellular band, must be guaranteed. Meanwhile, under dense D2D deployment, multiple D2D links reusing the same spectrum cause co-layer interference among themselves. Interference management for coexisting cellular and D2D communication is therefore an urgent problem to be solved. Wireless resource allocation aims to mitigate interference and improve spectrum utilization through reasonable resource assignment, and is an effective way to solve this interference management problem.
Existing research on D2D communication resource allocation in cellular networks can be divided into two classes: centralized and distributed. Centralized methods assume that the base station has instantaneous global channel state information (CSI) and controls the resource allocation of the D2D users, but acquiring global CSI at the base station requires enormous signaling overhead. In future scenarios with massive numbers of wireless devices, the base station can hardly possess instantaneous global information, so centralized algorithms are no longer applicable in device-dense scenarios.
Distributed methods let D2D users select wireless network resources autonomously; existing research is mainly based on game theory and reinforcement learning. Game-theoretic approaches model D2D users as players who compete until a Nash equilibrium is reached, but solving for the Nash equilibrium requires substantial information exchange among users and many iterations to converge. Reinforcement-learning-based resource allocation research is mainly based on Q-learning, e.g. deep Q networks (DQN), where each D2D user is treated as an agent that independently learns a policy for selecting wireless network resources. However, when multiple agents train simultaneously, every agent's policy keeps changing, which makes the training environment non-stationary and training hard to converge. A distributed resource allocation algorithm with good convergence and low complexity is therefore needed to solve the interference management problem of D2D communication in cellular networks.
Summary of the invention
To solve the above problems, the present invention, based on deep reinforcement learning theory, provides a D2D resource allocation method based on multi-agent deep reinforcement learning. It optimizes the spectrum allocation and transmit power of D2D users, maximizes the system capacity of the cellular and D2D communication, and guarantees the communication quality of cellular users.
The specific steps are as follows:
Step 1: Build a heterogeneous network model in which the cellular network and D2D communication share spectrum.
The heterogeneous network model comprises a cellular base station (BS), M cellular downlink users, and N D2D communication pairs.
The m-th cellular user is denoted $C_m$, where $1 \le m \le M$; the n-th D2D pair is denoted $D_n$, where $1 \le n \le N$. The transmitting user and receiving user of D2D pair $D_n$ are denoted $d_n^t$ and $d_n^r$, respectively.
Both the cellular downlink links and the D2D links use orthogonal frequency division multiplexing. Each cellular user occupies one communication resource block (RB), and no two cellular links interfere with each other. One cellular user is allowed to share the same RB with multiple D2D users simultaneously, and each D2D pair selects its RB and transmit power autonomously.
Step 2: Based on the interference present in the heterogeneous network model, establish the signal-to-interference-plus-noise ratio (SINR) of each D2D receiving user and of each cellular user.
The interference comprises three types: 1) interference suffered by a cellular user from the transmitting users of all D2D pairs sharing the same RB; 2) interference from the base station suffered by the receiving user of each D2D pair; 3) interference suffered by the receiving user of each D2D pair from the transmitting users of all other D2D pairs sharing the same RB.
The SINR of the signal from the base station received by cellular user $C_m$ on the k-th communication resource block RB is:

$$\gamma_{m,k}^{C} = \frac{P_B\, g_{B,m}}{\sum_{D_n \in \mathcal{D}_k} P_n^d\, g_{n,m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D pairs sharing the k-th RB; $P_n^d$ is the transmit power of the transmitting user of D2D pair $D_n$; $g_{n,m}$ is the channel gain of the interfering link from the transmitting user $d_n^t$ of D2D pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the additive white Gaussian noise.
The SINR of the received signal of the receiving user of D2D pair $D_n$ on the k-th RB is:

$$\gamma_{n,k}^{D} = \frac{P_n^d\, h_{n}}{P_B\, h_{B,n} + \sum_{D_i \in \mathcal{D}_k,\, i \neq n} P_i^d\, h_{i,n} + N_0}$$

where $h_n$ is the channel gain of the D2D target link from transmitting user $d_n^t$ to receiving user $d_n^r$; $h_{B,n}$ is the channel gain of the interfering link from the base station to receiving user $d_n^r$ when multiple links share the RB; $P_i^d$ is the transmit power of the transmitting user of D2D pair $D_i$; and $h_{i,n}$ is the channel gain of the interfering link from transmitting user $d_i^t$ of D2D pair $D_i$ to receiving user $d_n^r$ when multiple links share the RB.
Step 3: Using the SINR of the cellular users and the SINR of the D2D receiving users, compute the unit-bandwidth transmission rates of the cellular links and the D2D links, respectively.
The unit-bandwidth transmission rate $R_{m,k}^{C}$ of a cellular link is computed as:

$$R_{m,k}^{C} = \log_2\!\left(1 + \gamma_{m,k}^{C}\right)$$

The unit-bandwidth transmission rate $R_{n,k}^{D}$ of a D2D link is computed as:

$$R_{n,k}^{D} = \log_2\!\left(1 + \gamma_{n,k}^{D}\right)$$
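The SINR and rate expressions of steps 2 and 3 can be sketched in a few lines of Python (a minimal illustration; the function names and numeric values are hypothetical, not taken from the patent):

```python
import math

def cellular_sinr(p_b, g_bm, d2d_tx_powers, g_interf, n0):
    """SINR of a cellular user on one RB: desired base-station signal
    over interference from all D2D transmitters sharing the RB plus noise."""
    interference = sum(p * g for p, g in zip(d2d_tx_powers, g_interf))
    return (p_b * g_bm) / (interference + n0)

def d2d_sinr(p_n, h_n, p_b, h_bn, other_powers, other_gains, n0):
    """SINR of a D2D receiving user: desired D2D signal over base-station
    interference plus co-channel D2D interference and noise."""
    interference = p_b * h_bn + sum(p * h for p, h in zip(other_powers, other_gains))
    return (p_n * h_n) / (interference + n0)

def unit_bandwidth_rate(sinr):
    """Shannon rate per unit bandwidth: log2(1 + SINR)."""
    return math.log2(1.0 + sinr)

# Toy numbers (hypothetical): 1 W base station, one interfering D2D pair.
sinr_c = cellular_sinr(1.0, 0.5, [0.1], [0.2], 0.01)
rate_c = unit_bandwidth_rate(sinr_c)
```

These helpers simply evaluate the two SINR formulas and the Shannon rate; channel gains would come from the simulation platform in practice.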
Step 4: Compute the system capacity from the unit-bandwidth transmission rates of the cellular and D2D links, and construct the D2D resource allocation optimization model for the heterogeneous network with maximizing system capacity as the objective.
The optimization model is as follows:

$$\max_{\mathbf{B},\, \mathbf{P}} \;\; \sum_{m=1}^{M} R_{m,k}^{C} + \sum_{n=1}^{N} \sum_{k=1}^{K} b_{n,k}\, R_{n,k}^{D}$$

$$\text{s.t.}\quad \mathrm{C1:}\; \gamma_{m,k}^{C} \ge \gamma_{\min}^{C},\ \forall m; \qquad \mathrm{C2:}\; \sum_{k=1}^{K} b_{n,k} \le 1,\; b_{n,k} \in \{0,1\},\ \forall n; \qquad \mathrm{C3:}\; 0 \le P_n^d \le P_{\max},\ \forall n$$

where $\mathbf{B}_{N \times K} = [b_{n,k}]$ is the RB allocation matrix of the D2D pairs, $b_{n,k}$ is the RB selection parameter of D2D pair $D_n$, and $\mathbf{P} = [P_1^d, \ldots, P_N^d]$ is the power control vector jointly formed by the transmit powers of all D2D pairs.

Constraint C1 states that the SINR of each cellular user must exceed the minimum SINR reception threshold $\gamma_{\min}^{C}$, guaranteeing the communication quality of the cellular users. Constraint C2 characterizes the D2D spectrum allocation constraint: each D2D pair can be allocated at most one RB. Constraint C3 states that the transmit power of the transmitting user of each D2D pair cannot exceed the maximum transmit power threshold $P_{\max}$.
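The objective and the three constraints can be expressed directly in code (a hedged sketch; variable names are chosen for readability and are not from the patent):

```python
def system_capacity(cell_rates, d2d_rates, b):
    """Objective: sum of cellular unit-bandwidth rates plus the rates of
    D2D pairs on the RBs they selected (b[n][k] is the 0/1 RB indicator)."""
    capacity = sum(cell_rates)
    for n, rates_n in enumerate(d2d_rates):
        capacity += sum(b[n][k] * r for k, r in enumerate(rates_n))
    return capacity

def feasible(cell_sinrs, sinr_min, b, d2d_powers, p_max):
    """Check C1 (cellular SINR floor), C2 (at most one RB per D2D pair),
    and C3 (transmit power within [0, p_max])."""
    c1 = all(s >= sinr_min for s in cell_sinrs)
    c2 = all(sum(row) <= 1 for row in b)
    c3 = all(0 <= p <= p_max for p in d2d_powers)
    return c1 and c2 and c3
```

A solver, or the learned policy described below, would search over `b` and `d2d_powers` for the feasible point of largest capacity.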
Step 5: For time slot t, on the basis of the D2D resource allocation optimization model, construct the deep reinforcement learning model of each D2D pair.
The specific construction steps are as follows:
Step 501: For a given D2D pair $D_p$, construct the state feature vector $s_t$ at time slot t. It comprises: the instantaneous channel state information of the D2D communication link; the instantaneous channel state information of the interfering link from the base station to the receiving user of $D_p$; the interference power $I_{t-1}$ received by the receiving user of $D_p$ in the previous time slot t-1; the RBs occupied in slot t-1 by the D2D pairs neighboring $D_p$; and the RBs occupied in slot t-1 by the cellular users neighboring $D_p$.
Step 502: Simultaneously construct the reward function $r_t$ of D2D pair $D_p$ at time slot t, where a negative return $r_n$, $r_n < 0$, is given when the SINR constraint of the cellular user sharing the RB is violated.
Step 503: Use the state feature vectors of the D2D pairs to construct the state features of the multi-agent Markov game model; to optimize the Markov game model, use the reward functions of the D2D pairs to establish the reward function of the multi-agent actor-critic deep reinforcement learning model.
The Markov game model $\Gamma$ of the agents is:

$$\Gamma = \left(\mathcal{S},\, \mathcal{A},\, r_1, \ldots, r_N,\, P,\, \gamma\right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r_j$ is the return value given by the reward function of the j-th D2D pair, $j \in \{1, \ldots, N\}$, $P$ is the state transition probability of the environment, and $\gamma$ is the discount factor.
The learning objective of each D2D pair is to maximize its total discounted return, computed as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^t$ is the t-th power of the discount factor, and $r_t^{j}$ is the return of the reward function of the j-th D2D pair at time slot t.
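The total discounted return can be computed directly from a trace of rewards (a trivial sketch of the formula above):

```python
def total_discounted_return(rewards, gamma):
    """Sum over t of gamma**t * r_t -- the quantity each D2D agent
    learns to maximize over the time horizon."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, with rewards `[1.0, 1.0, 1.0]` and `gamma = 0.5`, the total discounted return is `1 + 0.5 + 0.25 = 1.75`.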
The actor-critic reinforcement learning model consists of an actor and a critic.
During training, the actor's policy is fitted with a deep neural network and updated with the following deterministic policy gradient formula so as to maximize the expected return. Let $\mu = \{\mu_1, \ldots, \mu_N\}$ denote the deterministic policies of all agents and $\theta = \{\theta_1, \ldots, \theta_N\}$ the parameters of those policies. The gradient of the expected return of the j-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,\, a \sim \mathcal{D}}\!\left[\left.\nabla_{\theta_j}\, \mu_j(s_j)\, \nabla_{a_j} Q_j^{\mu}(s, a)\right|_{a_j = \mu_j(s_j)}\right]$$

where $s = \{s_1, \ldots, s_N\}$ contains the state information of all agents, $a = \{a_1, \ldots, a_N\}$ contains the action information of all agents, and $\mathcal{D}$ is the experience replay buffer.
The critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_j^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}}\!\left[\left(Q_j^{\mu}(s_t, a_t) - y_t\right)^2\right], \qquad y_t = r_t^{j} + \gamma\, Q_j^{\mu'}(s_{t+1}, a_{t+1})\big|_{a_{t+1} = \mu'(s_{t+1})}$$

where each sample records the historical data of all agents as a tuple $(s_t, a_t, r_t, s_{t+1})$, and $r_t = \{r_t^{1}, \ldots, r_t^{N}\}$ contains the returns of all agents at time slot t.
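Numerically, the critic update reduces to computing a temporal-difference target and a mean squared error over a mini-batch; a minimal sketch (the neural-network fitting itself is omitted):

```python
def td_target(reward, gamma, next_q):
    """y_t = r_t + gamma * Q'(s_{t+1}, a_{t+1}), with Q' from the target network."""
    return reward + gamma * next_q

def critic_loss(q_values, targets):
    """Mean squared TD error minimized by each agent's centralized critic."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

In a full implementation `q_values` would be the critic network's outputs on a sampled batch and the loss would be minimized by gradient descent.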
Step 504: Use historical communication data to train the deep reinforcement learning model offline, obtaining a model that solves the resource allocation problem of D2D pair $D_p$.
Step 6: In subsequent time slots, each D2D pair extracts its own state feature vector and inputs it into its trained deep reinforcement learning model to obtain its resource allocation scheme.
The resource allocation scheme consists of choosing a suitable communication resource block RB and transmit power.
The present invention has the following advantages:
(1) The D2D resource allocation method based on multi-agent deep reinforcement learning optimizes the spectrum allocation and transmit power of D2D users, maximizing system capacity while guaranteeing the communication quality of cellular users;
(2) The method provides a distributed D2D resource allocation algorithm for heterogeneous cellular networks, significantly reducing the signaling overhead of obtaining global instantaneous channel state information;
(3) The method innovatively introduces a multi-agent reinforcement learning model with centralized training and distributed execution to solve the resource allocation problem of multiple D2D pairs, achieving good training convergence and providing a low-complexity resource allocation algorithm.
Brief description of the drawings
Fig. 1 is a schematic diagram of the heterogeneous network model built by the present invention, in which the cellular network and D2D communication share spectrum;
Fig. 2 is a flowchart of the D2D resource allocation method based on multi-agent deep reinforcement learning of the present invention;
Fig. 3 is a schematic diagram of the deep reinforcement learning model used by the present invention for D2D resource allocation;
Fig. 4 is a diagram of the single-agent actor-critic reinforcement learning model of the present invention;
Fig. 5 is a diagram of the multi-agent actor-critic reinforcement learning model of the present invention;
Fig. 6 compares the cellular user outage probability of the present invention with those of the DQN-based and random D2D resource allocation methods;
Fig. 7 compares the total system capacity of the present invention with those of the DQN-based and random D2D resource allocation methods;
Fig. 8 shows the total reward and system capacity convergence of the present invention;
Fig. 9 shows the total reward and system capacity convergence of the DQN-based D2D resource allocation method.
Specific embodiments
In order that the technical principle of the invention may be more clearly understood, embodiments of the invention are set forth in detail below with reference to the accompanying drawings.
The D2D resource allocation method based on multi-agent deep reinforcement learning (MADRL, Multi-Agent Deep Reinforcement Learning based Device-to-Device Resource Allocation Method) is applied to a heterogeneous network in which a cellular network and D2D communication coexist. First, the SINR and unit-bandwidth rate expressions of the D2D receiving users and of the cellular users are established; then, with maximizing system capacity as the objective, and with the cellular SINR exceeding its minimum SINR threshold, the D2D spectrum allocation constraint, and the transmit power of the D2D transmitting users not exceeding the maximum transmit power threshold as constraints, the D2D resource allocation optimization model for the heterogeneous network is constructed.
According to the optimization model, the state feature vector and reward function of the multi-agent deep reinforcement learning model for D2D resource allocation are constructed. Based on the partially observable Markov game model and actor-critic reinforcement learning theory, a multi-agent actor-critic deep reinforcement learning model for D2D resource allocation is established.
The model is trained offline with historical communication data obtained from a simulation platform.
Then, based on the instantaneous channel state information of the D2D link, the instantaneous channel state information of the interfering link from the base station to the D2D receiving user, the interference power received by the D2D receiving user in the previous time slot, the resource blocks (RBs) occupied by the neighboring D2D links in the previous time slot, and the RBs occupied by the neighboring cellular users in the previous time slot, the resource allocation policy obtained by training chooses a suitable RB and transmit power.
As shown in Fig. 2, the overall procedure comprises five steps: establishing the system model, formulating the optimization problem and building the optimization model, establishing the multi-agent reinforcement learning model, training the model, and executing the algorithm. Establishing the multi-agent reinforcement learning model includes constructing the state features, designing the reward function, and establishing the multi-agent actor-critic reinforcement learning model.
The specific steps are as follows:
Step 1: Build a heterogeneous network model in which the cellular network and D2D communication share spectrum.
As shown in Fig. 1, the heterogeneous network model comprises a cellular base station (BS, Base Station), M cellular downlink users, and N D2D communication pairs.
The m-th cellular user is denoted $C_m$, where $1 \le m \le M$; the n-th D2D pair is denoted $D_n$, where $1 \le n \le N$. The transmitting user and receiving user of D2D pair $D_n$ are denoted $d_n^t$ and $d_n^r$, respectively.
Both the cellular downlink links and the D2D links use orthogonal frequency division multiplexing (OFDM, Orthogonal Frequency Division Multiplexing). Each cellular user occupies one communication resource block RB, and no two cellular links interfere with each other. In the system model, one cellular user is allowed to share the same RB with multiple D2D users simultaneously, and each D2D pair selects its RB and transmit power autonomously.
Step 2: Based on the interference present in the heterogeneous network model, establish the signal-to-interference-plus-noise ratio (SINR, Signal to Interference plus Noise Ratio) of each D2D receiving user and of each cellular user.
The interference comprises three types: 1) interference suffered by a cellular user from the transmitting users of all D2D pairs sharing the same RB; 2) interference from the base station suffered by the receiving user of each D2D pair; 3) interference suffered by the receiving user of each D2D pair from the transmitting users of all other D2D pairs sharing the same RB.
The SINR of the signal from the base station received by cellular user $C_m$ on the k-th communication resource block RB is:

$$\gamma_{m,k}^{C} = \frac{P_B\, g_{B,m}}{\sum_{D_n \in \mathcal{D}_k} P_n^d\, g_{n,m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D pairs sharing the k-th RB; $P_n^d$ is the transmit power of the transmitting user of D2D pair $D_n$; $g_{n,m}$ is the channel gain of the interfering link from the transmitting user $d_n^t$ of D2D pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the additive white Gaussian noise (AWGN, Additive White Gaussian Noise).
The SINR of the received signal of the receiving user of D2D pair $D_n$ on the k-th RB is:

$$\gamma_{n,k}^{D} = \frac{P_n^d\, h_{n}}{P_B\, h_{B,n} + \sum_{D_i \in \mathcal{D}_k,\, i \neq n} P_i^d\, h_{i,n} + N_0}$$

where $h_n$ is the channel gain of the D2D target link from transmitting user $d_n^t$ to receiving user $d_n^r$; $h_{B,n}$ is the channel gain of the interfering link from the base station to receiving user $d_n^r$ when multiple links share the RB; $P_i^d$ is the transmit power of the transmitting user of D2D pair $D_i$; and $h_{i,n}$ is the channel gain of the interfering link from transmitting user $d_i^t$ of D2D pair $D_i$ to receiving user $d_n^r$ when multiple links share the RB.
Step 3: Using the SINR of the cellular users and the SINR of the D2D receiving users, compute the unit-bandwidth transmission rates of the cellular links and the D2D links, respectively.
Based on the Shannon formula, the unit-bandwidth transmission rate $R_{m,k}^{C}$ of a cellular link is computed as:

$$R_{m,k}^{C} = \log_2\!\left(1 + \gamma_{m,k}^{C}\right)$$

The unit-bandwidth transmission rate $R_{n,k}^{D}$ of a D2D link is computed as:

$$R_{n,k}^{D} = \log_2\!\left(1 + \gamma_{n,k}^{D}\right)$$
Step 4: Compute the system capacity from the unit-bandwidth transmission rates of the cellular and D2D links, and construct the D2D resource allocation optimization model for the heterogeneous network with maximizing system capacity as the objective.
Under the premise of guaranteeing the communication quality of the cellular users, the system capacity is maximized by optimizing the RB allocation matrix $\mathbf{B}_{N \times K} = [b_{n,k}]$ of the D2D pairs and the power control vector $\mathbf{P} = [P_1^d, \ldots, P_N^d]$ jointly formed by the transmit powers of all D2D pairs. The optimization model is established as follows:

$$\max_{\mathbf{B},\, \mathbf{P}} \;\; \sum_{m=1}^{M} R_{m,k}^{C} + \sum_{n=1}^{N} \sum_{k=1}^{K} b_{n,k}\, R_{n,k}^{D}$$

$$\text{s.t.}\quad \mathrm{C1:}\; \gamma_{m,k}^{C} \ge \gamma_{\min}^{C},\ \forall m; \qquad \mathrm{C2:}\; \sum_{k=1}^{K} b_{n,k} \le 1,\; b_{n,k} \in \{0,1\},\ \forall n; \qquad \mathrm{C3:}\; 0 \le P_n^d \le P_{\max},\ \forall n$$

where $b_{n,k}$ is the RB selection parameter of D2D pair $D_n$.

Constraint C1 characterizes the SINR constraint of the cellular users: the SINR of each cellular user must exceed the minimum SINR reception threshold $\gamma_{\min}^{C}$, guaranteeing the communication quality of the cellular users. Constraint C2 characterizes the D2D spectrum allocation constraint: each D2D pair can be allocated at most one communication resource block RB. Constraint C3 states that the transmit power of the transmitting user of each D2D pair cannot exceed the maximum transmit power threshold $P_{\max}$.
Step 5: For time slot t, on the basis of the D2D resource allocation optimization model, construct the deep reinforcement learning model of each D2D pair.
The reinforcement learning model for D2D resource allocation is established as shown in Fig. 3. Its principle is as follows: in a time slot t, each D2D pair acts as an agent that observes a state $s_t$ from the state space $\mathcal{S}$, then selects an action $a_t$ from the action space $\mathcal{A}$ according to policy π and the current state, i.e., the RB and transmit power the D2D pair will use. After executing action $a_t$, the D2D pair observes the environment transition to a new state $s_{t+1}$ and obtains a return $r_t$; according to the obtained return $r_t$, the D2D pair adjusts its policy π so as to obtain a higher return. The specific construction steps are as follows:
Step 501: For a given D2D pair $D_p$, construct the state feature vector $s_t$ at time slot t.
The state features observed by each D2D pair comprise the following aspects: the instantaneous channel state information of the D2D communication link; the instantaneous channel state information of the interfering link from the base station to the receiving user of $D_p$; the interference power $I_{t-1}$ received by the receiving user of $D_p$ in the previous time slot t-1; the RBs occupied in slot t-1 by the D2D pairs neighboring $D_p$; and the RBs occupied in slot t-1 by the cellular users neighboring $D_p$.
Step 502: Simultaneously, according to the optimization objective, construct the reward function $r_t$ of D2D pair $D_p$ at time slot t.
The design of the reward function must consider both the minimum reception SINR threshold of the cellular user and the unit-bandwidth rate of the D2D pair. If the cellular user sharing spectrum with the D2D pair satisfies its SINR constraint, a positive return is obtained; otherwise a negative return $r_n$, $r_n < 0$, is obtained. To improve the capacity of the D2D link, the positive return is set to the unit-bandwidth rate of the D2D link. The reward function is therefore:

$$r_t = \begin{cases} R_{n,k}^{D}, & \gamma_{m,k}^{C} \ge \gamma_{\min}^{C} \\ r_n, & \text{otherwise} \end{cases}$$
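This piecewise reward can be written directly in code (a sketch; the default negative constant is illustrative, not from the patent):

```python
def d2d_reward(cell_sinr, sinr_min, d2d_rate, r_neg=-1.0):
    """Reward of step 502: the D2D link's unit-bandwidth rate when the
    cellular user's SINR constraint holds, otherwise a negative return."""
    return d2d_rate if cell_sinr >= sinr_min else r_neg
```

The negative branch penalizes actions that violate the cellular user's SINR floor, steering the agent toward RB/power choices that protect the primary users.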
Step 503: Use the state feature vectors of the D2D pairs to construct the state features of the multi-agent Markov game model; to optimize the Markov game model, use the reward functions of the D2D pairs to establish the reward function of the multi-agent actor-critic deep reinforcement learning model.
Each agent uses an actor-critic reinforcement learning model composed of two parts, the actor and the critic, as shown in Fig. 4; both parts are fitted with deep neural networks. The actor network takes the environment state $s_t$ as input and outputs the action $a_t$, i.e., the selected RB and transmit power. The critic network takes the environment state vector $s_t$ and the selected action $a_t$ as input and outputs the temporal-difference error (TD error, Temporal-Difference error) computed from the Q value; the TD error drives the learning of both networks.
In a heterogeneous cellular network, the resource allocation of multiple D2D pairs is a multi-agent reinforcement learning problem that can be modeled as a partially observable Markov game. The Markov game model $\Gamma$ of the N agents is:

$$\Gamma = \left(\mathcal{S},\, \mathcal{A},\, r_1, \ldots, r_N,\, P,\, \gamma\right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r_j$ is the return of the j-th agent, whose value is the return of the reward function of the j-th D2D pair, $j \in \{1, \ldots, N\}$, $P$ is the state transition probability of the environment, and $\gamma$ is the discount factor.
The learning objective of each agent is to maximize its total discounted return, computed as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^t$ is the t-th power of the discount factor, and $r_t^{j}$ is the return of the reward function of the j-th D2D pair at time slot t.
For the Markov game model, the actor-critic reinforcement learning model is extended to the multi-agent setting to build the multi-agent deep reinforcement learning model, as shown in Fig. 5. During training, the critic part uses historical global information to guide the policy update of the actor part; during execution, each agent uses only the partial environment information it observes and makes action selections with the trained actor policy, realizing centralized training with distributed execution.
During centralized training, the policies of the N agents are denoted $\pi = \{\pi_1, \ldots, \pi_N\}$ with parameters $\theta = \{\theta_1, \ldots, \theta_N\}$. The gradient of the expected return $J(\pi_j)$ of the j-th agent is:

$$\nabla_{\theta_j} J(\pi_j) = \mathbb{E}_{s,\, a \sim \pi}\!\left[\nabla_{\theta_j} \log \pi_j(a_j \mid s_j)\, Q_j^{\pi}(s, a)\right]$$

Here $s = \{s_1, \ldots, s_N\}$ contains the state information of all agents and $a = \{a_1, \ldots, a_N\}$ contains their action information; $Q_j^{\pi}(s, a)$ is a centralized action-value function that takes the state information and actions of all agents as input and outputs the Q value of the j-th agent.
Extending the above description to deterministic policies, consider the deterministic policy $\mu_{\theta_j}$ (abbreviated $\mu_j$), and let $\mu = \{\mu_1, \ldots, \mu_N\}$ denote the deterministic policies of all agents. The gradient of the expected return of the j-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,\, a \sim \mathcal{D}}\!\left[\left.\nabla_{\theta_j}\, \mu_j(s_j)\, \nabla_{a_j} Q_j^{\mu}(s, a)\right|_{a_j = \mu_j(s_j)}\right]$$

Here $\mathcal{D}$ is the experience replay buffer, in which each sample records the historical data of all agents as a tuple $(s_t, a_t, r_t, s_{t+1})$, and $r_t = \{r_t^{1}, \ldots, r_t^{N}\}$ contains the returns of all agents at time slot t. The actor's policy is fitted with a deep neural network; the above gradient formula is the update rule of the actor network, applied by gradient ascent so as to maximize the expected return.
The critic network is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_j^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}}\!\left[\left(Q_j^{\mu}(s_t, a_t) - y_t\right)^2\right]$$

where $y_t = r_t^{j} + \gamma\, Q_j^{\mu'}(s_{t+1}, a_{t+1})\big|_{a_{t+1} = \mu'(s_{t+1})}$ is the temporal-difference target.
Step 504: Use historical communication data to train the deep reinforcement learning model offline, obtaining a model that solves the resource allocation problem of D2D pair $D_p$.
The training steps are as follows:
(1) The communication simulation platform initializes the cell, base station, cellular links, and D2D links;
(2) Initialize the policy models π and parameters θ of all agents; initialize the number of simulation time slots T;
(3) Initialize the simulation time slot t ← 0;
(4) All D2D pairs observe the environment to obtain state information $s_t$, select actions $a_t$ based on $s_t$ and π, and obtain returns $r_t$; t ← t+1;
(5) Store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer $\mathcal{D}$;
(6) Sample a mini-batch from $\mathcal{D}$;
(7) Train with the mini-batch and update the parameters θ of policy π;
(8) Return to step (4) until t = T; training ends;
(9) Return the parameters θ.
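Training steps (1)-(9) can be sketched as a generic replay-buffer loop (a hedged illustration: the environment, action-selection, and update functions are stubs the caller supplies, and all names are hypothetical):

```python
import random
from collections import deque

def train_offline(env_reset, env_step, select_action, update, T, batch_size=32):
    """Roll out T slots, store (s_t, a_t, r_t, s_{t+1}) tuples in the
    experience replay buffer, sample mini-batches, and update the policy."""
    buffer = deque(maxlen=10000)  # experience replay buffer D
    state = env_reset()
    for _ in range(T):
        action = select_action(state)
        next_state, reward = env_step(state, action)
        buffer.append((state, action, reward, next_state))
        if len(buffer) >= batch_size:
            batch = random.sample(list(buffer), batch_size)
            update(batch)  # one gradient step on the actor/critic networks
        state = next_state
    return buffer
```

In the full method, `update` would perform the actor and critic gradient steps of step 503 on the sampled batch.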
Step 6: In subsequent time slots, each D2D pair extracts its own state feature vector and inputs it into its trained deep reinforcement learning model to obtain its resource allocation scheme.
The resource allocation scheme consists of choosing a suitable communication resource block RB and transmit power.
The execution steps are as follows:
(1) the communication simulation platform initializes the cell, the base station, the cellular links and the D2D links;
(2) initialize the policy model π of all agents, import the trained parameters θ into the model π, and initialize the number of simulation time slots T;
(3) initialize the simulation time slot t ← 0;
(4) every D2D pair observes the environment to obtain the state information st and, based on st and π, selects an action at, i.e. an RB and a transmission power; the SINR of the D2D receiving users and the system capacity are recorded;
(5) t ← t+1; the simulation platform updates the environment, and every D2D pair observes the environment to obtain st+1;
(6) return to step (4) until t = T.
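The execution phase of steps (1)–(6) is a pure inference loop: the imported parameters are used to act, and no learning updates are performed. A sketch under the same assumed interfaces as above (names are illustrative):

```python
def execute(env, agents, T=100):
    """Online execution: each D2D pair picks an RB and a transmit power
    with its trained policy; no replay buffer, no parameter updates."""
    sinr_log, capacity_log = [], []       # per-slot statistics from step (4)
    s = env.reset()                       # steps (1)-(3)
    for t in range(T):
        a = [ag.act(s[j]) for j, ag in enumerate(agents)]  # RB + power choice
        s, stats = env.step(a)            # step (5): environment update
        sinr_log.append(stats["sinr"])
        capacity_log.append(stats["capacity"])
    return sinr_log, capacity_log
```

Because each agent acts only on its own local observation, this phase is fully distributed, which is the source of the signaling-overhead reduction claimed below.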
The multi-agent D2D resource allocation method of the present invention is compared with a DQN-based D2D resource allocation method and with a random D2D resource allocation method.
Fig. 6 shows the influence of the three methods on the communication quality of the cellular users, where MADRL denotes the method of the invention, DQN denotes the D2D resource allocation method based on a deep Q network, and Random denotes the randomly assigned D2D resource allocation method. As can be seen from the figure, the proposed MADRL algorithm achieves the lowest cellular-user outage probability for every number of D2D users.
Fig. 7 shows the influence of the three methods on the total system capacity: as the number of D2D pairs grows, the proposed MADRL algorithm achieves the largest system capacity.
Fig. 8 shows the total return function and the system capacity convergence of the invention; Fig. 9 shows the total return function and the system capacity convergence of the DQN-based D2D resource allocation method. Comparing the two, the invention benefits from introducing global information into the training process for centralized training, which makes the training environment more stable and yields better convergence. It can therefore be concluded that MADRL obtains a higher system throughput than Random and DQN while protecting the communication quality of the cellular users, and at the same time converges better than DQN.
In conclusion by implementing a kind of D2D resource allocation method based on multiple agent intensified learning of the present invention, Ke Yi
While protecting cellular subscriber communications quality, maximum system throughput;Compared to centralized algorithm, divide designed by the present invention
Cloth resource allocation algorithm, reduces signaling overheads;Compared to other based on Q study resource allocation algorithm, the present invention set by
The algorithm of meter has better constringency performance.
The above is a preferred embodiment of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.
Claims (4)
1. A D2D resource allocation method based on multi-agent deep reinforcement learning, characterized in that the specific steps comprise:
Step 1: constructing a heterogeneous network model in which a cellular network and D2D communications share spectrum;
the heterogeneous network model comprises a cellular base station BS, M cellular downlink users and N D2D pairs;
the m-th cellular user is denoted Cm, where 1 ≤ m ≤ M; the n-th D2D pair is denoted Dn, where 1 ≤ n ≤ N; each D2D pair Dn consists of a transmitting user and a receiving user;
both the cellular downlink links and the D2D links use orthogonal frequency division multiplexing; each cellular user occupies one communication resource block RB, so that no two cellular links interfere with each other; one cellular user is allowed to share the same RB with multiple D2D users simultaneously, and each D2D user selects its communication resource block RB and its transmission power autonomously;
Step 2: establishing, based on the interference present in the heterogeneous network model, the signal-to-interference-plus-noise ratio SINR of the D2D receiving users and the SINR of the cellular users;
the SINR of the signal that cellular user Cm receives from the base station on the k-th communication resource block RB is

$$\gamma_{m,k}^{C}=\frac{P_B\,g_{B,m}}{\sum_{D_n\in\mathcal{D}_k}P_n\,h_{n,m}+N_0}$$

where PB denotes the fixed transmission power of the base station; gB,m is the channel gain of the downlink target link from the base station to cellular user Cm; Dk denotes the set of all D2D pairs sharing the k-th RB; Pn denotes the transmission power of the transmitting user in D2D pair Dn; hn,m is the channel gain of the interfering link from the transmitting user of D2D pair Dn to cellular user Cm when multiple links share an RB; N0 denotes the power spectral density of the additive white Gaussian noise;
the SINR of the signal that the receiving user of D2D pair Dn receives on the k-th RB is

$$\gamma_{n,k}^{D}=\frac{P_n\,g_{n}}{P_B\,h_{B,n}+\sum_{D_i\in\mathcal{D}_k,\,i\neq n}P_i\,h_{i,n}+N_0}$$

where gn is the channel gain of the D2D target link from the transmitting user to the receiving user of pair Dn; hB,n is the channel gain of the interfering link from the base station to the receiving user of D2D pair Dn when multiple links share an RB; Pi denotes the transmission power of the transmitting user in D2D pair Di; and hi,n is the channel gain of the interfering link from the transmitting user of pair Di to the receiving user of pair Dn when multiple links share an RB;
Step 3: calculating the unit-bandwidth communication rates of the cellular links and the D2D links from the SINR of the cellular users and the SINR of the D2D receiving users;
the unit-bandwidth communication rate of a cellular link is calculated as

$$e_{m}^{C}=\log_2\!\left(1+\gamma_{m,k}^{C}\right)$$

and the unit-bandwidth communication rate of a D2D link is calculated as

$$e_{n}^{D}=\log_2\!\left(1+\gamma_{n,k}^{D}\right)$$
Step 4: computing the system capacity from the unit-bandwidth communication rates of the cellular links and the D2D links, and constructing the D2D resource allocation optimization model in the heterogeneous network with maximization of the system capacity as the optimization objective;
the optimization model is

$$\max_{\mathbf{B},\mathbf{P}}\ \sum_{m=1}^{M}e_{m}^{C}+\sum_{n=1}^{N}\sum_{k}b_{n,k}\,e_{n}^{D}$$
$$\text{s.t.}\quad C1:\ \gamma_{m,k}^{C}\ge\gamma_{\min},\ \forall m;\qquad C2:\ \sum_{k}b_{n,k}\le 1,\ b_{n,k}\in\{0,1\},\ \forall n;\qquad C3:\ 0\le P_n\le P_{\max},\ \forall n$$

where BN×K = [bn,k] is the communication resource block RB allocation matrix of the D2D pairs, bn,k being the RB selection parameter of D2D pair Dn, and P is the power control vector jointly composed of the transmission powers of all D2D pairs;
constraint C1 states that the SINR of every cellular user must exceed the minimum SINR reception threshold of the cellular users, guaranteeing the communication quality of the cellular users; constraint C2 characterizes the D2D spectrum allocation constraint: each D2D pair may be allocated at most one communication resource block RB; constraint C3 states that the transmission power of the transmitting user of each D2D pair may not exceed the maximum transmission power threshold Pmax;
Step 5: for time slot t, constructing the deep reinforcement learning model of each D2D pair on the basis of the D2D resource allocation optimization model;
the specific construction steps are as follows:
Step 501: for a given D2D pair Dp, construct the state feature vector st at time slot t, comprising: the instantaneous channel state information of the D2D communication link; the instantaneous channel state information of the interfering link from the base station to the receiving user of pair Dp; the interference power It−1 received by the receiving user of pair Dp in the previous time slot t−1; the RBs occupied in the previous time slot t−1 by the D2D pairs neighboring Dp; and the RBs occupied in the previous time slot t−1 by the cellular users neighboring Dp;
Step 502: simultaneously construct the reward function rt of D2D pair Dp at time slot t, where rn is a negative reward, rn < 0;
Step 503: construct the state features of the multi-agent Markov game model from the state feature vectors of the D2D pairs, and, to optimize the Markov game model, establish the reward function of the multi-agent actor-critic deep reinforcement learning model from the reward functions of the D2D pairs;
the Markov game model Γ of the agents is

$$\Gamma=\left\langle \mathcal{S},\mathcal{A},P,\{r_j\}_{j=1}^{N},\gamma\right\rangle$$

where S is the state space, A is the action space, rj is the reward value of the reward function of the j-th D2D pair, j ∈ {1, …, N}, P is the state transition probability of the whole environment, and γ is the discount factor;
the learning objective of each D2D pair is to maximize its total discounted return, calculated as

$$R_j=\sum_{t=0}^{T}\gamma^{t}\,r_t^{j}$$

where T is the time horizon, γt is the t-th power of the discount factor, and rtj is the reward value of the reward function of the j-th D2D pair at time slot t;
Step 504: train the deep reinforcement learning model offline using historical communication data, obtaining a model that solves the resource allocation problem of D2D pair Dp;
Step 6: for each D2D pair in subsequent time slots, extract its state feature vector and feed it into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D pair.
2. The D2D resource allocation method based on multi-agent deep reinforcement learning of claim 1, characterized in that the interference of step 2 comprises three types: 1) the interference suffered by a cellular user from the transmitting users of all D2D pairs sharing the same RB; 2) the interference from the base station suffered by the receiving user of each D2D pair; 3) the interference suffered by the receiving user of each D2D pair from the transmitting users of all other D2D pairs sharing the same RB.
3. The D2D resource allocation method based on multi-agent deep reinforcement learning of claim 1, characterized in that the actor-critic reinforcement learning model of step 503 is composed of an actor and a critic;
during training, the policy of the actor is fitted with a deep neural network and is updated with the following deterministic policy gradient, so as to maximize the expected return;
let μ = {μ1, …, μN} denote the deterministic policies of all agents and θ = {θ1, …, θN} the parameters contained in the policies; the gradient of the expected return of the j-th agent is

$$\nabla_{\theta_j}J(\mu_j)=\mathbb{E}_{s,a\sim\mathcal{D}}\left[\nabla_{\theta_j}\mu_j(a_j\mid s_j)\,\nabla_{a_j}Q_j^{\mu}(s,a)\big|_{a_j=\mu_j(s_j)}\right]$$

where s contains the state information of all agents, s = {s1, …, sN}; a contains the action information of all agents, a = {a1, …, aN}; and D is the experience replay buffer;
the critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function Qjμ:

$$\mathcal{L}(\theta_j)=\mathbb{E}_{s_t,a_t,r_t,s_{t+1}}\left[\big(Q_j^{\mu}(s_t,a_t)-y\big)^2\right],\qquad y=r_t^{j}+\gamma\,Q_j^{\mu'}(s_{t+1},a_{t+1})$$

where the experience replay buffer D records the historical data of all agents, each sample being a tuple (st, at, rt, st+1), and rt = {rt1, …, rtN} contains the rewards of all agents at time slot t.
4. The D2D resource allocation method based on multi-agent deep reinforcement learning of claim 1, characterized in that the resource allocation scheme of step 6 consists of choosing a suitable communication resource block RB and a transmission power.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811572168 | 2018-12-21 | ||
CN2018115721684 | 2018-12-21 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109729528A true CN109729528A (en) | 2019-05-07 |
CN109729528B CN109729528B (en) | 2020-08-18 |
Family
ID=66300856
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910161391.8A Active CN109729528B (en) | 2018-12-21 | 2019-03-04 | D2D resource allocation method based on multi-agent deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109729528B (en) |
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110049474A (en) * | 2019-05-17 | 2019-07-23 | 北京邮电大学 | A kind of wireless resource allocation methods, device and base station |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN110267274A (en) * | 2019-05-09 | 2019-09-20 | 广东工业大学 | A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user |
CN110582072A (en) * | 2019-08-16 | 2019-12-17 | 北京邮电大学 | Fuzzy matching-based resource allocation method and device in cellular internet of vehicles |
CN110769514A (en) * | 2019-11-08 | 2020-02-07 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN110784882A (en) * | 2019-10-28 | 2020-02-11 | 南京邮电大学 | Energy acquisition D2D communication resource allocation method based on reinforcement learning |
CN110856268A (en) * | 2019-10-30 | 2020-02-28 | 西安交通大学 | Dynamic multichannel access method for wireless network |
CN111026549A (en) * | 2019-11-28 | 2020-04-17 | 国网甘肃省电力公司电力科学研究院 | Automatic test resource scheduling method for power information communication equipment |
CN111065102A (en) * | 2019-12-16 | 2020-04-24 | 北京理工大学 | Q learning-based 5G multi-system coexistence resource allocation method under unlicensed spectrum |
CN111526592A (en) * | 2020-04-14 | 2020-08-11 | 电子科技大学 | Non-cooperative multi-agent power control method used in wireless interference channel |
CN111556572A (en) * | 2020-04-21 | 2020-08-18 | 北京邮电大学 | Spectrum resource and computing resource joint allocation method based on reinforcement learning |
CN111787624A (en) * | 2020-06-28 | 2020-10-16 | 重庆邮电大学 | Variable dimension resource allocation algorithm based on deep learning in D2D-assisted cellular network |
CN112118632A (en) * | 2020-09-22 | 2020-12-22 | 电子科技大学 | Adaptive power distribution system, method and medium for micro-cell base station |
CN112188505A (en) * | 2019-07-02 | 2021-01-05 | 中兴通讯股份有限公司 | Network optimization method and device |
CN112272353A (en) * | 2020-10-09 | 2021-01-26 | 山西大学 | Device-to-device proximity service method based on reinforcement learning |
CN112383922A (en) * | 2019-07-07 | 2021-02-19 | 东北大学秦皇岛分校 | Deep reinforcement learning frequency spectrum sharing method based on prior experience replay |
CN112533237A (en) * | 2020-11-16 | 2021-03-19 | 北京科技大学 | Network capacity optimization method for supporting large-scale equipment communication in industrial internet |
CN112584347A (en) * | 2020-09-28 | 2021-03-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | UAV heterogeneous network multi-dimensional resource dynamic management method |
CN112752266A (en) * | 2020-12-28 | 2021-05-04 | 中国人民解放军陆军工程大学 | Joint spectrum access and power control method in D2D tactile communication |
CN112822781A (en) * | 2021-01-20 | 2021-05-18 | 重庆邮电大学 | Resource allocation method based on Q learning |
CN113115355A (en) * | 2021-04-29 | 2021-07-13 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113115451A (en) * | 2021-02-23 | 2021-07-13 | 北京邮电大学 | Interference management and resource allocation scheme based on multi-agent deep reinforcement learning |
CN113473419A (en) * | 2021-05-20 | 2021-10-01 | 南京邮电大学 | Method for accessing machine type communication equipment to cellular data network based on reinforcement learning |
CN113543271A (en) * | 2021-06-08 | 2021-10-22 | 西安交通大学 | Effective capacity-oriented resource allocation method and system |
CN113596786A (en) * | 2021-07-26 | 2021-11-02 | 广东电网有限责任公司广州供电局 | Resource allocation grouping optimization method for end-to-end communication |
CN113766661A (en) * | 2021-08-30 | 2021-12-07 | 北京邮电大学 | Interference control method and system for wireless network environment |
CN113810910A (en) * | 2021-09-18 | 2021-12-17 | 大连理工大学 | Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks |
CN113867178A (en) * | 2021-10-26 | 2021-12-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN114245401A (en) * | 2021-11-17 | 2022-03-25 | 航天科工微电子***研究院有限公司 | Multi-channel communication decision method and system |
CN114363938A (en) * | 2021-12-21 | 2022-04-15 | 重庆邮电大学 | Cellular network flow unloading method |
CN114423070A (en) * | 2022-02-10 | 2022-04-29 | 吉林大学 | D2D-based heterogeneous wireless network power distribution method and system |
CN114900827A (en) * | 2022-05-10 | 2022-08-12 | 福州大学 | Covert communication system in D2D heterogeneous cellular network based on deep reinforcement learning |
CN114928549A (en) * | 2022-04-20 | 2022-08-19 | 清华大学 | Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning |
CN115173922A (en) * | 2022-06-30 | 2022-10-11 | 重庆邮电大学 | CMADDQN network-based multi-beam satellite communication system resource allocation method |
CN115442812A (en) * | 2022-11-08 | 2022-12-06 | 湖北工业大学 | Deep reinforcement learning-based Internet of things spectrum allocation optimization method and system |
CN115544899A (en) * | 2022-11-23 | 2022-12-30 | 南京邮电大学 | Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning |
CN115811788A (en) * | 2022-11-23 | 2023-03-17 | 齐齐哈尔大学 | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning |
WO2023054776A1 (en) * | 2021-10-01 | 2023-04-06 | 엘지전자 주식회사 | Method and device for transmitting progressive features for edge inference |
CN116155991A (en) * | 2023-01-30 | 2023-05-23 | 杭州滨电信息技术有限公司 | Edge content caching and recommending method and system based on deep reinforcement learning |
CN116193405A (en) * | 2023-03-03 | 2023-05-30 | 中南大学 | Heterogeneous V2X network data transmission method based on DONA framework |
CN116489683A (en) * | 2023-06-21 | 2023-07-25 | 北京邮电大学 | Method and device for unloading computing tasks in space-sky network and electronic equipment |
CN114900827B (en) * | 2022-05-10 | 2024-05-31 | 福州大学 | Concealed communication system in D2D heterogeneous cellular network based on deep reinforcement learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104995851A (en) * | 2013-03-08 | 2015-10-21 | 英特尔公司 | Distributed power control for d2d communications |
CN108834109A (en) * | 2018-05-03 | 2018-11-16 | 中国人民解放军陆军工程大学 | D2D cooperative relaying Poewr control method based on Q study under full duplex is actively eavesdropped |
2019-03-04: application CN201910161391.8A filed (CN); patent CN109729528B granted, status Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104995851A (en) * | 2013-03-08 | 2015-10-21 | 英特尔公司 | Distributed power control for d2d communications |
CN108834109A (en) * | 2018-05-03 | 2018-11-16 | 中国人民解放军陆军工程大学 | D2D cooperative relaying Poewr control method based on Q study under full duplex is actively eavesdropped |
Non-Patent Citations (4)
Title |
---|
SHIWEN NIE et al.: "Q-Learning Based Power Control Algorithm for D2D Communication", 《IEEE》 *
YING HE et al.: "Secure Social Networks in 5G Systems with Mobile Edge Computing, Caching, and Device-to-Device Communications", 《IEEE》 *
ZHENG LI et al.: "Location-Aware Hypergraph Coloring Based Spectrum Allocation for D2D Communication", 《IEEE》 *
WANG QIAN: "Joint resource allocation and power control algorithm based on Q-learning in D2D communication", 《Journal of Nanjing University》 *
Cited By (72)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110267274A (en) * | 2019-05-09 | 2019-09-20 | 广东工业大学 | A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user |
CN110267274B (en) * | 2019-05-09 | 2022-12-16 | 广东工业大学 | Spectrum sharing method for selecting sensing users according to social credibility among users |
CN110049474A (en) * | 2019-05-17 | 2019-07-23 | 北京邮电大学 | A kind of wireless resource allocation methods, device and base station |
CN110049474B (en) * | 2019-05-17 | 2020-07-17 | 北京邮电大学 | Wireless resource allocation method, device and base station |
CN112188505B (en) * | 2019-07-02 | 2024-05-10 | 中兴通讯股份有限公司 | Network optimization method and device |
CN112188505A (en) * | 2019-07-02 | 2021-01-05 | 中兴通讯股份有限公司 | Network optimization method and device |
CN112383922B (en) * | 2019-07-07 | 2022-09-30 | 东北大学秦皇岛分校 | Deep reinforcement learning frequency spectrum sharing method based on prior experience replay |
CN112383922A (en) * | 2019-07-07 | 2021-02-19 | 东北大学秦皇岛分校 | Deep reinforcement learning frequency spectrum sharing method based on prior experience replay |
CN110267338B (en) * | 2019-07-08 | 2020-05-22 | 西安电子科技大学 | Joint resource allocation and power control method in D2D communication |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN110582072A (en) * | 2019-08-16 | 2019-12-17 | 北京邮电大学 | Fuzzy matching-based resource allocation method and device in cellular internet of vehicles |
CN110784882B (en) * | 2019-10-28 | 2022-06-28 | 南京邮电大学 | Energy acquisition D2D communication resource allocation method based on reinforcement learning |
CN110784882A (en) * | 2019-10-28 | 2020-02-11 | 南京邮电大学 | Energy acquisition D2D communication resource allocation method based on reinforcement learning |
CN110856268B (en) * | 2019-10-30 | 2021-09-07 | 西安交通大学 | Dynamic multichannel access method for wireless network |
CN110856268A (en) * | 2019-10-30 | 2020-02-28 | 西安交通大学 | Dynamic multichannel access method for wireless network |
CN110769514B (en) * | 2019-11-08 | 2023-05-12 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN110769514A (en) * | 2019-11-08 | 2020-02-07 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN111026549A (en) * | 2019-11-28 | 2020-04-17 | 国网甘肃省电力公司电力科学研究院 | Automatic test resource scheduling method for power information communication equipment |
CN111065102A (en) * | 2019-12-16 | 2020-04-24 | 北京理工大学 | Q learning-based 5G multi-system coexistence resource allocation method under unlicensed spectrum |
CN111065102B (en) * | 2019-12-16 | 2022-04-19 | 北京理工大学 | Q learning-based 5G multi-system coexistence resource allocation method under unlicensed spectrum |
CN111526592B (en) * | 2020-04-14 | 2022-04-08 | 电子科技大学 | Non-cooperative multi-agent power control method used in wireless interference channel |
CN111526592A (en) * | 2020-04-14 | 2020-08-11 | 电子科技大学 | Non-cooperative multi-agent power control method used in wireless interference channel |
CN111556572A (en) * | 2020-04-21 | 2020-08-18 | 北京邮电大学 | Spectrum resource and computing resource joint allocation method based on reinforcement learning |
CN111787624B (en) * | 2020-06-28 | 2022-04-26 | 重庆邮电大学 | Variable dimension resource allocation method based on deep learning |
CN111787624A (en) * | 2020-06-28 | 2020-10-16 | 重庆邮电大学 | Variable dimension resource allocation algorithm based on deep learning in D2D-assisted cellular network |
CN112118632A (en) * | 2020-09-22 | 2020-12-22 | 电子科技大学 | Adaptive power distribution system, method and medium for micro-cell base station |
CN112118632B (en) * | 2020-09-22 | 2022-07-29 | 电子科技大学 | Adaptive power distribution system, method and medium for micro-cell base station |
CN112584347A (en) * | 2020-09-28 | 2021-03-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | UAV heterogeneous network multi-dimensional resource dynamic management method |
CN112584347B (en) * | 2020-09-28 | 2022-07-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | UAV heterogeneous network multi-dimensional resource dynamic management method |
CN112272353A (en) * | 2020-10-09 | 2021-01-26 | 山西大学 | Device-to-device proximity service method based on reinforcement learning |
CN112533237B (en) * | 2020-11-16 | 2022-03-04 | 北京科技大学 | Network capacity optimization method for supporting large-scale equipment communication in industrial internet |
CN112533237A (en) * | 2020-11-16 | 2021-03-19 | 北京科技大学 | Network capacity optimization method for supporting large-scale equipment communication in industrial internet |
CN112752266B (en) * | 2020-12-28 | 2022-05-24 | 中国人民解放军陆军工程大学 | Joint spectrum access and power control method in D2D haptic communication |
CN112752266A (en) * | 2020-12-28 | 2021-05-04 | 中国人民解放军陆军工程大学 | Joint spectrum access and power control method in D2D tactile communication |
CN112822781A (en) * | 2021-01-20 | 2021-05-18 | 重庆邮电大学 | Resource allocation method based on Q learning |
CN112822781B (en) * | 2021-01-20 | 2022-04-12 | 重庆邮电大学 | Resource allocation method based on Q learning |
CN113115451A (en) * | 2021-02-23 | 2021-07-13 | 北京邮电大学 | Interference management and resource allocation scheme based on multi-agent deep reinforcement learning |
CN113115355B (en) * | 2021-04-29 | 2022-04-22 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113115355A (en) * | 2021-04-29 | 2021-07-13 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113473419A (en) * | 2021-05-20 | 2021-10-01 | 南京邮电大学 | Method for accessing machine type communication equipment to cellular data network based on reinforcement learning |
CN113473419B (en) * | 2021-05-20 | 2023-07-07 | 南京邮电大学 | Method for accessing machine type communication device into cellular data network based on reinforcement learning |
CN113543271A (en) * | 2021-06-08 | 2021-10-22 | 西安交通大学 | Effective capacity-oriented resource allocation method and system |
CN113596786A (en) * | 2021-07-26 | 2021-11-02 | 广东电网有限责任公司广州供电局 | Resource allocation grouping optimization method for end-to-end communication |
CN113596786B (en) * | 2021-07-26 | 2023-11-14 | 广东电网有限责任公司广州供电局 | Resource allocation grouping optimization method for end-to-end communication |
CN113766661B (en) * | 2021-08-30 | 2023-12-26 | 北京邮电大学 | Interference control method and system for wireless network environment |
CN113766661A (en) * | 2021-08-30 | 2021-12-07 | 北京邮电大学 | Interference control method and system for wireless network environment |
CN113810910A (en) * | 2021-09-18 | 2021-12-17 | 大连理工大学 | Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks |
CN113810910B (en) * | 2021-09-18 | 2022-05-20 | 大连理工大学 | Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks |
WO2023054776A1 (en) * | 2021-10-01 | 2023-04-06 | 엘지전자 주식회사 | Method and device for transmitting progressive features for edge inference |
CN113867178A (en) * | 2021-10-26 | 2021-12-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN113867178B (en) * | 2021-10-26 | 2022-05-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN114245401A (en) * | 2021-11-17 | 2022-03-25 | 航天科工微电子***研究院有限公司 | Multi-channel communication decision method and system |
CN114245401B (en) * | 2021-11-17 | 2023-12-05 | 航天科工微电子***研究院有限公司 | Multi-channel communication decision method and system |
CN114363938B (en) * | 2021-12-21 | 2024-01-26 | 深圳千通科技有限公司 | Cellular network flow unloading method |
CN114363938A (en) * | 2021-12-21 | 2022-04-15 | 重庆邮电大学 | Cellular network flow unloading method |
CN114423070B (en) * | 2022-02-10 | 2024-03-19 | 吉林大学 | Heterogeneous wireless network power distribution method and system based on D2D |
CN114423070A (en) * | 2022-02-10 | 2022-04-29 | 吉林大学 | D2D-based heterogeneous wireless network power distribution method and system |
CN114928549A (en) * | 2022-04-20 | 2022-08-19 | 清华大学 | Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning |
CN114900827A (en) * | 2022-05-10 | 2022-08-12 | 福州大学 | Covert communication system in D2D heterogeneous cellular network based on deep reinforcement learning |
CN114900827B (en) * | 2022-05-10 | 2024-05-31 | 福州大学 | Concealed communication system in D2D heterogeneous cellular network based on deep reinforcement learning |
CN115173922B (en) * | 2022-06-30 | 2024-03-15 | 深圳泓越信息科技有限公司 | Multi-beam satellite communication system resource allocation method based on CMADDQN network |
CN115173922A (en) * | 2022-06-30 | 2022-10-11 | 重庆邮电大学 | CMADDQN network-based multi-beam satellite communication system resource allocation method |
CN115442812B (en) * | 2022-11-08 | 2023-04-07 | 湖北工业大学 | Deep reinforcement learning-based Internet of things spectrum allocation optimization method and system |
CN115442812A (en) * | 2022-11-08 | 2022-12-06 | 湖北工业大学 | Deep reinforcement learning-based Internet of things spectrum allocation optimization method and system |
CN115811788A (en) * | 2022-11-23 | 2023-03-17 | 齐齐哈尔大学 | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning |
CN115544899A (en) * | 2022-11-23 | 2022-12-30 | 南京邮电大学 | Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning |
CN116155991A (en) * | 2023-01-30 | 2023-05-23 | 杭州滨电信息技术有限公司 | Edge content caching and recommending method and system based on deep reinforcement learning |
CN116155991B (en) * | 2023-01-30 | 2023-10-10 | 杭州滨电信息技术有限公司 | Edge content caching and recommending method and system based on deep reinforcement learning |
CN116193405B (en) * | 2023-03-03 | 2023-10-27 | 中南大学 | Heterogeneous V2X network data transmission method based on DONA framework |
CN116193405A (en) * | 2023-03-03 | 2023-05-30 | 中南大学 | Heterogeneous V2X network data transmission method based on DONA framework |
CN116489683B (en) * | 2023-06-21 | 2023-08-18 | 北京邮电大学 | Method and device for unloading computing tasks in space-sky network and electronic equipment |
CN116489683A (en) * | 2023-06-21 | 2023-07-25 | 北京邮电大学 | Method and device for unloading computing tasks in space-sky network and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109729528B (en) | 2020-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109729528A (en) | D2D resource allocation method based on multi-agent deep reinforcement learning | |
Zhang et al. | Incomplete CSI based resource optimization in SWIPT enabled heterogeneous networks: A non-cooperative game theoretic approach | |
CN106358308A (en) | Resource allocation method for reinforcement learning in ultra-dense network | |
CN107613555A (en) | Non-orthogonal multiple accesses honeycomb and terminal direct connection dense network resource management-control method | |
CN107426773A (en) | Towards the distributed resource allocation method and device of efficiency in Wireless Heterogeneous Networks | |
CN104080126B (en) | Cellular network power-economizing method based on coordinated multipoint transmission | |
Wang et al. | Flexible functional split and power control for energy harvesting cloud radio access networks | |
Dong et al. | Energy efficiency optimization and resource allocation of cross-layer broadband wireless communication system | |
CN107613556A (en) | A kind of full duplex D2D interference management methods based on Power Control | |
CN109982437A (en) | A kind of D2D communication spectrum distribution method based on location aware weighted graph | |
Hoffmann et al. | Increasing energy efficiency of massive-MIMO network via base stations switching using reinforcement learning and radio environment maps | |
Wang et al. | Multi-agent reinforcement learning-based user pairing in multi-carrier NOMA systems | |
CN105490794B (en) | The packet-based resource allocation methods of the Femto cell OFDMA double-layer network | |
Jiang et al. | Dynamic user pairing and power allocation for NOMA with deep reinforcement learning | |
CN104640185A (en) | Cell dormancy energy-saving method based on base station cooperation | |
Sun et al. | Distributed power control for device-to-device network using stackelberg game | |
Liu et al. | Spectrum allocation optimization for cognitive radio networks using binary firefly algorithm | |
Eliodorou et al. | User association coalition games with zero-forcing beamforming and NOMA | |
Wang et al. | Resource allocation in multi-cell NOMA systems with multi-agent deep reinforcement learning | |
Xiao et al. | Power allocation for device-to-multi-device enabled HetNets: A deep reinforcement learning approach | |
Vatsikas et al. | A distributed algorithm for wireless resource allocation using coalitions and the nash bargaining solution | |
Li et al. | Distributed power control for two-tier femtocell networks with QoS provisioning based on Q-learning | |
Rauniyar et al. | A reinforcement learning based game theoretic approach for distributed power control in downlink NOMA | |
Liu et al. | Primal–Dual Learning for Cross-Layer Resource Management in Cell-Free Massive MIMO IIoT | |
CN114423070A (en) | D2D-based heterogeneous wireless network power distribution method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |