CN116406004A - Construction method and resource management method of wireless network resource allocation system - Google Patents

Construction method and resource management method of wireless network resource allocation system Download PDF

Info

Publication number
CN116406004A
CN116406004A CN202310354794.0A CN202310354794A CN116406004A CN 116406004 A CN116406004 A CN 116406004A CN 202310354794 A CN202310354794 A CN 202310354794A CN 116406004 A CN116406004 A CN 116406004A
Authority
CN
China
Prior art keywords
channel
time slot
power
allocation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310354794.0A
Other languages
Chinese (zh)
Inventor
时宁哲
刘玲
周一青
石晶林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202310354794.0A priority Critical patent/CN116406004A/en
Publication of CN116406004A publication Critical patent/CN116406004A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473Wireless resource allocation based on the type of the allocated resource the resource being transmission power
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/535Allocation or scheduling criteria for wireless resources based on resource usage policies
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a construction method of a wireless network resource allocation system, which is used for obtaining a wireless network resource allocation strategy according to a wireless network state, and comprises the following steps: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the obtained non-convex optimization target to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, training the initial resource allocation system to be converged by taking the obtained non-convex optimization target as a training target and taking the imperfect global channel state information in the step S3 as input and adopting a reinforcement learning mode. The invention adopts more practical CSI (imperfect global channel state information) to train the initial distribution system based on learning, improves the convergence rate of the wireless network resource distribution system and improves the performance of completing the optimization target.

Description

Construction method and resource management method of wireless network resource allocation system
Technical Field
The present invention relates to the field of wireless communications, and more particularly, to the field of wireless communication network resource allocation, and more particularly, to a method for constructing a wireless network resource allocation system, a wireless network resource management method based thereon, and a wireless communication system.
Background
In the prior art, the transmission rate and the network capacity of users in a wireless communication network are improved by increasing the spatial frequency spectrum reuse rate in the wireless communication network and deploying a large number of wireless Access Points (APs) in the wireless communication network. However, in a deployment scenario where wireless access points are dense and irregular, there may be particularly severe co-channel interference (co-channel interference, CCI) in the wireless communication network. In addition, as the number of Base Stations (BS) deployed in a wireless communication network increases, unreasonable allocation of wireless network resources may further increase CCI and reduce communication performance such as spectrum efficiency of the wireless network. Accordingly, there is a need to reduce CCI in wireless networks and improve communication performance such as spectral efficiency of wireless networks by optimizing wireless network resource allocation (e.g., channel allocation policies and power allocation policies).
In the prior art, two main methods for solving the problem of resource allocation in a wireless network are available, namely an optimization algorithm based on model driving and an optimization algorithm based on learning.
The optimization algorithm based on model driving generally assumes perfect global channel state information (channel state information, CSI) to optimize the resource allocation problem, and when the optimization algorithm is applied to an actual wireless communication environment, the optimization algorithm has the disadvantage of excessively high computational complexity, thereby causing larger time delay and higher energy consumption, and is difficult to deploy and apply in practice because of suboptimal performance in solving the resource allocation problem in a wireless network.
Whereas learning-based optimization algorithms are typically based on deep reinforcement learning to achieve optimization. The deep reinforcement learning processes complex and high-dimensional environmental features based on the strong perception capability of the deep learning, and interacts with the environment in combination with the idea of the reinforcement learning to complete the decision process, so the deep reinforcement learning (deep reinforcement learning, DRL) is successfully applied to various fields (unmanned decision, industrial robot control and recommendation system). For the wireless communication field, resource allocation in a wireless communication environment can also be modeled as a dynamic decision process due to the dynamics of the wireless communication environment. Therefore, the method for managing the wireless resources based on the deep reinforcement learning is applied to the wireless resource allocation task and can solve the problems of the traditional wireless resource allocation method. Compared with a resource optimization algorithm based on model driving, the optimization algorithm based on learning can effectively reduce the computational complexity of resource allocation, and is more likely to be deployed and applied in future wireless network architecture. In the prior art, in the technical field of wireless communication, a learning-based optimization algorithm commonly uses perfect CSI to allocate resources in a wireless network. However, due to the error of channel estimation and the objective existence of the delay of channel feedback, the true perfect CSI is difficult to obtain, so that in the task of radio resource management, it is necessary to consider the more practical imperfect CSI in the radio environment, and as can be seen from the researches in references [1] to [8], the optimization based on the imperfect CSI is more practical. However, as mentioned above, the learning-based optimization method in the prior art is generally based on perfect CSI, for example, references [3] to [7] and [9] each design an optimization target based on perfect CSI, and the algorithm converges at a slower speed and has low performance such as spectrum efficiency. Also, as can be seen from the study of references [10] to [12], perfect CSI is difficult to obtain in a practical environment.
In summary, the existing learning-based method is not designed for the imperfect CSI, and the channel estimation error in the actual communication environment exists objectively and cannot be completely eliminated, so that the optimization target effect (i.e., the communication performance) achieved by directly adopting the existing learning-based algorithm in the imperfect CSI environment is poor and the convergence rate of the algorithm is low. Therefore, there is a need for a more efficient DRL architecture that enables resource allocation policy optimization in wireless networks based on imperfect CSI.
Reference is made to:
[1]Y.Teng,M.Liu,F.R.Yu,V.C.M.Leung,M.Song,and Y.Zhang,“Resource allocation for ultra-dense networks:A survey,some research issues and challenges,”IEEE Commun.Surv.Tut.,vol.21,no.3,pp.2134–2168,Jul.–Sep.2019.
[2]L.Liu,Y.Zhou,W.Zhuang,J.Yuan,and L.Tian,“Tractable coverage analysis for hexagonal macrocell-based heterogeneous UDNs with adaptive interference-aware CoMP,”IEEE Trans.Wireless Commun.,vol.18,no.1,pp.503–517,Jan.2019.
[3]Y.Zhang,C.Kang,T.Ma,Y.Teng,and D.Guo,“Power allocation in multi-cell networks using deep reinforcement learning,”in Proc.IEEE 88th Veh.Technol.Conf.(VTC-Fall),2018,pp.1–6.
[4]S.Lahoud,K.Khawam,S.Martin,G.Feng,Z.Liang,and J.Nasreddine,“Energy-efficient joint scheduling and power control in multicell wireless networks,”IEEE J.Sel.Areas Commun.,vol.34,no.12,pp.3409–3426,Dec.2016.
[5]K.Shen and W.Yu,“Fractional programming for communication systems—Part I:Power control and beamforming,”IEEE Trans.Signal Process.,vol.66,no.10,pp.2616—2630,May 2018.
[6]F.Meng,P.Chen,L.Wu,and J.Cheng,“Power allocation in multi-user cellular networks:Deep reinforcement learning approaches,”IEEE Trans.Wireless Commun.,vol.19,no.10,pp.6255–6267,Oct.2020.
[7]J.Tan,Y.-C.Liang,L.Zhang,and G.Feng,“Deep reinforcement learning for joint channel selection and power control in D2D networks,”IEEE Trans.Wireless Commun.,vol.20,no.2,pp.1363–1378,Feb.2021.
[8]Y.Guo,F.Zheng,J.Luo,and X.Wang,“Optimal resource allocation via machine learning in coordinated downlink multi-cell OFDM networks under imperfect CSI,”in Proc.Veh.Technol.Conf.(VTC-Spring),2020,pp.1–6
[9]Y.S.Nasir and D.Guo,“Deep Reinforcement Learning for Joint Spectrum and Power Allocation in Cellular Networks,”2021 IEEE Globecom Workshops(GC Wkshps),2021,pp.1-6.
[10]T.Yoo and A.Goldsmith,“Capacity and power allocation for fading MIMO channels with channel estimation error,”IEEE Trans.Inf.Theory,vol.52,no.5,pp.2203–2214,May 2006.
[11]F.Fang,H.Zhang,J.Cheng,S.Roy,and V.C.M.Leung,“Joint user scheduling and power allocation optimization for energy-efficient NOMA systems with imperfect CSI,”IEEE J.Sel.Areas Commun.,vol.35,no.12,pp.2874–2885,Dec.2017.
[12]X.Wang,F.-C.Zheng,P.Zhu,and X.You,“Energy-efficient resource allocation in coordinated downlink multicell OFDMA systems,”IEEE Trans.Veh.Technol.,vol.65,no.3,pp.1395–1408,Mar.2016.
disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method for constructing a radio network resource allocation system, a radio network resource management method based thereon, and a radio communication system.
The invention aims at realizing the following technical scheme:
according to a first aspect of the present invention, there is provided a method for constructing a radio network resource allocation system for obtaining a radio network resource allocation policy according to a radio network state, the method comprising: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the non-convex optimization target obtained in the step S1 to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, taking the non-convex optimization target in the step S2 as a training target, taking the imperfect global channel state information in the step S3 as input, and training an initial resource allocation system to be converged in a reinforcement learning mode, wherein the initial resource allocation system is a system which is constructed based on an intelligent agent and is used for generating an action set based on a wireless network state, and the action set comprises a channel allocation strategy and a power allocation strategy.
In some embodiments of the invention, the wireless communication requirement is to maximize the spectral efficiency of the wireless network, the non-convex optimization objective with outage probability constraints:
Figure BDA0004162958540000041
wherein,,
Figure BDA0004162958540000042
Figure BDA0004162958540000043
Figure BDA0004162958540000044
Figure BDA0004162958540000045
Figure BDA0004162958540000046
Figure BDA0004162958540000047
Figure BDA0004162958540000048
wherein,,
Figure BDA0004162958540000049
representing the average spectral efficiency of the wireless network in time slot t, K representing the total number of links N representing the total number of sub-channels,/->
Figure BDA00041629585400000410
Representing a set of sub-channel indices,/->
Figure BDA00041629585400000411
Indicating the scheduling spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for>
Figure BDA00041629585400000412
Indicating the maximum spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for the kth link>
Figure BDA00041629585400000413
Representing the estimated small-scale fading component of the kth link selecting the nth sub-channel in time slot t,/>
Figure BDA00041629585400000414
Representing the estimated small-scale fading component +.>
Figure BDA00041629585400000415
Is->
Figure BDA00041629585400000416
Probability of->
Figure BDA00041629585400000417
Indicating the power, p, of the kth link to select the nth sub-channel in time slot t t Indicating all->
Figure BDA00041629585400000418
Power set, +.>
Figure BDA00041629585400000419
Representing the identification value, alpha, of the kth link after the selection of the nth sub-channel in time slot t t Indicating all->
Figure BDA00041629585400000420
Identification value-made set epsilon out Representing the expected outage probability, P max Representing the power threshold of the link, constraint M1 represents the estimated small-scale fading component +.>
Figure BDA00041629585400000421
For the condition that the probability of any one link being interrupted after any one sub-channel is selected in the time slot t is smaller than the expected interruption probability, the constraint M2 indicates that the transmitting power on each link cannot be higher than the power threshold of the link, and the constraints M3 and M4 indicate that each link can only select one sub-channel in each time slot.
In some embodiments of the present invention, in the step S2, the non-convex optimization objective is transformed by means of a parametric transformation to obtain a non-convex optimization objective without the constraint of outage probability:
Figure BDA0004162958540000051
wherein,,
Figure BDA0004162958540000052
Figure BDA0004162958540000053
Figure BDA0004162958540000054
Figure BDA0004162958540000055
Figure BDA0004162958540000056
Figure BDA0004162958540000057
wherein Ω t The wireless network is represented by the average spectrum efficiency in time slot t after parameter transformation.
In some embodiments of the invention, the initial resource allocation system comprises: a channel allocation model for predicting a channel allocation policy of a slot based on imperfect global channel state information of the slot, which is configured as a DQN network, a DDQN network, or a Dueling DQN network; the power allocation model is used to predict a power allocation policy for a slot based on imperfect global channel state information for the slot, which is configured as a DDPG network.
In some embodiments of the invention, the step S4 includes a step S41, a step S42, and a step S43. Wherein, step S41 includes: acquiring imperfect global channel state information of an input time slot and executing the following steps: s411, predicting a channel allocation strategy of an input time slot according to the imperfect global channel state information of the input time slot by a channel allocation model, updating the imperfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot by a power allocation model according to the updated imperfect global channel state information of the input time slot; the channel allocation strategy and the power allocation strategy of the predicted input time slot interact with the wireless network to obtain imperfect global channel state information of the next time slot of the input time slot, the channel allocation strategy of the next time slot of the input time slot is predicted by a channel allocation model according to the imperfect global channel state information of the next time slot of the input time slot, and the channel allocation strategy of the next time slot of the input time slot is updated based on the channel allocation strategy of the next time slot of the input time slot; s412, calculating the frequency spectrum efficiency rewards of the input time slot based on the channel allocation strategy and the power allocation strategy of the input time slot; s413, storing a piece of channel allocation experience into a channel selection playback pool according to the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the input time slot; and storing the experience of power allocation into a power selection playback pool by using the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the updated input time slot. Step S42 includes: the non-perfect global channel state information of the next slot of the last input slot is the non-perfect global channel state information of the new input slot. Step S43 includes: the initial resource allocation system parameters are updated based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence.
In some embodiments of the present invention, in the step S43, updating parameters of the channel allocation model is started when there is a channel allocation experience in the channel selection playback pool; updating parameters of the power allocation model begins when there is a power allocation experience in the power selection playback pool.
In some embodiments of the invention, in said step S43; updating parameters of the channel allocation model for a plurality of times until convergence is achieved after the channel allocation experience in the channel selection playback pool reaches the experience of a preset number, wherein the channel allocation experience is obtained by randomly sampling from the channel selection playback pool during each updating, and the parameters of the channel allocation model are updated in a gradient descent mode based on the sampled channel allocation experience; and updating the parameters of the power distribution model for a plurality of times until convergence is achieved after the power distribution experience in the power selection playback pool reaches the preset number of experiences, wherein a plurality of power distribution experiences are obtained by randomly sampling from the power selection playback pool when each update is performed, and the parameters of the power distribution model are updated in a gradient descent mode based on the sampled power distribution experiences.
In some embodiments of the present invention, in step S41, the non-perfect global channel state information of the input slot includes non-perfect global channel state information of a plurality of links selecting different sub-channels in the input slot:
Figure BDA0004162958540000061
Wherein,,
Figure BDA0004162958540000062
Figure BDA0004162958540000063
wherein,,
Figure BDA0004162958540000064
a state set representing the kth link selecting the nth sub-channel in time slot t,/for>
Figure BDA0004162958540000065
Independent channel gain indicating that the kth link selects the nth sub-channel in time slot t in the presence of channel estimation errors,/>
Figure BDA0004162958540000066
Indicating the channel power of the kth link selecting the nth sub-channel in time slot t,/for>
Figure BDA0004162958540000067
An identification value indicating that the kth link has selected the nth sub-channel in time slot t-1,/for>
Figure BDA0004162958540000068
Indicating the power of the kth link selecting the nth sub-channel in time slot t-1, +.>
Figure BDA0004162958540000069
Representing the spectral efficiency of the kth link corresponding to time slot t-1, < >>
Figure BDA00041629585400000610
Representing the k-th link selecting the estimated small-scale fading component corresponding to the n-th sub-channel in time slot t +.>
Figure BDA00041629585400000611
The ratio to total interference power is the rank value over all channels,
Figure BDA00041629585400000612
indicating that the kth link selects the kth in time slot tCo-channel interference under the sub-channel allocation scheme and the power allocation scheme of the last slot when n sub-channels are used, k' represents other links than k, +.>
Figure BDA0004162958540000071
Representing the variance of the channel estimation error,/and>
Figure BDA0004162958540000072
is a large-scale fading component considering shadow fading and geometric fading,/->
Figure BDA0004162958540000073
Mean value 0, variance ++>
Figure BDA0004162958540000074
Is a complex gaussian distribution of (c).
In some embodiments of the present invention, the spectral efficiency rewards are calculated as follows:
Figure BDA0004162958540000075
Wherein,,
Figure BDA0004162958540000076
Figure BDA0004162958540000077
wherein,,
Figure BDA0004162958540000078
representing the spectral efficiency, epsilon, of the kth link corresponding to the nth sub-channel selected in time slot t out Representing the expected outage probability, +.>
Figure BDA0004162958540000079
Indicating the scheduling spectral efficiency of the kth link in the time slot t for selecting the nth sub-channel, phi is the weight coefficient of the interference,k' represents a link other than k, < ->
Figure BDA00041629585400000710
Indicating the external interference of the kth link selecting the nth sub-channel in time slot t +.>
Figure BDA00041629585400000711
Indicating the spectral efficiency of link k' without kth link interference in the nth sub-channel of time slot t, for>
Figure BDA00041629585400000712
Indicating the spectral efficiency corresponding to the kth' link selecting the nth sub-channel in time slot t.
In some embodiments of the present invention, the DDPG network comprises an Actor network and a Critic network, and the final resource allocation system is: training to a converged DQN network, DDQN network, or a sustaining DQN network and an Actor network.
According to a second aspect of the present invention, there is provided a wireless network resource management method, the method comprising: t1, acquiring a wireless network state of a wireless communication system in a last time slot; t2, based on the wireless network state of the last time slot obtained in the step T1, predicting a resource allocation strategy of the next time by adopting the resource allocation system obtained by the method of the first aspect of the invention; and T3, distributing wireless network resources in the wireless communication system based on the resource distribution strategy of the next moment obtained in the step T2.
According to a third aspect of the present invention there is provided a wireless communication system comprising a plurality of base stations, each base station comprising a radio resource management unit configured to allocate radio network resources in the base station using the method according to the second aspect of the present invention.
Compared with the prior art, the invention has the advantages that: by adopting the non-convex optimization target with the outage probability constraint corresponding to the wireless communication requirement in the non-perfect global channel state information environment as the training target, the channel estimation error in the actual communication environment can be fully considered, namely, the initial resource allocation system based on learning is trained by adopting more practical CSI (non-perfect global channel state information), so that the convergence rate of the wireless network resource allocation system is improved, and the performance of completing the optimization target is improved.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
fig. 1 is a flow chart illustrating a method for constructing a wireless network resource allocation system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an initial distribution system consisting of a Dueling DQN network and a DDPG network for model training and parameter updating architecture according to an embodiment of the present invention;
Fig. 3 is a flowchart illustrating a method for managing wireless network resources according to an embodiment of the present invention;
FIG. 4 is a graph showing the convergence performance of the algorithm proposed in this patent and the four baseline algorithms described above, according to an embodiment of the present invention;
FIG. 5 is a graph showing the relationship between the frequency spectrum efficiency and the variance of the channel estimation error, which can be achieved by the algorithm proposed in the patent and the four baseline algorithms according to the embodiment of the present invention;
fig. 6 is a schematic diagram showing the performance comparison of the algorithm proposed in this patent and the above four baseline algorithms under different numbers of subchannels according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by means of specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As mentioned in the background art, the existing learning-based method is not designed for the imperfect CSI, and the channel estimation error in the actual communication environment is objectively present and cannot be completely eliminated, so that the optimization objective effect achieved by directly adopting the existing learning-based algorithm in the imperfect CSI environment is poor and the convergence rate of the algorithm is low. In order to solve the above problems, the present invention proposes a wireless network resource allocation scheme based on imperfect CSI, starting from the characteristics of imperfect CSI. The invention further converts the non-convex optimization target with the outage probability constraint into the non-convex optimization target without the outage probability constraint and solves the problem by adopting a learning-based method, so as to improve the effect of the optimization target and the convergence speed of an algorithm which can be achieved in the non-perfect CSI environment.
For a better understanding of the present invention, the following detailed description of the invention is made with reference to the drawings and examples.
According to an embodiment of the present invention, there is provided a method for constructing a wireless network resource allocation system for obtaining a wireless network resource allocation policy according to a wireless network state, as shown in fig. 1, the method including: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the non-convex optimization target obtained in the step S1 to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, taking the non-convex optimization target in the step S2 as a training target, taking the imperfect global channel state information in the step S3 as input, and training an initial resource allocation system to be converged in a reinforcement learning mode, wherein the initial resource allocation system is a system which is constructed based on an intelligent agent and is used for generating an action set based on a wireless network state, and the action set comprises a channel allocation strategy and a power allocation strategy. In order to better describe the specific scheme of the invention, the following is a detailed explanation of several aspects of establishing a non-convex optimization target without interruption probability constraint, model training and experimental verification.
1. Establishment of non-convex optimization targets without outage probability constraints
Since the existing learning-based network resource allocation methods are not designed for imperfect CSI, the following will explain in detail the establishment and conversion from the optimization objective of the wireless communication network for better understanding. Wherein, for ease of understanding, the process is described in terms of formula derivation in the embodiments of the present invention.
The invention firstly carries out mathematical description on the wireless network environment under imperfect CSI, and further models the wireless network environment based on the mathematical description of the environment. The wireless communication network comprises a plurality of communication areas, each communication area is provided with a base station and a plurality of users, all users in the plurality of communication areas share a plurality of sub-channels, each base station is positioned in the center of the area, and authorized users are randomly distributed in the communication area; all users and transceivers of the base station are equipped with one antenna and each formed link can only select one subchannel in one time slot. For example, one downlink in a wireless network has a multi-cell multi-user network scenario, where K links are distributed among M cells and share N orthogonal subchannels, where,
Figure BDA0004162958540000091
And->
Figure BDA0004162958540000092
Respectively representing a link index set, a cell index set, and a subchannel index set.
In a wireless communication environment, considering a fully synchronized system with a time interval, the independent channel gain of the kth link to select the nth sub-channel at time slot t can be expressed as:
Figure BDA0004162958540000093
wherein,,
Figure BDA0004162958540000094
represents a large-scale fading component taking into account shadow fading and geometric fading, wherein +.>
Figure BDA0004162958540000095
Is constant in a plurality of time slots; />
Figure BDA0004162958540000096
Representing the estimated small-scale fading component of the kth link selecting the nth sub-channel at time slot t.
In a wireless communication environment, considering normalized bandwidth, in case of perfect CSI, the kth link selects the maximum spectral efficiency of the nth sub-channel at time slot t:
Figure BDA0004162958540000097
wherein,,
Figure BDA0004162958540000098
representing the identification value of the kth link after the selection of the nth sub-channel in time slot t, e.g +.>
Figure BDA0004162958540000099
Indicating that the kth link selects the nth sub-channel in time slot t, otherwise +.>
Figure BDA00041629585400000910
Figure BDA00041629585400000911
Representing the power, sigma, of the kth link selecting the nth sub-channel in time slot t 2 Power representing additive white gaussian noise, +.>
Figure BDA0004162958540000101
Indicating the co-channel interference experienced by the kth link when the nth sub-channel is selected in time slot t.
In a wireless communication environment, due to unavoidable channel estimation errors, in perfect CSI,
Figure BDA0004162958540000102
Assuming a true value, suchIt is assumed that channel estimation errors in the actual communication environment are ignored. Thus, objective estimation of the small-scale fading component is required. Assuming that the base station can perfectly estimate the large-scale fading coefficient because it changes slowly, and the small-scale fading coefficient changes rapidly, so that it cannot be perfectly estimated, then in one embodiment of the present invention, based on the imperfect CSI, the estimated small-scale fading component of the nth sub-channel selected by the kth link at the time slot t is expressed as:
Figure BDA0004162958540000103
wherein,,
Figure BDA0004162958540000104
wherein,,
Figure BDA0004162958540000105
representing the estimated small-scale fading component of the kth link selecting the nth sub-channel at time slot t,
Figure BDA0004162958540000106
representing the error of the estimated small-scale fading component of the nth sub-channel selected by the kth link at time slot t, and each
Figure BDA0004162958540000107
Independent of each other. />
Figure BDA0004162958540000108
Mean value 0, variance ++>
Figure BDA0004162958540000109
Is a complex Gaussian distribution of->
Figure BDA00041629585400001010
Mean value 0, variance ++>
Figure BDA00041629585400001011
Is a complex Gaussian distribution of->
Figure BDA00041629585400001012
Representing the variance of the channel estimation error. It should be noted that the defect of the existing perfect CSI information mainly means that the fading coefficient of small scale cannot be estimated perfectly, see formula (3). Channel estimation value of fading coefficient of small scale due to existence of channel estimation error and other factors +. >
Figure BDA00041629585400001013
Typically not equal to the true value. If the algorithm based on the existing perfect CSI is directly applied to the environment of the imperfect CSI, the method is equivalent to directly adding the estimated value +.>
Figure BDA00041629585400001014
The resource allocation is performed as a true value, and since there is an error (i.e., a channel estimation error) between the estimated value and the true value, if the network resource allocation is performed directly using the algorithm based on the perfect CSI, the effect of improving the transmission performance and the network capacity is general. In practice, there must be a channel estimation error and other factors, so that CSI cannot be perfectly estimated, and therefore, this imperfect CSI factor needs to be considered, and directly using a resource allocation algorithm based on the existing perfect CSI is equivalent to directly replacing the actual value with the estimated value, which may reduce the performance of the algorithm.
After the mathematical description of the above wireless network environment, the following will require modeling of the optimization problem (i.e., wireless communication requirements). The optimization problem needs to include at least maximizing throughput and maximizing spectral efficiency. Since the maximized throughput and the maximized spectral efficiency can be converted from each other by a formula in definition, in the embodiment of the present invention, the maximized spectral efficiency is taken as an example for modeling explanation, and the modeling of the maximized throughput is not described herein in detail.
Due to the non-perfect CSI, the scheduled spectral efficiency may exceed the maximum achievable spectral efficiency defined by shannon capacity formula. Thus, when scheduling spectral efficiencyBeyond the spectral efficiency achievable with imperfect CSI, outage probability is used as an indicator of performance. The scheduling spectral efficiency when the kth link selects the nth sub-channel at time slot t is expressed as
Figure BDA00041629585400001015
The average spectral efficiency of the wireless network at time slot t is given by:
Figure BDA0004162958540000111
further, in the time slot t, under the imperfect CSI, the non-convex optimization objective with outage probability constraint corresponding to the maximum spectrum efficiency of the wireless network is:
Figure BDA0004162958540000112
wherein,,
Figure BDA0004162958540000113
Figure BDA0004162958540000114
Figure BDA0004162958540000115
Figure BDA0004162958540000116
Figure BDA0004162958540000117
Figure BDA0004162958540000118
Figure BDA0004162958540000119
wherein,,
Figure BDA00041629585400001110
representing the average spectral efficiency of the wireless network in time slot t, K representing the total number of links N representing the total number of sub-channels,/->
Figure BDA00041629585400001111
Representing a set of sub-channel indices,/->
Figure BDA00041629585400001112
Indicating the scheduling spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for>
Figure BDA00041629585400001113
Indicating the maximum spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for the kth link>
Figure BDA00041629585400001114
Representing the estimated small-scale fading component of the kth link selecting the nth sub-channel in time slot t,/>
Figure BDA00041629585400001115
Representing the estimated small-scale fading component +.>
Figure BDA00041629585400001116
Is->
Figure BDA00041629585400001117
Probability of->
Figure BDA00041629585400001118
Indicating the power, p, of the kth link to select the nth sub-channel in time slot t t Indicating all->
Figure BDA00041629585400001119
Power set, +.>
Figure BDA00041629585400001120
Representing the identification value, alpha, of the kth link after the selection of the nth sub-channel in time slot t t Indicating all->
Figure BDA00041629585400001121
Identification value-made set epsilon out Representing the expected outage probability, P max Representing the power threshold of the link, constraint M1 represents the estimated small-scale fading component +.>
Figure BDA00041629585400001122
For the condition that the probability of any one link being interrupted after any one sub-channel is selected in the time slot t is smaller than the expected interruption probability, the constraint M2 indicates that the transmitting power on each link cannot be higher than the power threshold of the link, and the constraints M3 and M4 indicate that each link can only select one sub-channel in each time slot.
Since the non-convex optimization objective with outage probability constraints has proven to be an NP-Hard problem (a problem to which all non-deterministic polynomial problems can be reduced within polynomial time complexity) even when the subchannel strategy in the radio network resource is fixed, considering only the power allocation problem, the optimal solution of the non-convex optimization objective with outage probability constraints can be solved directly by mathematical derivation. Aiming at solving the problem, the invention adopts a parameter transformation mode to convert an original optimization target (namely, a non-convex optimization target with interrupt probability constraint) into a non-convex optimization target without interrupt probability constraint (through replacement of constraint conditions and corresponding solving conversion), so that the non-convex optimization target with interrupt probability constraint corresponding to the maximum frequency spectrum efficiency of the wireless network can be solved. The process of converting the original optimization objectives by means of parametric transformations will be described in detail in the following from the two parts of constraint replacement and optimization problem conversion.
In the constraint replacement process, the inventor considers a stricter constraint R1 to replace the outage probability constraint M1, so that the constraint R1 always meets the outage probability constraint M1 in the non-convex optimization target, wherein the constraint R1 is:
Figure BDA0004162958540000121
wherein,,
Figure BDA0004162958540000122
representing the noise and interference signal strength of the kth link under the definition of shannon's formula for selecting the nth sub-channel at time slot t,/>
Figure BDA0004162958540000123
Representing the noise and interference signal strength of the kth link in the actual schedulable selection of the nth sub-channel in time slot t,/->
Figure BDA0004162958540000124
Indicating the useful signal strength of the kth link in the actual schedulable selection of the nth sub-channel in time slot t,/for>
Figure BDA0004162958540000125
Indicating the useful signal strength of the kth link under the definition of shannon's formula to select the nth sub-channel at time slot t. Constraint R1-1 is expressed in +.>
Figure BDA0004162958540000126
For all k and n under the conditions +.>
Figure BDA0004162958540000127
Less than->
Figure BDA0004162958540000128
The probability of (2) cannot be greater than
Figure BDA0004162958540000129
The constraint R1-2 is expressed in +.>
Figure BDA00041629585400001210
For all k and n under the conditions +.>
Figure BDA00041629585400001211
Less than->
Figure BDA00041629585400001212
The probabilities of (2) are equal to +.>
Figure BDA00041629585400001213
The proving process in which the constraint R1 is more stringent than the interrupt constraint M1 will be explained below. The proving process comprises two parts of parameter definition and proving reasoning.
The parameter definition part is as follows: first defined according to shannon's formula
Figure BDA00041629585400001214
Similarly, in the case of imperfect CSI, the scheduling spectral efficiency of the kth link to select the nth sub-channel in time slot t is:
Figure BDA00041629585400001215
Wherein,,
Figure BDA00041629585400001218
indicating that the kth link under actual schedulability selects the channel-to-interference ratio of the nth sub-channel at time slot t.
From equation (2) and equation (7)
Figure BDA00041629585400001216
Then
Figure BDA00041629585400001217
From equation (8)/>
Figure BDA0004162958540000131
The original outage probability constraint M1 may be written as: />
Figure BDA0004162958540000132
From equations (7), (8) into equation (9):
Figure BDA0004162958540000133
from the full probability formula:
Figure BDA0004162958540000134
wherein,,
Figure BDA0004162958540000135
Figure BDA0004162958540000136
wherein Pr (E) 1 ) Is shown in
Figure BDA0004162958540000137
And->
Figure BDA0004162958540000138
Under the condition->
Figure BDA0004162958540000139
Less than->
Figure BDA00041629585400001310
Probability of Pr (E) 2 ) Is indicated at->
Figure BDA00041629585400001311
And->
Figure BDA00041629585400001312
Under the condition->
Figure BDA00041629585400001313
Less than->
Figure BDA00041629585400001314
Is a probability of (2).
The proving and reasoning part is as follows:
wherein the constraint R1-1 in the constraint R1 proves as follows:
will restrict R1-1
Figure BDA00041629585400001315
Replaced by->
Figure BDA00041629585400001316
Then->
Figure BDA00041629585400001317
Figure BDA00041629585400001318
And then get->
Figure BDA00041629585400001319
For Pr (E) 2 ) For the sake of->
Figure BDA00041629585400001320
There is->
Figure BDA00041629585400001321
Must be less than->
Figure BDA00041629585400001322
The constraints R1-2 in the constraint R1 are demonstrated as follows:
from the full probability formula
Figure BDA00041629585400001323
Figure BDA0004162958540000141
Due to Pr (E) 1 ) Less than or equal to 1, there is->
Figure BDA0004162958540000142
Figure BDA0004162958540000143
From the above-described proving process of constraint R1-1 and constraint R1-2 in constraint R1, it can be seen that constraint R1 is a stricter constraint than constraint M1.
In the optimization problem conversion process, the original optimization problem is converted according to a stricter constraint R1. The specific derivation process is explained below from the transformation of constraint R1-1 in constraint R1, the transformation of constraint R1-2, and the transformation of the non-convex optimization objective with the outage probability constraint.
According to the stricter constraint R1-1 mentioned above, it is possible to obtain:
Figure BDA0004162958540000144
from the Markov inequality, we can get from equation (12):
Figure BDA0004162958540000145
let the right side of equation (13) equal
Figure BDA0004162958540000146
Then there are:
Figure BDA0004162958540000147
according to the stricter constraint R1-2 mentioned above, it is possible to obtain:
Figure BDA0004162958540000148
wherein F represents a cumulative distribution function of chi-square distribution, let equation (15) equal to
Figure BDA0004162958540000151
The method can obtain the following steps:
Figure BDA0004162958540000152
wherein F is -1 An inverse Cumulative Distribution Function (CDF) representing the chi-square distribution. Due to
Figure BDA00041629585400001510
Figure BDA0004162958540000153
And is also provided with
Figure BDA0004162958540000154
Substituting these two terms and equation (16) into equation (14) yields: />
Figure BDA0004162958540000155
Thus, it is possible to obtain:
Figure BDA0004162958540000156
equation (18) is equivalent to:
Figure BDA0004162958540000157
thus, the average spectral efficiency of the wireless network after the parameter transformation at time slot t is expressed as:
Figure BDA0004162958540000158
wherein F is -1 An inverse Cumulative Distribution Function (CDF) representing the chi-square distribution.
In summary, the conversion from the non-convex optimization objective with the outage probability constraint to the non-convex optimization objective without the outage probability constraint is:
Figure BDA0004162958540000159
wherein,,
Figure BDA0004162958540000161
Figure BDA0004162958540000162
Figure BDA0004162958540000163
Figure BDA0004162958540000164
Figure BDA0004162958540000165
Figure BDA0004162958540000166
wherein Ω t The wireless network is represented by the average spectrum efficiency in time slot t after parameter transformation. It should be noted that, in the present invention, since the imperfect CSI caused by the channel estimation error is considered in the resource allocation in a more practical scenario, but the imperfect CSI may cause the outage probability, the constraint of having the outage probability in the optimization model cannot be satisfied The original algorithm based on perfect CSI is directly used for solving. Therefore, after the parameter conversion is carried out on the optimization model, a new learning algorithm based on the imperfect CSI is designed aiming at the converted optimization model, and parameters such as the imperfect CSI, channel estimation errors and the like are designed as a part of a state set, so that the deep reinforcement learning network can effectively learn the influence of the imperfect CSI to improve the optimization target effect and the performance of the algorithm which can be achieved by the learning algorithm under the imperfect CSI environment.
2. Model training
The non-convex optimization target with the outage probability constraint is converted through the steps, and the non-convex optimization target (formula 21) without the outage probability constraint is obtained, and still belongs to the NP-Hard problem. Conventional algorithms, such as the solution described in reference [5] mentioned in the background section, require multiple iterations to converge and do not scale well as the number of user links increases. Furthermore, it is very challenging for a centralized controller in a communication system to acquire instantaneous global CSI and send an allocation scheme back to the BS. In order to make the non-convex optimization objective without outage probability constraint resolvable, the joint wireless communication requirement (the present embodiment is described by taking the wireless communication requirement as an example to maximize the spectrum efficiency of the wireless network, but it does not mean that the wireless communication requirement only maximizes the spectrum efficiency) is first decoupled into two sub-problems, namely, a sub-channel selection sub-problem and a power allocation sub-problem. The problem of maximizing the spectral efficiency of the wireless network is then handled with a learning model (initial resource allocation system) that can handle both sub-problems simultaneously to improve the convergence performance of the final resource allocation system and the effectiveness of the optimization objective. The final resource allocation system is obtained by taking a non-convex optimization target as a training target, and adopting a training set formed by non-perfect global channel state information and a resource allocation strategy related to the training target to train the initial resource allocation system.
According to one embodiment of the invention, an initial resource allocation system includes: channel allocation model (also referred to as a first tier network in the present embodiment) and power allocation model (also referred to as a second tier network in the present embodiment)) The method comprises the steps of carrying out a first treatment on the surface of the The channel allocation model is used for predicting the channel allocation strategy of a time slot based on the imperfect global channel state information of the time slot; preferably, the channel allocation model is configured as a DQN network, DDQN network, or a Dueling DQN network; the power allocation model is used for predicting the power allocation strategy of a time slot based on the imperfect global channel state information of the time slot, and is preferably configured as a DDPG network. It should be noted that, the channel allocation sub-problem belongs to a discrete type task, the power allocation sub-problem belongs to a continuous type task, and quantization errors can be avoided from being introduced in the dual-layer learning network architecture formed by the channel allocation model and the power allocation model in the foregoing embodiment. Wherein, for channel allocation, a DQN network, a DDQN network or a Dueling DQN network is adopted to process discrete variable type resources; for power allocation, the channel power is defined by P max A limited continuous scalar (for some algorithms, such as a value-based DQN algorithm, the action space must be limited, the transmit power may be discretized, the discretization of the continuous variable necessarily results in quantization errors), in order to avoid the channel power being discretized, the second layer network in the present invention employs a DDPG network, which includes an Actor network and a Critic network, the Actor network acting to output the allocated power, the Critic network acting to evaluate the action of the Actor network and update parameters in the Actor network; the layer two network, through the Actor network in the DDPG, can output a power allocation policy consisting of deterministic power allocation actions based on the imperfect global channel state information of one slot. Therefore, the optimal sub-channel action can be learned faster by adopting the DQN network, the DDQN network or the lasting DQN network, and the non-convex optimization target which does not contain the outage probability constraint can be obtained by combining the DDPG to process the continuous variable type resource (channel power distribution), so that the method has faster convergence rate and higher frequency spectrum efficiency compared with the conventional algorithm.
According to one embodiment of the present invention, the step S4 includes a step S41, a step S42, and a step S43 in training the initial resource allocation system. Wherein, step S41 includes: acquiring imperfect global channel state information of an input time slot and executing the following steps: s411, predicting a channel allocation strategy of an input time slot according to the imperfect global channel state information of the input time slot by a channel allocation model, updating the imperfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot by a power allocation model according to the updated imperfect global channel state information of the input time slot; the channel allocation strategy and the power allocation strategy of the predicted input time slot interact with the wireless network to obtain imperfect global channel state information of the next time slot of the input time slot, the channel allocation strategy of the next time slot of the input time slot is predicted by a channel allocation model according to the imperfect global channel state information of the next time slot of the input time slot, and the channel allocation strategy of the next time slot of the input time slot is updated based on the channel allocation strategy of the next time slot of the input time slot; s412, calculating the frequency spectrum efficiency rewards of the input time slot based on the channel allocation strategy and the power allocation strategy of the input time slot; s413, storing a piece of channel allocation experience into a channel selection playback pool according to the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the input time slot; and storing the experience of power allocation into a power selection playback pool by using the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the updated input time slot. Step S42 includes: the non-perfect global channel state information of the next slot of the last input slot is the non-perfect global channel state information of the new input slot. Step S43 includes: the initial resource allocation system parameters are updated based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence. It should be noted that, calculating the spectral efficiency rewards of the input time slots based on the channel allocation policy and the power allocation policy of the input time slots, where the spectral efficiency rewards represent the overall contributions of the channel allocation and the power allocation to the optimization objective, may enable the channel allocation model and the power allocation model to share the same rewards function and work cooperatively with the objective of maximizing the spectral efficiency of the wireless network.
In accordance with one embodiment of the present invention, in the step S43, updating parameters of the channel allocation model is started when there is a channel allocation experience in the channel selection playback pool; updating parameters of the power allocation model begins when there is a power allocation experience in the power selection playback pool.
According to one embodiment of the present invention, in the step S43, the parameters of the channel allocation model are updated a plurality of times until convergence is achieved after the channel allocation experience in the channel selection playback pool reaches a preset number of experiences, wherein each update is performed to obtain a plurality of channel allocation experiences by randomly sampling from the channel selection playback pool, and the parameters of the channel allocation model are updated in a gradient descent manner based on the sampled channel allocation experiences; and updating the parameters of the power distribution model for a plurality of times until convergence is achieved after the power distribution experience in the power selection playback pool reaches the preset number of experiences, wherein a plurality of power distribution experiences are obtained by randomly sampling from the power selection playback pool when each update is performed, and the parameters of the power distribution model are updated in a gradient descent mode based on the sampled power distribution experiences. It should be noted that, when the number of experience pieces in the channel selection playback pool reaches the threshold value, the channel allocation experience stored in the channel selection playback pool is stored in the channel selection playback pool in a manner of replacing the channel allocation experience stored first in the current channel selection playback pool (i.e., in a first-in first-out manner); when the number of experience pieces in the power selection playback pool reaches a threshold value, the power allocation experience stored in the power selection playback pool is stored in the power selection playback pool in a manner that replaces the first stored power allocation experience in the current channel selection playback pool (i.e., a first-in-first-out manner). The experience in the channel selection playback pool or the power selection playback pool reaches the preset number and then starts to be used for random sampling, so that the convergence speed of the channel allocation model or the power allocation model can be increased; the threshold value of the experience number is set, so that the hardware requirement of model training can be reduced; the channel selection playback pool or the power selection playback pool adopts a first-in first-out mode to store experiences, so that the newly generated better experience can well replace relatively worse experience, the experience in the channel selection playback pool or the power selection playback pool is in a storage state of optimal experience when being sampled, and the convergence speed of a channel allocation model or a power allocation model is further accelerated.
In order to better train the initial resource allocation system, the imperfect global channel state information of the input time slot is firstly obtained in the process of training the initial resource allocation system. According to one embodiment of the present invention, in training the initial resource allocation system, the non-perfect global channel state information of the input time slot includes non-perfect global channel state information (also referred to as a state set in the present invention) of a plurality of links selecting different sub-channels in the input time slot, wherein the non-perfect global channel state information of one link after selecting the different sub-channels in the input time slot is a state set, and the non-perfect global channel state information of the plurality of links after selecting the different sub-channels in the input time slot is:
Figure BDA0004162958540000191
wherein,,
Figure BDA0004162958540000192
Figure BDA0004162958540000193
wherein,,
Figure BDA0004162958540000194
a state set representing the kth link selecting the nth sub-channel in time slot t,/for>
Figure BDA0004162958540000195
Independent channel gain indicating that the kth link selects the nth sub-channel in time slot t in the presence of channel estimation errors,/>
Figure BDA0004162958540000196
Indicating the channel power of the kth link selecting the nth sub-channel in time slot t,/for>
Figure BDA0004162958540000197
An identification value indicating that the kth link has selected the nth sub-channel in time slot t-1,/for>
Figure BDA0004162958540000198
Indicating the power of the kth link selecting the nth sub-channel in time slot t-1, +. >
Figure BDA0004162958540000199
Representing the spectral efficiency of the kth link corresponding to time slot t-1, < >>
Figure BDA00041629585400001910
Representing the k-th link selecting the estimated small-scale fading component corresponding to the n-th sub-channel in time slot t +.>
Figure BDA00041629585400001911
The ratio to total interference power is the rank value over all channels,
Figure BDA00041629585400001912
indicating that the kth link adopts the subchannel allocation scheme of the last time slot and co-channel interference under the power allocation scheme when the nth subchannel is selected by the time slot t, k' indicating other links different from k>
Figure BDA00041629585400001913
Representing the variance of the channel estimation error,/and>
Figure BDA00041629585400001914
is a large-scale fading component considering shadow fading and geometric fading,/->
Figure BDA00041629585400001915
Mean value 0 and variance 0
Figure BDA00041629585400001916
Is a complex gaussian distribution of (c). According to one embodiment of the invention, when the imperfect global channel state information of the input time slot is a state set, the first layer network is configured as the same number of channel allocation models as the number of links, and the second layer network is configured as the same number of power allocation models as the number of links, wherein one state set is processed by one channel allocation model and one power allocation model. According to one embodiment of the invention, when the imperfect global channel state information of the incoming time slot is a state set, the first layer network is configured as a channel allocation model and the second layer network is configured as a power allocation model, wherein each state set in the state set is processed in turn by one channel allocation model and one power allocation model. The first layer network is configured as the same number of channel allocation models as the number of links and the second layer network is configured as the same number of power allocation models as the number of links, so that the processing speed of the initial resource allocation model can be improved. According to one embodiment of the present invention, when the non-perfect global channel state information of the input slot is a state set in step S411, the non-perfect global channel state information of the input slot is updated based on the predicted channel allocation policy as follows: and selecting a state set corresponding to the channel allocation strategy for executing the prediction from the state set based on the channel allocation strategy predicted by the channel allocation model as the imperfect global channel state information for updating the input time slot.
It should be noted that, the selection of the state set is important for the training effect of the initial resource allocation system, and the selection of the state set should embody the feature of the imperfect CSI, that means that the relevant channel state information must be selected to embody the information of the imperfect CSI as the element in the state set, so as to avoid the unnecessary channel state information in the imperfect global channel state information, and improve the training effect of the initial resource allocation model. The variance of the channel estimation error, the estimated channel gain (independent channel gain), and the ranking value of the ratio of the estimated small-scale fading component corresponding to each link in a certain time slot selection sub-channel to the total interference power on all channels are key features which can best embody the channel state information corresponding to the imperfect CSI.
According to one embodiment of the invention, the spectral efficiency rewards are calculated as follows:
Figure BDA0004162958540000201
wherein,,
Figure BDA0004162958540000202
Figure BDA0004162958540000203
wherein,,
Figure BDA0004162958540000204
representing the spectral efficiency, epsilon, of the kth link corresponding to the nth sub-channel selected in time slot t out Representing the expected outage probability, +.>
Figure BDA0004162958540000205
Indicating the scheduling spectral efficiency of the kth link in time slot t for selecting the nth sub-channel, phi being the weighting coefficient of the interference, k' indicating other links than k, < >>
Figure BDA0004162958540000206
Indicating the external interference of the kth link selecting the nth sub-channel in time slot t +.>
Figure BDA0004162958540000207
Indicating the spectral efficiency of link k' without kth link interference in the nth sub-channel of time slot t, for>
Figure BDA0004162958540000208
Indicating the spectral efficiency corresponding to the kth' link selecting the nth sub-channel in time slot t. It should be noted that by defining the weight coefficient of the interference, the variance of the reward function can be reduced, and preferably, Φ=1.
In order to better explain the parameter updating process of the initial resource allocation system of the present invention, the following description will be given by taking an example of the process of performing parameter updating in a manner of randomly sampling from the channel selection playback pool and the power selection playback pool in the initial resource allocation system composed of the Dueling DQN network and the DDPG network. It should be noted that, selecting the lasting DQN network can distinguish whether the spectrum efficiency gain depends on the sub-channel action taken, or simply because the input state set is better, the channel allocation task can be well completed, so as to further improve the performance of the initial resource allocation system after convergence.
As shown in fig. 2, the initial resource allocation system is composed of a Dueling DQN Network, a DDPG (including an Actor Network) and a Critic Network (Critic Network), wherein the Actor Network and the Critic Network constitute the DDPG Network. It should be noted that, the above-mentioned channel allocation policy corresponds to a channel selection action of each link to a subchannel in each time slot, and the power allocation policy corresponds to a power allocation action of each link to the selected subchannel in each time slot. The process of generating the channel allocation experience and the power allocation experience will be schematically explained in the following in the initial resource allocation system: state set of kth link in time slot t
Figure BDA0004162958540000211
Inputting a Dueling DQN network, the Dueling DQN network predicts a channel selection action of the sub-channel +.>
Figure BDA0004162958540000212
Channel selection action in predicted sub-channel>
Figure BDA0004162958540000213
After that, based on->
Figure BDA0004162958540000214
Update->
Figure BDA0004162958540000215
Get updated state set +.>
Figure BDA0004162958540000216
(i.e.)>
Figure BDA0004162958540000217
Middle->
Figure BDA0004162958540000218
Corresponding state set->
Figure BDA0004162958540000219
) And use->
Figure BDA00041629585400002110
As an input to the DDPG network, an Actor network in the DDPG network is based on +.>
Figure BDA00041629585400002111
Predicting power allocation actions of sub-channels +.>
Figure BDA00041629585400002112
The base station performs two actions in sequence at the beginning of the time slot t (/ -)>
Figure BDA00041629585400002113
And->
Figure BDA00041629585400002114
) To determine its associated sub-channel and the transmit power of this sub-channel, the base station performs in sequence two actions (/ -) >
Figure BDA00041629585400002115
And->
Figure BDA00041629585400002116
) Then, after interacting with the wireless network environment (i.e. ultra-dense network environment under imperfect CSI), a state set of next time slot t+1 is generated>
Figure BDA00041629585400002117
And channel selection actions of sub-channels predicted based on channel allocation model->
Figure BDA00041629585400002118
And channel allocation actions of sub-channels predicted by the power allocation model +.>
Figure BDA00041629585400002119
Calculating spectral efficiency rewards->
Figure BDA00041629585400002120
The input lasting DQN network predicts the selection of sub-channels>
Figure BDA00041629585400002121
Channel selection action in predicted sub-channel>
Figure BDA00041629585400002122
After that, based on->
Figure BDA00041629585400002123
Update->
Figure BDA00041629585400002124
Get updated state set +.>
Figure BDA00041629585400002125
(i.e.)>
Figure BDA00041629585400002127
In (a)
Figure BDA00041629585400002128
Corresponding state set->
Figure BDA00041629585400002129
). Wherein, will->
Figure BDA00041629585400002130
Storing in a channel selection playback pool (shown in FIG. 2) as a channel allocation experience and storing +.>
Figure BDA00041629585400002131
Figure BDA00041629585400002132
The power selection playback pool (shown in fig. 2) is stored as a power allocation experience. Whereby channel allocation experience and power allocation experience are continuously generated during training of the initial resource allocation system.
In the parameter updating process, parameters of the reducing DQN are updated to converge based on experience in the channel selection playback pool and parameters of the DDPG network are updated to converge based on experience in the power selection playback pool. It should be noted that, the Critic network in the DDPG is only used during training, and only the Actor network is needed to perform power allocation during actual deployment, that is, taking an initial resource allocation system formed by the lasting DQN network and the DDPG network as an example, the final resource allocation system is as follows: training to converge on the Dueling DQN network and the Actor network.
Since training the Dueling DQN and DDPG network to converge is a process known to those skilled in the art, specific conditions for convergence are not described herein, and the initial resource allocation system composed of the Dueling DQN network and the DDPG network continues to explain a parameter update process of the initial resource allocation system. Preferably, a parameter updating process of the initial resource allocation system is explained by adopting a mode of randomly sampling a channel selection playback pool to obtain a plurality of channel allocation experiences and calculating a gradient to update parameters of the lasting DQN network based on the sampled channel allocation experiences, and a mode of randomly sampling a power selection playback pool to obtain a plurality of power allocation experiences and calculating a gradient to update parameters of the DDPG network based on the sampled power allocation experiences.
Preferably, the channel allocation experience set B is derived from randomly sampling channel allocation experiences of a channel selection playback pool 1 And calculates the gradient to update the parameters of the lasting DQN network using the following rules:
Figure BDA0004162958540000221
wherein,,
Figure BDA0004162958540000222
Figure BDA0004162958540000223
wherein θ c Representing trainable parameters of hidden layer in a Dueling DQN network, β representing trainable parameters of fully connected layer of value function V, χ representing trainable parameters of fully connected layer of dominance function a, B 1 Represents a set of channel allocation experiences for random sampling, wherein,
Figure BDA0004162958540000224
representation B 1 One of the channel allocation experiences, |b 1 I represents the number of experience bars of the channel allocation experience set, < +.>
Figure BDA0004162958540000225
Representing the target value of the lasting DQN, < >>
Figure BDA0004162958540000226
A set of state information representing time slot t+1, Q'>
Figure BDA0004162958540000227
The Q function value representing time slot t, gamma' represents the discount coefficient in the lasting DQN network,
Figure BDA0004162958540000228
respectively represent Dueling DParameters of the target network in the QN network, +.>
Figure BDA0004162958540000229
Sub-channel selection action representing time slot t+1, < >>
Figure BDA00041629585400002210
The representation is at->
Figure BDA00041629585400002211
Cost function of time->
Figure BDA00041629585400002212
Is indicated in the state->
Figure BDA00041629585400002213
Time-selective action->
Figure BDA00041629585400002214
Dominance function value->
Figure BDA00041629585400002215
Is indicated in the state->
Figure BDA00041629585400002216
Time-selective action->
Figure BDA00041629585400002217
Dominance function value, |A| represents +.>
Figure BDA00041629585400002218
Is a sum of the number of times of calculation.
For updates of the DDPG network, the DDPG network approximates a behavior value function Q (s, a) (Critic network) and an action function u using a neural network θ (s) (Actor network).
Preferably, in order to update the network parameter θ of the Critic network Q A power distribution experience set B is obtained from random sampling power distribution experiences of a power selection playback pool by adopting a time difference (TD, temporal difference) error method 2 And the parameters of the Critic network are updated by calculating the minimum mean square error under the following rules:
Figure BDA0004162958540000231
wherein,,
Figure BDA0004162958540000232
wherein B is 2 Represents a set of randomly sampled power allocation experiences, wherein,
Figure BDA0004162958540000233
represents one power allocation experience in the power allocation experience set,/->
Figure BDA0004162958540000234
The Q function representing Critic network is at input +.>
Figure BDA0004162958540000235
Function value of time->
Figure BDA0004162958540000236
Channel selection action (determined by the Actor network activation function) indicating that the kth link of the Actor network outputs a fixed subchannel over the aging DQN at time slot t>
Figure BDA0004162958540000237
Power allocation action of determination of power of rear output, |b 2 I represents the empirical number of the randomly sampled power allocation empirical set, y' represents the target value of the DDPG network, y represents the discount coefficient in the DDPG network, +.>
Figure BDA0004162958540000238
The Q function representing Critic network is at input +.>
Figure BDA0004162958540000239
Function value at that time.
Based on collected power allocation experience set B 2 And the following rules are used to calculate gradients to update the parameters of the Actor network:
Figure BDA00041629585400002310
wherein,,
Figure BDA00041629585400002311
the Q function representing the Critic network is at the input +.>
Figure BDA00041629585400002312
The function value of the time is calculated,
Figure BDA0004162958540000241
the Q function representing the Critic network is at the input +.>
Figure BDA0004162958540000242
Function value at that time.
It should be noted that, when the initial distribution network is formed by the DQN network or the DDQN network and the DDPG network, the Q function and the update process of the DQN network or the DDQN network are known to those skilled in the art, and the parameter update process of the DQN network or the DDQN network is not described here.
In addition, on the basis of the above-mentioned construction method of the radio network resource allocation system, as shown in fig. 3, the present invention further provides a radio network resource management method, which includes mathematically describing the whole radio network environment by adopting the construction process of the non-convex optimization target with the outage probability constraint in the method for constructing the radio network resource allocation system, forming the non-convex optimization target with the outage probability constraint based on the mathematical description (i.e. the model construction in fig. 3), then converting the non-convex optimization target with the outage probability constraint into the non-convex optimization target without the outage probability constraint (i.e. the parameter transformation in fig. 3), adopting the initial resource allocation system to perform solution training (i.e. the dual-layer network architecture in fig. 3), and when the initial resource allocation system converges, obtaining the resource allocation scheme by using the resource allocation system to allocate the radio network resources in the radio communication system. According to one embodiment of the present invention, a wireless network resource management method includes: t1, acquiring a wireless network state of a wireless communication system in a last time slot; t2, based on the wireless network state of the last time slot obtained in the step T1, predicting a resource allocation strategy of the wireless resource allocation system at the next time by adopting the method for constructing the wireless network resource allocation system; and T3, distributing wireless network resources in the wireless communication system based on the resource distribution strategy of the next moment obtained in the step T2.
The present invention also provides a radio communication system including a plurality of base stations each including a radio resource management unit configured to allocate radio network resources in the base station using the above-described radio network resource management method. The wireless network resource distribution system configured in the wireless resource management unit adopts a centralized training mode in a training stage, namely, a state set in a multi-cell multi-user network scene is selected for training; and in the deployment stage, the trained wireless network resource distribution system obtained by centralized training is distributed to each base station. Thereby improving the effect of wireless communication system for allocating wireless network resources.
3. Experiment verification
In order to better illustrate the technical effects of the invention, the following simulation experiments are used for verification.
First, the simulation parameter setting is introduced: setting the wireless network scene as a downlink with multi-cell multi-user network scene, wherein K links are distributed in M cells and share N orthogonal sub-channels, namely M/K users are arranged in each cell; for cell i, the base station BS is located in the center of cell i and serves M/K users randomly distributed within the cell; the large-scale path loss is 128.1+37.6log 10 (d) Calculation, where d is the distance from the transmitter to the receiverIn kilometers; the upper limit of the SINR (signal to interference plus noise ratio) received by the user is set to be 30dB, and the noise power sigma 2 Setting to-114 dBm, the optimization problem is required to select the maximum spectral efficiency, and the initial distribution network of the present invention employs a Dueling DQN network (i.e., a Dueling DQN network) and a DDPG network, each having three hidden layers and 200, and 100 neurons, respectively. Simulation parameters in addition to the above settings, the detailed setting parameters of the remaining simulation parameters are shown in table 1.
TABLE 1
Simulation parameters Value of Simulation parameters Value of
Cell radius 200m Power threshold of link 38dB
Interrupt probability 0.1 Channel estimation error variance 0.1
Time slot spacing 20ms Weighting coefficients of interference 1
In the simulation experiment, the method for constructing the wireless network resource allocation system (hereinafter referred to as the algorithm) is adopted to carry out three groups of comparison experiments (testing the convergence, generalization capability and spectrum efficiency performance of a training stage and the performance of the spectrum efficiency of the algorithm under different subchannel numbers) together with other baseline algorithms, wherein the baseline algorithms comprise a random algorithm, a random algorithm and an FP algorithm (split planning algorithm) described in a reference document [5] in the background art, a joint learning algorithm (joint learning algorithm) described in a reference document [7] and a distribution learning algorithm described in a reference document [9 ]; specifically, the partial planning algorithm comprises randomly allocated sub-channels and power values and a traditional model-driven algorithm, and has higher computational complexity; the joint learning algorithm adopts a DQN network to optimize two variables, namely a subchannel and power; the distribution learning algorithm optimizes the sub-channels with DQN and optimizes the power with DDPG under perfect CSI.
In the training stage, the training process of the algorithm and the four baseline algorithms comprises 20 epodes, each epode comprises 2000 time slots, namely, the algorithm stops training and parameter updating after a fixed 2000 time steps are carried out in one epode, and new user distribution is set and learning rate and other parameters are reset at the beginning of each epode, so that the algorithm and the four baseline algorithms are converged. The algorithm proposed in this patent and the data corresponding to the above four baseline algorithms will be described in comparison.
In order to test the convergence of the algorithm proposed by the patent and the four baseline algorithms in the training phase, only parameters such as learning rate and the like are reset in each epoode, and the distribution of users is not updated. When the number of users is 25 and the number of base stations and the number of sub-channels is 5, the convergence performance of the algorithm proposed by the present patent and the four baseline algorithms is shown in fig. 4, and it can be seen that the algorithm proposed by the present patent and the four baseline algorithms converge to about 4.0bps/Hz at each epicode base (except for the random algorithm and the split planning algorithm). In the converged algorithm, the convergence speed of the algorithm provided by the patent is faster than that of a reference algorithm (a joint learning algorithm and a distributed learning algorithm) based on learning, in addition, the frequency spectrum efficiency of the four baseline algorithms is lower than that of the algorithm provided by the patent, and when the iteration times are fewer, namely 5000-6000 iterations, the frequency spectrum efficiency of the four baseline algorithms is far lower than that of the scheme. Thus, the algorithm provided by the patent has significant advantages in terms of convergence rate.
In generalization capability and spectral efficiency performance tests, it is not practical to track the environment through frequent online training, since the channel estimation errors in a real dynamic wireless communication scenario are time-varying. Thus, the generalization ability of the algorithm is very important in a constantly changing environment. The training model with the best result when the channel estimation error variance in the above simulation parameter setting is set to 0.01. Tests are carried out under different conditions, and the relation between the frequency spectrum efficiency and the channel estimation error variance, which can be realized by the algorithm provided by the patent and the four baseline algorithms, is shown in fig. 5: the performance of the stochastic algorithm and the fractional programming algorithm deteriorates with increasing channel estimation error; the spectral efficiency of the joint learning algorithm, the distribution learning algorithm and the algorithms proposed by this patent remains almost unchanged with the variation of the channel estimation error. Therefore, the algorithm provided by the patent has stronger generalization capability under different channel estimation errors and can realize higher spectrum efficiency performance.
In the performance test of the spectrum efficiency of the algorithm under different numbers of sub-channels, when the variance of the channel estimation error is 0.1, the performance of the spectrum efficiency of the algorithm provided by the patent and the four baseline algorithms under different numbers of sub-channels is shown in fig. 6: the average spectrum efficiency of each link of the algorithm proposed by the patent and the four baseline algorithms can be gradually increased along with the increase of the number of the sub-channels, however, the algorithm proposed by the patent can obtain higher spectrum efficiency than the four baseline algorithms. That is, the algorithm proposed by the present patent is easier to expand in a multi-cell network and performs better as the number of subchannels increases.
In summary, since the existing learning-based method is not designed for the imperfect CSI, and the channel estimation error in the actual communication environment exists objectively and cannot be completely eliminated, the optimization objective effect (i.e., communication performance) achieved by directly adopting the existing learning-based algorithm in the imperfect CSI environment is poor and the convergence rate of the algorithm is low. The estimated channel gain and the corresponding error (represented by the variance of the channel estimation error) are designed to the state set of the initial resource allocation system, and the corresponding reward (for example, spectrum efficiency reward) is designed in a targeted manner, so that the initial resource allocation system can obtain better gain under the channel estimation error, the convergence rate of the initial resource allocation system is better than that of the baseline algorithm, and the spectrum efficiency performance can be obviously improved, that is, the provided method for constructing the wireless network resource allocation system is more suitable for obtaining the final resource allocation system in the actual dynamic wireless communication environment (imperfect CSI).
It should be noted that, although the steps are described above in a specific order, it is not meant to necessarily be performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order, as long as the required functions are achieved.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical coding devices, punch cards or in-groove structures such as punch cards or grooves having instructions stored thereon, and any suitable combination of the foregoing.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A method for constructing a wireless network resource allocation system for obtaining a wireless network resource allocation policy according to a wireless network state, the method comprising:
s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment;
s2, converting the non-convex optimization target obtained in the step S1 to obtain a non-convex optimization target without interruption probability constraint;
s3, acquiring imperfect global channel state information of the wireless network;
s4, taking the non-convex optimization target in the step S2 as a training target, taking the imperfect global channel state information in the step S3 as input, and training an initial resource allocation system to be converged in a reinforcement learning mode, wherein the initial resource allocation system is a system which is constructed based on an intelligent agent and is used for generating an action set based on a wireless network state, and the action set comprises a channel allocation strategy and a power allocation strategy.
2. The method of claim 1, wherein the wireless communication requirement is to maximize spectral efficiency of a wireless network, the non-convex optimization objective with outage probability constraints:
Figure FDA0004162958530000011
Wherein,,
Figure FDA0004162958530000012
Figure FDA0004162958530000013
Figure FDA0004162958530000014
Figure FDA0004162958530000015
Figure FDA0004162958530000016
Figure FDA0004162958530000017
Figure FDA0004162958530000018
wherein,,
Figure FDA0004162958530000019
represents the average spectral efficiency of the wireless network at time slot t, K represents the total number of links N represents the total number of subchannels,
Figure FDA00041629585300000110
representing a set of sub-channel indices,/->
Figure FDA00041629585300000111
Indicating the scheduling spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for>
Figure FDA00041629585300000112
Indicating the maximum spectral efficiency of the kth link selecting the nth sub-channel in time slot y,/for the kth link>
Figure FDA00041629585300000113
Representing the estimated small-scale fading component of the kth link selecting the nth sub-channel in time slot t,/>
Figure FDA00041629585300000114
Representing the estimated small-scale fading component +.>
Figure FDA00041629585300000115
Is->
Figure FDA00041629585300000116
Probability of->
Figure FDA00041629585300000117
Indicating the power, p, of the kth link to select the nth sub-channel in time slot t t Indicating all->
Figure FDA0004162958530000021
Power set, +.>
Figure FDA0004162958530000022
Representing the identification value, alpha, of the kth link after the selection of the nth sub-channel in time slot t t Indicating all->
Figure FDA00041629585300000211
Identification value-made set epsilon out Representing the expected outage probability, P max Representing the power threshold of the link, constraint M1 represents the estimated small-scale fading component +.>
Figure FDA0004162958530000023
For the condition that the probability of any one link being interrupted after any one sub-channel is selected in the time slot t is smaller than the expected interruption probability, the constraint M2 indicates that the transmitting power on each link cannot be higher than the power threshold of the link, and the constraints M3 and M4 indicate that each link can only select one sub-channel in each time slot.
3. The method according to claim 2, characterized in that in said step S2, the non-convex optimization objective is transformed by means of a parametric transformation to obtain a non-convex optimization objective free of outage probability constraints:
Figure FDA0004162958530000024
wherein,,
Figure FDA0004162958530000025
Figure FDA0004162958530000026
Figure FDA0004162958530000027
Figure FDA0004162958530000028
Figure FDA0004162958530000029
Figure FDA00041629585300000210
wherein Ω t The wireless network is represented by the average spectrum efficiency in time slot t after parameter transformation.
4. A method according to claim 3, wherein the initial resource allocation system comprises:
a channel allocation model for predicting a channel allocation policy of a slot based on imperfect global channel state information of the slot, which is configured as a DQN network, a DDQN network, or a lasting DQN network;
a power allocation model for predicting a power allocation policy for a slot based on imperfect global channel state information for the slot, which is configured as a DDPG network.
5. The method according to claim 4, wherein the step S4 includes:
s41, acquiring imperfect global channel state information of an input time slot and executing the following steps:
s411, predicting a channel allocation strategy of an input time slot according to the imperfect global channel state information of the input time slot by a channel allocation model, updating the imperfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot by a power allocation model according to the updated imperfect global channel state information of the input time slot; the channel allocation strategy and the power allocation strategy of the predicted input time slot are interacted with the wireless network to obtain the imperfect global channel state information of the next time slot of the input time slot,
Predicting a channel allocation strategy of the next time slot of the input time slot according to the imperfect global channel state information of the next time slot of the input time slot by a channel allocation model, and updating the channel allocation strategy of the next time slot of the input time slot based on the channel allocation strategy of the next time slot of the input time slot;
s412, calculating the frequency spectrum efficiency rewards of the input time slot based on the channel allocation strategy and the power allocation strategy of the input time slot;
s413, storing a piece of channel allocation experience into a channel selection playback pool according to the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the input time slot; storing a power allocation experience into a power selection playback pool by using updated imperfect global channel state information of an input time slot, a power allocation strategy of the input time slot, a spectrum efficiency reward of the input time slot and updated imperfect global channel state information of a next time slot of the input time slot;
s42, the imperfect global channel state information of the next time slot of the last input time slot is the imperfect global channel state information of the new input time slot;
S43, updating the initial resource allocation system parameters based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence.
6. The method according to claim 5, wherein in said step S43, updating parameters of the channel allocation model is started when there is a channel allocation experience in the channel selection playback pool; updating parameters of the power allocation model begins when there is a power allocation experience in the power selection playback pool.
7. The method according to claim 5, characterized in that in said step S43;
updating parameters of the channel allocation model for a plurality of times until convergence is achieved after the channel allocation experience in the channel selection playback pool reaches the experience of a preset number, wherein the channel allocation experience is obtained by randomly sampling from the channel selection playback pool during each updating, and the parameters of the channel allocation model are updated in a gradient descent mode based on the sampled channel allocation experience;
and updating the parameters of the power distribution model for a plurality of times until convergence is achieved after the power distribution experience in the power selection playback pool reaches the preset number of experiences, wherein a plurality of power distribution experiences are obtained by randomly sampling from the power selection playback pool when each update is performed, and the parameters of the power distribution model are updated in a gradient descent mode based on the sampled power distribution experiences.
8. The method according to any of claims 4-7, wherein the non-perfect global channel state information of the input slot in step S41 comprises non-perfect global channel state information of a plurality of links selecting different sub-channels in the input slot:
Figure FDA0004162958530000041
wherein,,
Figure FDA0004162958530000042
Figure FDA0004162958530000043
wherein,,
Figure FDA0004162958530000044
a state set representing the kth link selecting the nth sub-channel in time slot t,/for>
Figure FDA0004162958530000045
Independent channel gain indicating that the kth link selects the nth sub-channel in time slot t in the presence of channel estimation errors,/>
Figure FDA0004162958530000046
Indicating the channel power of the kth link selecting the nth sub-channel in time slot t,/for>
Figure FDA0004162958530000047
An identification value indicating that the kth link has selected the nth sub-channel in time slot t-1,/for>
Figure FDA0004162958530000048
Indicating the power of the kth link selecting the nth sub-channel in time slot t-1, +.>
Figure FDA0004162958530000049
Representing the spectral efficiency of the kth link corresponding to time slot t-1, < >>
Figure FDA00041629585300000410
Representing the k-th link selecting the estimated small-scale fading component corresponding to the n-th sub-channel in time slot t +.>
Figure FDA00041629585300000411
The ratio to total interference power is the rank value over all channels,
Figure FDA00041629585300000412
indicating the co-channel interference under the sub-channel allocation scheme and the power allocation scheme of the last time slot when the kth link selects the nth sub-channel in the time slot t, k Representing a link other than k +.>
Figure FDA00041629585300000413
Representing the variance of the channel estimation error,/and >
Figure FDA00041629585300000414
Is a large-scale fading component considering shadow fading and geometric fading,/->
Figure FDA00041629585300000415
Mean value 0, variance ++>
Figure FDA00041629585300000416
Is a complex gaussian distribution of (c).
9. The method of claim 8, wherein the spectral efficiency prize is calculated by:
Figure FDA00041629585300000417
wherein,,
Figure FDA00041629585300000418
Figure FDA00041629585300000419
wherein,,
Figure FDA00041629585300000420
representing the spectral efficiency, epsilon, of the kth link corresponding to the nth sub-channel selected in time slot t out Representing the expected outage probability, +.>
Figure FDA00041629585300000421
Representing the scheduling spectral efficiency of the kth link in time slot t for selecting the nth sub-channel, phi being the weighting coefficient of the interference, k Representing a link other than k +.>
Figure FDA00041629585300000422
Indicating the external interference of the kth link selecting the nth sub-channel in time slot t +.>
Figure FDA00041629585300000423
Link k indicating no kth link interference in the nth sub-channel of time slot t Is used for the spectrum efficiency of the (c) spectrum,
Figure FDA00041629585300000424
represents the kth The link selects the spectral efficiency corresponding to the nth sub-channel in time slot t.
10. The method of claim 9, wherein the DDPG network comprises an Actor network and a Critic network,
the final resource allocation system is as follows: training to a converged DQN network, DDQN network, or a sustaining DQN network and an Actor network.
11. A method of wireless network resource management, the method comprising:
T1, acquiring a wireless network state of a wireless communication system in a last time slot;
t2, based on the wireless network state of the last time slot obtained in the step T1, predicting a resource allocation policy at the next time by using the resource allocation system obtained by the method according to any one of claims 1 to 10;
and T3, distributing wireless network resources in the wireless communication system based on the resource distribution strategy of the next moment obtained in the step T2.
12. A wireless communication system comprising a plurality of base stations, wherein each base station comprises a radio resource management unit configured to allocate radio network resources in the base station using the method as claimed in claim 11.
13. A computer readable storage medium, having stored thereon a computer program executable by a processor to implement the steps of the method of any one of claims 1 to 11.
14. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to perform the steps of the method of any of claims 1-11.
CN202310354794.0A 2023-04-06 2023-04-06 Construction method and resource management method of wireless network resource allocation system Pending CN116406004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310354794.0A CN116406004A (en) 2023-04-06 2023-04-06 Construction method and resource management method of wireless network resource allocation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310354794.0A CN116406004A (en) 2023-04-06 2023-04-06 Construction method and resource management method of wireless network resource allocation system

Publications (1)

Publication Number Publication Date
CN116406004A true CN116406004A (en) 2023-07-07

Family

ID=87013779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310354794.0A Pending CN116406004A (en) 2023-04-06 2023-04-06 Construction method and resource management method of wireless network resource allocation system

Country Status (1)

Country Link
CN (1) CN116406004A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A (en) * 2023-12-26 2024-01-30 长春工业大学 Multi-AGV load balancing and task scheduling method based on lasting DQN algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A (en) * 2023-12-26 2024-01-30 长春工业大学 Multi-AGV load balancing and task scheduling method based on lasting DQN algorithm
CN117474295B (en) * 2023-12-26 2024-04-26 长春工业大学 Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method

Similar Documents

Publication Publication Date Title
Lei et al. Deep reinforcement learning-based spectrum allocation in integrated access and backhaul networks
Ge et al. Deep reinforcement learning for distributed dynamic MISO downlink-beamforming coordination
Wang et al. Learning decentralized wireless resource allocations with graph neural networks
Zhang et al. Deep reinforcement learning for multi-agent power control in heterogeneous networks
Zhao et al. Prediction-based spectrum management in cognitive radio networks
US20220248237A1 (en) Neural network circuit remote electrical tilt antenna infrastructure management based on probability of actions
Paul et al. Machine learning for spectrum information and routing in multihop green cognitive radio networks
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
Ali et al. Deep learning (DL) based joint resource allocation and RRH association in 5G-multi-tier networks
CN116406004A (en) Construction method and resource management method of wireless network resource allocation system
Zhu et al. Learn and pick right nodes to offload
Fan et al. Deep reinforcement learning for energy efficiency optimization in wireless networks
Vishnoi et al. A deep reinforcement learning scheme for sum rate and fairness maximization among d2d pairs underlaying cellular network with noma
Giang et al. Deep Q-learning-based resource allocation for solar-powered users in cognitive radio networks
Yan et al. Deep reinforcement learning based offloading for mobile edge computing with general task graph
Mafuta et al. Decentralized resource allocation-based multiagent deep learning in vehicular network
Yan et al. Self-imitation learning-based inter-cell interference coordination in autonomous HetNets
Zhang et al. Bayesian and multi-armed contextual meta-optimization for efficient wireless radio resource management
Ghasemi et al. Spectrum allocation based on artificial bee colony in cognitive radio networks
Vaishnavi et al. Self organizing networks coordination function between intercell interference coordination and coverage and capacity optimisation using support vector machine
Alajmi et al. An efficient actor critic drl framework for resource allocation in multi-cell downlink noma
Anzaldo et al. Experience replay-based power control for sum-rate maximization in multi-cell networks
Chu et al. Reinforcement learning based multi-access control with energy harvesting
Saied et al. Resource management based on reinforcement learning for D2D communication in cellular networks
Lin et al. Online task offloading in udn: A deep reinforcement learning approach with incomplete information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination