CN116406004A

CN116406004A - Construction method and resource management method of wireless network resource allocation system

Info

Publication number: CN116406004A
Application number: CN202310354794.0A
Authority: CN
Inventors: 时宁哲; 刘玲; 周一青; 石晶林
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2023-04-06
Filing date: 2023-04-06
Publication date: 2023-07-07

Abstract

The invention provides a construction method of a wireless network resource allocation system, which is used for obtaining a wireless network resource allocation strategy according to a wireless network state, and comprises the following steps: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the obtained non-convex optimization target to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, training the initial resource allocation system to be converged by taking the obtained non-convex optimization target as a training target and taking the imperfect global channel state information in the step S3 as input and adopting a reinforcement learning mode. The invention adopts more practical CSI (imperfect global channel state information) to train the initial distribution system based on learning, improves the convergence rate of the wireless network resource distribution system and improves the performance of completing the optimization target.

Description

Construction method and resource management method of wireless network resource allocation system

Technical Field

The present invention relates to the field of wireless communications, and more particularly, to the field of wireless communication network resource allocation, and more particularly, to a method for constructing a wireless network resource allocation system, a wireless network resource management method based thereon, and a wireless communication system.

Background

In the prior art, the transmission rate and the network capacity of users in a wireless communication network are improved by increasing the spatial frequency spectrum reuse rate in the wireless communication network and deploying a large number of wireless Access Points (APs) in the wireless communication network. However, in a deployment scenario where wireless access points are dense and irregular, there may be particularly severe co-channel interference (co-channel interference, CCI) in the wireless communication network. In addition, as the number of Base Stations (BS) deployed in a wireless communication network increases, unreasonable allocation of wireless network resources may further increase CCI and reduce communication performance such as spectrum efficiency of the wireless network. Accordingly, there is a need to reduce CCI in wireless networks and improve communication performance such as spectral efficiency of wireless networks by optimizing wireless network resource allocation (e.g., channel allocation policies and power allocation policies).

In the prior art, two main methods for solving the problem of resource allocation in a wireless network are available, namely an optimization algorithm based on model driving and an optimization algorithm based on learning.

The optimization algorithm based on model driving generally assumes perfect global channel state information (channel state information, CSI) to optimize the resource allocation problem, and when the optimization algorithm is applied to an actual wireless communication environment, the optimization algorithm has the disadvantage of excessively high computational complexity, thereby causing larger time delay and higher energy consumption, and is difficult to deploy and apply in practice because of suboptimal performance in solving the resource allocation problem in a wireless network.

Whereas learning-based optimization algorithms are typically based on deep reinforcement learning to achieve optimization. The deep reinforcement learning processes complex and high-dimensional environmental features based on the strong perception capability of the deep learning, and interacts with the environment in combination with the idea of the reinforcement learning to complete the decision process, so the deep reinforcement learning (deep reinforcement learning, DRL) is successfully applied to various fields (unmanned decision, industrial robot control and recommendation system). For the wireless communication field, resource allocation in a wireless communication environment can also be modeled as a dynamic decision process due to the dynamics of the wireless communication environment. Therefore, the method for managing the wireless resources based on the deep reinforcement learning is applied to the wireless resource allocation task and can solve the problems of the traditional wireless resource allocation method. Compared with a resource optimization algorithm based on model driving, the optimization algorithm based on learning can effectively reduce the computational complexity of resource allocation, and is more likely to be deployed and applied in future wireless network architecture. In the prior art, in the technical field of wireless communication, a learning-based optimization algorithm commonly uses perfect CSI to allocate resources in a wireless network. However, due to the error of channel estimation and the objective existence of the delay of channel feedback, the true perfect CSI is difficult to obtain, so that in the task of radio resource management, it is necessary to consider the more practical imperfect CSI in the radio environment, and as can be seen from the researches in references [1] to [8], the optimization based on the imperfect CSI is more practical. However, as mentioned above, the learning-based optimization method in the prior art is generally based on perfect CSI, for example, references [3] to [7] and [9] each design an optimization target based on perfect CSI, and the algorithm converges at a slower speed and has low performance such as spectrum efficiency. Also, as can be seen from the study of references [10] to [12], perfect CSI is difficult to obtain in a practical environment.

In summary, the existing learning-based method is not designed for the imperfect CSI, and the channel estimation error in the actual communication environment exists objectively and cannot be completely eliminated, so that the optimization target effect (i.e., the communication performance) achieved by directly adopting the existing learning-based algorithm in the imperfect CSI environment is poor and the convergence rate of the algorithm is low. Therefore, there is a need for a more efficient DRL architecture that enables resource allocation policy optimization in wireless networks based on imperfect CSI.

Reference is made to:

[1]Y.Teng,M.Liu,F.R.Yu,V.C.M.Leung,M.Song,and Y.Zhang,“Resource allocation for ultra-dense networks:A survey,some research issues and challenges,”IEEE Commun.Surv.Tut.,vol.21,no.3,pp.2134–2168,Jul.–Sep.2019.

[2]L.Liu,Y.Zhou,W.Zhuang,J.Yuan,and L.Tian,“Tractable coverage analysis for hexagonal macrocell-based heterogeneous UDNs with adaptive interference-aware CoMP,”IEEE Trans.Wireless Commun.,vol.18,no.1,pp.503–517,Jan.2019.

[3]Y.Zhang,C.Kang,T.Ma,Y.Teng,and D.Guo,“Power allocation in multi-cell networks using deep reinforcement learning,”in Proc.IEEE 88th Veh.Technol.Conf.(VTC-Fall),2018,pp.1–6.

[4]S.Lahoud,K.Khawam,S.Martin,G.Feng,Z.Liang,and J.Nasreddine,“Energy-efficient joint scheduling and power control in multicell wireless networks,”IEEE J.Sel.Areas Commun.,vol.34,no.12,pp.3409–3426,Dec.2016.

[5]K.Shen and W.Yu,“Fractional programming for communication systems—Part I:Power control and beamforming,”IEEE Trans.Signal Process.,vol.66,no.10,pp.2616—2630,May 2018.

[6]F.Meng,P.Chen,L.Wu,and J.Cheng,“Power allocation in multi-user cellular networks:Deep reinforcement learning approaches,”IEEE Trans.Wireless Commun.,vol.19,no.10,pp.6255–6267,Oct.2020.

[7]J.Tan,Y.-C.Liang,L.Zhang,and G.Feng,“Deep reinforcement learning for joint channel selection and power control in D2D networks,”IEEE Trans.Wireless Commun.,vol.20,no.2,pp.1363–1378,Feb.2021.

[8]Y.Guo,F.Zheng,J.Luo,and X.Wang,“Optimal resource allocation via machine learning in coordinated downlink multi-cell OFDM networks under imperfect CSI,”in Proc.Veh.Technol.Conf.(VTC-Spring),2020,pp.1–6

[9]Y.S.Nasir and D.Guo,“Deep Reinforcement Learning for Joint Spectrum and Power Allocation in Cellular Networks,”2021 IEEE Globecom Workshops(GC Wkshps),2021,pp.1-6.

[10]T.Yoo and A.Goldsmith,“Capacity and power allocation for fading MIMO channels with channel estimation error,”IEEE Trans.Inf.Theory,vol.52,no.5,pp.2203–2214,May 2006.

[11]F.Fang,H.Zhang,J.Cheng,S.Roy,and V.C.M.Leung,“Joint user scheduling and power allocation optimization for energy-efficient NOMA systems with imperfect CSI,”IEEE J.Sel.Areas Commun.,vol.35,no.12,pp.2874–2885,Dec.2017.

[12]X.Wang,F.-C.Zheng,P.Zhu,and X.You,“Energy-efficient resource allocation in coordinated downlink multicell OFDMA systems,”IEEE Trans.Veh.Technol.,vol.65,no.3,pp.1395–1408,Mar.2016.

disclosure of Invention

It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method for constructing a radio network resource allocation system, a radio network resource management method based thereon, and a radio communication system.

The invention aims at realizing the following technical scheme:

according to a first aspect of the present invention, there is provided a method for constructing a radio network resource allocation system for obtaining a radio network resource allocation policy according to a radio network state, the method comprising: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the non-convex optimization target obtained in the step S1 to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, taking the non-convex optimization target in the step S2 as a training target, taking the imperfect global channel state information in the step S3 as input, and training an initial resource allocation system to be converged in a reinforcement learning mode, wherein the initial resource allocation system is a system which is constructed based on an intelligent agent and is used for generating an action set based on a wireless network state, and the action set comprises a channel allocation strategy and a power allocation strategy.

In some embodiments of the invention, the wireless communication requirement is to maximize the spectral efficiency of the wireless network, the non-convex optimization objective with outage probability constraints:

wherein,,

wherein,,

representing the average spectral efficiency of the wireless network in time slot t, K representing the total number of links N representing the total number of sub-channels,/->

Representing a set of sub-channel indices,/->

Indicating the scheduling spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for>

Indicating the maximum spectral efficiency of the kth link selecting the nth sub-channel in time slot t,/for the kth link>

Representing the estimated small-scale fading component of the kth link selecting the nth sub-channel in time slot t,/>

Representing the estimated small-scale fading component +.>

Is->

Probability of->

Indicating the power, p, of the kth link to select the nth sub-channel in time slot t ^t Indicating all->

Power set, +.>

Representing the identification value, alpha, of the kth link after the selection of the nth sub-channel in time slot t ^t Indicating all->

Identification value-made set epsilon _out Representing the expected outage probability, P _max Representing the power threshold of the link, constraint M1 represents the estimated small-scale fading component +.>

For the condition that the probability of any one link being interrupted after any one sub-channel is selected in the time slot t is smaller than the expected interruption probability, the constraint M2 indicates that the transmitting power on each link cannot be higher than the power threshold of the link, and the constraints M3 and M4 indicate that each link can only select one sub-channel in each time slot.

In some embodiments of the present invention, in the step S2, the non-convex optimization objective is transformed by means of a parametric transformation to obtain a non-convex optimization objective without the constraint of outage probability:

wherein,,

wherein Ω ^t The wireless network is represented by the average spectrum efficiency in time slot t after parameter transformation.

In some embodiments of the invention, the initial resource allocation system comprises: a channel allocation model for predicting a channel allocation policy of a slot based on imperfect global channel state information of the slot, which is configured as a DQN network, a DDQN network, or a Dueling DQN network; the power allocation model is used to predict a power allocation policy for a slot based on imperfect global channel state information for the slot, which is configured as a DDPG network.

In some embodiments of the invention, the step S4 includes a step S41, a step S42, and a step S43. Wherein, step S41 includes: acquiring imperfect global channel state information of an input time slot and executing the following steps: s411, predicting a channel allocation strategy of an input time slot according to the imperfect global channel state information of the input time slot by a channel allocation model, updating the imperfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot by a power allocation model according to the updated imperfect global channel state information of the input time slot; the channel allocation strategy and the power allocation strategy of the predicted input time slot interact with the wireless network to obtain imperfect global channel state information of the next time slot of the input time slot, the channel allocation strategy of the next time slot of the input time slot is predicted by a channel allocation model according to the imperfect global channel state information of the next time slot of the input time slot, and the channel allocation strategy of the next time slot of the input time slot is updated based on the channel allocation strategy of the next time slot of the input time slot; s412, calculating the frequency spectrum efficiency rewards of the input time slot based on the channel allocation strategy and the power allocation strategy of the input time slot; s413, storing a piece of channel allocation experience into a channel selection playback pool according to the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the input time slot; and storing the experience of power allocation into a power selection playback pool by using the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the updated input time slot. Step S42 includes: the non-perfect global channel state information of the next slot of the last input slot is the non-perfect global channel state information of the new input slot. Step S43 includes: the initial resource allocation system parameters are updated based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence.

In some embodiments of the present invention, in the step S43, updating parameters of the channel allocation model is started when there is a channel allocation experience in the channel selection playback pool; updating parameters of the power allocation model begins when there is a power allocation experience in the power selection playback pool.

In some embodiments of the invention, in said step S43; updating parameters of the channel allocation model for a plurality of times until convergence is achieved after the channel allocation experience in the channel selection playback pool reaches the experience of a preset number, wherein the channel allocation experience is obtained by randomly sampling from the channel selection playback pool during each updating, and the parameters of the channel allocation model are updated in a gradient descent mode based on the sampled channel allocation experience; and updating the parameters of the power distribution model for a plurality of times until convergence is achieved after the power distribution experience in the power selection playback pool reaches the preset number of experiences, wherein a plurality of power distribution experiences are obtained by randomly sampling from the power selection playback pool when each update is performed, and the parameters of the power distribution model are updated in a gradient descent mode based on the sampled power distribution experiences.

In some embodiments of the present invention, in step S41, the non-perfect global channel state information of the input slot includes non-perfect global channel state information of a plurality of links selecting different sub-channels in the input slot:

Wherein,,

wherein,,

a state set representing the kth link selecting the nth sub-channel in time slot t,/for>

Independent channel gain indicating that the kth link selects the nth sub-channel in time slot t in the presence of channel estimation errors,/>

Indicating the channel power of the kth link selecting the nth sub-channel in time slot t,/for>

An identification value indicating that the kth link has selected the nth sub-channel in time slot t-1,/for>

Indicating the power of the kth link selecting the nth sub-channel in time slot t-1, +.>

Representing the spectral efficiency of the kth link corresponding to time slot t-1, < >>

Representing the k-th link selecting the estimated small-scale fading component corresponding to the n-th sub-channel in time slot t +.>

The ratio to total interference power is the rank value over all channels,

indicating that the kth link selects the kth in time slot tCo-channel interference under the sub-channel allocation scheme and the power allocation scheme of the last slot when n sub-channels are used, k' represents other links than k, +.>

Representing the variance of the channel estimation error,/and>

is a large-scale fading component considering shadow fading and geometric fading,/->

Mean value 0, variance ++>

Is a complex gaussian distribution of (c).

In some embodiments of the present invention, the spectral efficiency rewards are calculated as follows:

Wherein,,

wherein,,

representing the spectral efficiency, epsilon, of the kth link corresponding to the nth sub-channel selected in time slot t _out Representing the expected outage probability, +.>

Indicating the scheduling spectral efficiency of the kth link in the time slot t for selecting the nth sub-channel, phi is the weight coefficient of the interference,k' represents a link other than k, < ->

Indicating the external interference of the kth link selecting the nth sub-channel in time slot t +.>

Indicating the spectral efficiency of link k' without kth link interference in the nth sub-channel of time slot t, for>

Indicating the spectral efficiency corresponding to the kth' link selecting the nth sub-channel in time slot t.

In some embodiments of the present invention, the DDPG network comprises an Actor network and a Critic network, and the final resource allocation system is: training to a converged DQN network, DDQN network, or a sustaining DQN network and an Actor network.

According to a second aspect of the present invention, there is provided a wireless network resource management method, the method comprising: t1, acquiring a wireless network state of a wireless communication system in a last time slot; t2, based on the wireless network state of the last time slot obtained in the step T1, predicting a resource allocation strategy of the next time by adopting the resource allocation system obtained by the method of the first aspect of the invention; and T3, distributing wireless network resources in the wireless communication system based on the resource distribution strategy of the next moment obtained in the step T2.

According to a third aspect of the present invention there is provided a wireless communication system comprising a plurality of base stations, each base station comprising a radio resource management unit configured to allocate radio network resources in the base station using the method according to the second aspect of the present invention.

Compared with the prior art, the invention has the advantages that: by adopting the non-convex optimization target with the outage probability constraint corresponding to the wireless communication requirement in the non-perfect global channel state information environment as the training target, the channel estimation error in the actual communication environment can be fully considered, namely, the initial resource allocation system based on learning is trained by adopting more practical CSI (non-perfect global channel state information), so that the convergence rate of the wireless network resource allocation system is improved, and the performance of completing the optimization target is improved.

Drawings

Embodiments of the invention are further described below with reference to the accompanying drawings, in which:

fig. 1 is a flow chart illustrating a method for constructing a wireless network resource allocation system according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an initial distribution system consisting of a Dueling DQN network and a DDPG network for model training and parameter updating architecture according to an embodiment of the present invention;

Fig. 3 is a flowchart illustrating a method for managing wireless network resources according to an embodiment of the present invention;

FIG. 4 is a graph showing the convergence performance of the algorithm proposed in this patent and the four baseline algorithms described above, according to an embodiment of the present invention;

FIG. 5 is a graph showing the relationship between the frequency spectrum efficiency and the variance of the channel estimation error, which can be achieved by the algorithm proposed in the patent and the four baseline algorithms according to the embodiment of the present invention;

fig. 6 is a schematic diagram showing the performance comparison of the algorithm proposed in this patent and the above four baseline algorithms under different numbers of subchannels according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by means of specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

As mentioned in the background art, the existing learning-based method is not designed for the imperfect CSI, and the channel estimation error in the actual communication environment is objectively present and cannot be completely eliminated, so that the optimization objective effect achieved by directly adopting the existing learning-based algorithm in the imperfect CSI environment is poor and the convergence rate of the algorithm is low. In order to solve the above problems, the present invention proposes a wireless network resource allocation scheme based on imperfect CSI, starting from the characteristics of imperfect CSI. The invention further converts the non-convex optimization target with the outage probability constraint into the non-convex optimization target without the outage probability constraint and solves the problem by adopting a learning-based method, so as to improve the effect of the optimization target and the convergence speed of an algorithm which can be achieved in the non-perfect CSI environment.

For a better understanding of the present invention, the following detailed description of the invention is made with reference to the drawings and examples.

According to an embodiment of the present invention, there is provided a method for constructing a wireless network resource allocation system for obtaining a wireless network resource allocation policy according to a wireless network state, as shown in fig. 1, the method including: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the non-convex optimization target obtained in the step S1 to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, taking the non-convex optimization target in the step S2 as a training target, taking the imperfect global channel state information in the step S3 as input, and training an initial resource allocation system to be converged in a reinforcement learning mode, wherein the initial resource allocation system is a system which is constructed based on an intelligent agent and is used for generating an action set based on a wireless network state, and the action set comprises a channel allocation strategy and a power allocation strategy. In order to better describe the specific scheme of the invention, the following is a detailed explanation of several aspects of establishing a non-convex optimization target without interruption probability constraint, model training and experimental verification.

1. Establishment of non-convex optimization targets without outage probability constraints

Since the existing learning-based network resource allocation methods are not designed for imperfect CSI, the following will explain in detail the establishment and conversion from the optimization objective of the wireless communication network for better understanding. Wherein, for ease of understanding, the process is described in terms of formula derivation in the embodiments of the present invention.

The invention firstly carries out mathematical description on the wireless network environment under imperfect CSI, and further models the wireless network environment based on the mathematical description of the environment. The wireless communication network comprises a plurality of communication areas, each communication area is provided with a base station and a plurality of users, all users in the plurality of communication areas share a plurality of sub-channels, each base station is positioned in the center of the area, and authorized users are randomly distributed in the communication area; all users and transceivers of the base station are equipped with one antenna and each formed link can only select one subchannel in one time slot. For example, one downlink in a wireless network has a multi-cell multi-user network scenario, where K links are distributed among M cells and share N orthogonal subchannels, where,

And->

Respectively representing a link index set, a cell index set, and a subchannel index set.

In a wireless communication environment, considering a fully synchronized system with a time interval, the independent channel gain of the kth link to select the nth sub-channel at time slot t can be expressed as:

wherein,,

represents a large-scale fading component taking into account shadow fading and geometric fading, wherein +.>

Is constant in a plurality of time slots; />

Representing the estimated small-scale fading component of the kth link selecting the nth sub-channel at time slot t.

In a wireless communication environment, considering normalized bandwidth, in case of perfect CSI, the kth link selects the maximum spectral efficiency of the nth sub-channel at time slot t:

wherein,,

representing the identification value of the kth link after the selection of the nth sub-channel in time slot t, e.g +.>

Indicating that the kth link selects the nth sub-channel in time slot t, otherwise +.>

Representing the power, sigma, of the kth link selecting the nth sub-channel in time slot t ² Power representing additive white gaussian noise, +.>

Indicating the co-channel interference experienced by the kth link when the nth sub-channel is selected in time slot t.

In a wireless communication environment, due to unavoidable channel estimation errors, in perfect CSI,

Assuming a true value, suchIt is assumed that channel estimation errors in the actual communication environment are ignored. Thus, objective estimation of the small-scale fading component is required. Assuming that the base station can perfectly estimate the large-scale fading coefficient because it changes slowly, and the small-scale fading coefficient changes rapidly, so that it cannot be perfectly estimated, then in one embodiment of the present invention, based on the imperfect CSI, the estimated small-scale fading component of the nth sub-channel selected by the kth link at the time slot t is expressed as:

wherein,,

wherein,,

representing the estimated small-scale fading component of the kth link selecting the nth sub-channel at time slot t,

representing the error of the estimated small-scale fading component of the nth sub-channel selected by the kth link at time slot t, and each

Independent of each other. />

Mean value 0, variance ++>

Is a complex Gaussian distribution of->

Mean value 0, variance ++>

Is a complex Gaussian distribution of->

Representing the variance of the channel estimation error. It should be noted that the defect of the existing perfect CSI information mainly means that the fading coefficient of small scale cannot be estimated perfectly, see formula (3). Channel estimation value of fading coefficient of small scale due to existence of channel estimation error and other factors +. >

Typically not equal to the true value. If the algorithm based on the existing perfect CSI is directly applied to the environment of the imperfect CSI, the method is equivalent to directly adding the estimated value +.>

The resource allocation is performed as a true value, and since there is an error (i.e., a channel estimation error) between the estimated value and the true value, if the network resource allocation is performed directly using the algorithm based on the perfect CSI, the effect of improving the transmission performance and the network capacity is general. In practice, there must be a channel estimation error and other factors, so that CSI cannot be perfectly estimated, and therefore, this imperfect CSI factor needs to be considered, and directly using a resource allocation algorithm based on the existing perfect CSI is equivalent to directly replacing the actual value with the estimated value, which may reduce the performance of the algorithm.

After the mathematical description of the above wireless network environment, the following will require modeling of the optimization problem (i.e., wireless communication requirements). The optimization problem needs to include at least maximizing throughput and maximizing spectral efficiency. Since the maximized throughput and the maximized spectral efficiency can be converted from each other by a formula in definition, in the embodiment of the present invention, the maximized spectral efficiency is taken as an example for modeling explanation, and the modeling of the maximized throughput is not described herein in detail.

Due to the non-perfect CSI, the scheduled spectral efficiency may exceed the maximum achievable spectral efficiency defined by shannon capacity formula. Thus, when scheduling spectral efficiencyBeyond the spectral efficiency achievable with imperfect CSI, outage probability is used as an indicator of performance. The scheduling spectral efficiency when the kth link selects the nth sub-channel at time slot t is expressed as

The average spectral efficiency of the wireless network at time slot t is given by:

further, in the time slot t, under the imperfect CSI, the non-convex optimization objective with outage probability constraint corresponding to the maximum spectrum efficiency of the wireless network is:

wherein,,

wherein,,

Representing a set of sub-channel indices,/->

Representing the estimated small-scale fading component +.>

Is->

Probability of->

Power set, +.>

Since the non-convex optimization objective with outage probability constraints has proven to be an NP-Hard problem (a problem to which all non-deterministic polynomial problems can be reduced within polynomial time complexity) even when the subchannel strategy in the radio network resource is fixed, considering only the power allocation problem, the optimal solution of the non-convex optimization objective with outage probability constraints can be solved directly by mathematical derivation. Aiming at solving the problem, the invention adopts a parameter transformation mode to convert an original optimization target (namely, a non-convex optimization target with interrupt probability constraint) into a non-convex optimization target without interrupt probability constraint (through replacement of constraint conditions and corresponding solving conversion), so that the non-convex optimization target with interrupt probability constraint corresponding to the maximum frequency spectrum efficiency of the wireless network can be solved. The process of converting the original optimization objectives by means of parametric transformations will be described in detail in the following from the two parts of constraint replacement and optimization problem conversion.

In the constraint replacement process, the inventor considers a stricter constraint R1 to replace the outage probability constraint M1, so that the constraint R1 always meets the outage probability constraint M1 in the non-convex optimization target, wherein the constraint R1 is:

wherein,,

representing the noise and interference signal strength of the kth link under the definition of shannon's formula for selecting the nth sub-channel at time slot t,/>

Representing the noise and interference signal strength of the kth link in the actual schedulable selection of the nth sub-channel in time slot t,/->

Indicating the useful signal strength of the kth link in the actual schedulable selection of the nth sub-channel in time slot t,/for>

Indicating the useful signal strength of the kth link under the definition of shannon's formula to select the nth sub-channel at time slot t. Constraint R1-1 is expressed in +.>

For all k and n under the conditions +.>

Less than->

The probability of (2) cannot be greater than

The constraint R1-2 is expressed in +.>

For all k and n under the conditions +.>

Less than->

The probabilities of (2) are equal to +.>

The proving process in which the constraint R1 is more stringent than the interrupt constraint M1 will be explained below. The proving process comprises two parts of parameter definition and proving reasoning.

The parameter definition part is as follows: first defined according to shannon's formula

Similarly, in the case of imperfect CSI, the scheduling spectral efficiency of the kth link to select the nth sub-channel in time slot t is:

Wherein,,

indicating that the kth link under actual schedulability selects the channel-to-interference ratio of the nth sub-channel at time slot t.

From equation (2) and equation (7)

Then

From equation (8)/>

The original outage probability constraint M1 may be written as: />

From equations (7), (8) into equation (9):

from the full probability formula:

wherein,,

wherein Pr (E) ₁ ) Is shown in

And->

Under the condition->

Less than->

Probability of Pr (E) ₂ ) Is indicated at->

And->

Under the condition->

Less than->

Is a probability of (2).

The proving and reasoning part is as follows:

wherein the constraint R1-1 in the constraint R1 proves as follows:

will restrict R1-1

Replaced by->

Then->

And then get->

For Pr (E) ₂ ) For the sake of->

There is->

Must be less than->

The constraints R1-2 in the constraint R1 are demonstrated as follows:

from the full probability formula

Due to Pr (E) ₁ ) Less than or equal to 1, there is->

From the above-described proving process of constraint R1-1 and constraint R1-2 in constraint R1, it can be seen that constraint R1 is a stricter constraint than constraint M1.

In the optimization problem conversion process, the original optimization problem is converted according to a stricter constraint R1. The specific derivation process is explained below from the transformation of constraint R1-1 in constraint R1, the transformation of constraint R1-2, and the transformation of the non-convex optimization objective with the outage probability constraint.

According to the stricter constraint R1-1 mentioned above, it is possible to obtain:

from the Markov inequality, we can get from equation (12):

let the right side of equation (13) equal

Then there are:

according to the stricter constraint R1-2 mentioned above, it is possible to obtain:

wherein F represents a cumulative distribution function of chi-square distribution, let equation (15) equal to

The method can obtain the following steps:

wherein F is ^-1 An inverse Cumulative Distribution Function (CDF) representing the chi-square distribution. Due to

And is also provided with

Substituting these two terms and equation (16) into equation (14) yields: />

Thus, it is possible to obtain:

equation (18) is equivalent to:

thus, the average spectral efficiency of the wireless network after the parameter transformation at time slot t is expressed as:

wherein F is ^-1 An inverse Cumulative Distribution Function (CDF) representing the chi-square distribution.

In summary, the conversion from the non-convex optimization objective with the outage probability constraint to the non-convex optimization objective without the outage probability constraint is:

wherein,,

wherein Ω ^t The wireless network is represented by the average spectrum efficiency in time slot t after parameter transformation. It should be noted that, in the present invention, since the imperfect CSI caused by the channel estimation error is considered in the resource allocation in a more practical scenario, but the imperfect CSI may cause the outage probability, the constraint of having the outage probability in the optimization model cannot be satisfied The original algorithm based on perfect CSI is directly used for solving. Therefore, after the parameter conversion is carried out on the optimization model, a new learning algorithm based on the imperfect CSI is designed aiming at the converted optimization model, and parameters such as the imperfect CSI, channel estimation errors and the like are designed as a part of a state set, so that the deep reinforcement learning network can effectively learn the influence of the imperfect CSI to improve the optimization target effect and the performance of the algorithm which can be achieved by the learning algorithm under the imperfect CSI environment.

2. Model training

The non-convex optimization target with the outage probability constraint is converted through the steps, and the non-convex optimization target (formula 21) without the outage probability constraint is obtained, and still belongs to the NP-Hard problem. Conventional algorithms, such as the solution described in reference [5] mentioned in the background section, require multiple iterations to converge and do not scale well as the number of user links increases. Furthermore, it is very challenging for a centralized controller in a communication system to acquire instantaneous global CSI and send an allocation scheme back to the BS. In order to make the non-convex optimization objective without outage probability constraint resolvable, the joint wireless communication requirement (the present embodiment is described by taking the wireless communication requirement as an example to maximize the spectrum efficiency of the wireless network, but it does not mean that the wireless communication requirement only maximizes the spectrum efficiency) is first decoupled into two sub-problems, namely, a sub-channel selection sub-problem and a power allocation sub-problem. The problem of maximizing the spectral efficiency of the wireless network is then handled with a learning model (initial resource allocation system) that can handle both sub-problems simultaneously to improve the convergence performance of the final resource allocation system and the effectiveness of the optimization objective. The final resource allocation system is obtained by taking a non-convex optimization target as a training target, and adopting a training set formed by non-perfect global channel state information and a resource allocation strategy related to the training target to train the initial resource allocation system.

According to one embodiment of the invention, an initial resource allocation system includes: channel allocation model (also referred to as a first tier network in the present embodiment) and power allocation model (also referred to as a second tier network in the present embodiment)) The method comprises the steps of carrying out a first treatment on the surface of the The channel allocation model is used for predicting the channel allocation strategy of a time slot based on the imperfect global channel state information of the time slot; preferably, the channel allocation model is configured as a DQN network, DDQN network, or a Dueling DQN network; the power allocation model is used for predicting the power allocation strategy of a time slot based on the imperfect global channel state information of the time slot, and is preferably configured as a DDPG network. It should be noted that, the channel allocation sub-problem belongs to a discrete type task, the power allocation sub-problem belongs to a continuous type task, and quantization errors can be avoided from being introduced in the dual-layer learning network architecture formed by the channel allocation model and the power allocation model in the foregoing embodiment. Wherein, for channel allocation, a DQN network, a DDQN network or a Dueling DQN network is adopted to process discrete variable type resources; for power allocation, the channel power is defined by P _max A limited continuous scalar (for some algorithms, such as a value-based DQN algorithm, the action space must be limited, the transmit power may be discretized, the discretization of the continuous variable necessarily results in quantization errors), in order to avoid the channel power being discretized, the second layer network in the present invention employs a DDPG network, which includes an Actor network and a Critic network, the Actor network acting to output the allocated power, the Critic network acting to evaluate the action of the Actor network and update parameters in the Actor network; the layer two network, through the Actor network in the DDPG, can output a power allocation policy consisting of deterministic power allocation actions based on the imperfect global channel state information of one slot. Therefore, the optimal sub-channel action can be learned faster by adopting the DQN network, the DDQN network or the lasting DQN network, and the non-convex optimization target which does not contain the outage probability constraint can be obtained by combining the DDPG to process the continuous variable type resource (channel power distribution), so that the method has faster convergence rate and higher frequency spectrum efficiency compared with the conventional algorithm.

According to one embodiment of the present invention, the step S4 includes a step S41, a step S42, and a step S43 in training the initial resource allocation system. Wherein, step S41 includes: acquiring imperfect global channel state information of an input time slot and executing the following steps: s411, predicting a channel allocation strategy of an input time slot according to the imperfect global channel state information of the input time slot by a channel allocation model, updating the imperfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot by a power allocation model according to the updated imperfect global channel state information of the input time slot; the channel allocation strategy and the power allocation strategy of the predicted input time slot interact with the wireless network to obtain imperfect global channel state information of the next time slot of the input time slot, the channel allocation strategy of the next time slot of the input time slot is predicted by a channel allocation model according to the imperfect global channel state information of the next time slot of the input time slot, and the channel allocation strategy of the next time slot of the input time slot is updated based on the channel allocation strategy of the next time slot of the input time slot; s412, calculating the frequency spectrum efficiency rewards of the input time slot based on the channel allocation strategy and the power allocation strategy of the input time slot; s413, storing a piece of channel allocation experience into a channel selection playback pool according to the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the input time slot; and storing the experience of power allocation into a power selection playback pool by using the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the updated input time slot. Step S42 includes: the non-perfect global channel state information of the next slot of the last input slot is the non-perfect global channel state information of the new input slot. Step S43 includes: the initial resource allocation system parameters are updated based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence. It should be noted that, calculating the spectral efficiency rewards of the input time slots based on the channel allocation policy and the power allocation policy of the input time slots, where the spectral efficiency rewards represent the overall contributions of the channel allocation and the power allocation to the optimization objective, may enable the channel allocation model and the power allocation model to share the same rewards function and work cooperatively with the objective of maximizing the spectral efficiency of the wireless network.

In accordance with one embodiment of the present invention, in the step S43, updating parameters of the channel allocation model is started when there is a channel allocation experience in the channel selection playback pool; updating parameters of the power allocation model begins when there is a power allocation experience in the power selection playback pool.

According to one embodiment of the present invention, in the step S43, the parameters of the channel allocation model are updated a plurality of times until convergence is achieved after the channel allocation experience in the channel selection playback pool reaches a preset number of experiences, wherein each update is performed to obtain a plurality of channel allocation experiences by randomly sampling from the channel selection playback pool, and the parameters of the channel allocation model are updated in a gradient descent manner based on the sampled channel allocation experiences; and updating the parameters of the power distribution model for a plurality of times until convergence is achieved after the power distribution experience in the power selection playback pool reaches the preset number of experiences, wherein a plurality of power distribution experiences are obtained by randomly sampling from the power selection playback pool when each update is performed, and the parameters of the power distribution model are updated in a gradient descent mode based on the sampled power distribution experiences. It should be noted that, when the number of experience pieces in the channel selection playback pool reaches the threshold value, the channel allocation experience stored in the channel selection playback pool is stored in the channel selection playback pool in a manner of replacing the channel allocation experience stored first in the current channel selection playback pool (i.e., in a first-in first-out manner); when the number of experience pieces in the power selection playback pool reaches a threshold value, the power allocation experience stored in the power selection playback pool is stored in the power selection playback pool in a manner that replaces the first stored power allocation experience in the current channel selection playback pool (i.e., a first-in-first-out manner). The experience in the channel selection playback pool or the power selection playback pool reaches the preset number and then starts to be used for random sampling, so that the convergence speed of the channel allocation model or the power allocation model can be increased; the threshold value of the experience number is set, so that the hardware requirement of model training can be reduced; the channel selection playback pool or the power selection playback pool adopts a first-in first-out mode to store experiences, so that the newly generated better experience can well replace relatively worse experience, the experience in the channel selection playback pool or the power selection playback pool is in a storage state of optimal experience when being sampled, and the convergence speed of a channel allocation model or a power allocation model is further accelerated.

In order to better train the initial resource allocation system, the imperfect global channel state information of the input time slot is firstly obtained in the process of training the initial resource allocation system. According to one embodiment of the present invention, in training the initial resource allocation system, the non-perfect global channel state information of the input time slot includes non-perfect global channel state information (also referred to as a state set in the present invention) of a plurality of links selecting different sub-channels in the input time slot, wherein the non-perfect global channel state information of one link after selecting the different sub-channels in the input time slot is a state set, and the non-perfect global channel state information of the plurality of links after selecting the different sub-channels in the input time slot is:

wherein,,

wherein,,

Indicating the power of the kth link selecting the nth sub-channel in time slot t-1, +. >

The ratio to total interference power is the rank value over all channels,

indicating that the kth link adopts the subchannel allocation scheme of the last time slot and co-channel interference under the power allocation scheme when the nth subchannel is selected by the time slot t, k' indicating other links different from k>

Representing the variance of the channel estimation error,/and>

Mean value 0 and variance 0

Is a complex gaussian distribution of (c). According to one embodiment of the invention, when the imperfect global channel state information of the input time slot is a state set, the first layer network is configured as the same number of channel allocation models as the number of links, and the second layer network is configured as the same number of power allocation models as the number of links, wherein one state set is processed by one channel allocation model and one power allocation model. According to one embodiment of the invention, when the imperfect global channel state information of the incoming time slot is a state set, the first layer network is configured as a channel allocation model and the second layer network is configured as a power allocation model, wherein each state set in the state set is processed in turn by one channel allocation model and one power allocation model. The first layer network is configured as the same number of channel allocation models as the number of links and the second layer network is configured as the same number of power allocation models as the number of links, so that the processing speed of the initial resource allocation model can be improved. According to one embodiment of the present invention, when the non-perfect global channel state information of the input slot is a state set in step S411, the non-perfect global channel state information of the input slot is updated based on the predicted channel allocation policy as follows: and selecting a state set corresponding to the channel allocation strategy for executing the prediction from the state set based on the channel allocation strategy predicted by the channel allocation model as the imperfect global channel state information for updating the input time slot.

It should be noted that, the selection of the state set is important for the training effect of the initial resource allocation system, and the selection of the state set should embody the feature of the imperfect CSI, that means that the relevant channel state information must be selected to embody the information of the imperfect CSI as the element in the state set, so as to avoid the unnecessary channel state information in the imperfect global channel state information, and improve the training effect of the initial resource allocation model. The variance of the channel estimation error, the estimated channel gain (independent channel gain), and the ranking value of the ratio of the estimated small-scale fading component corresponding to each link in a certain time slot selection sub-channel to the total interference power on all channels are key features which can best embody the channel state information corresponding to the imperfect CSI.

According to one embodiment of the invention, the spectral efficiency rewards are calculated as follows:

wherein,,

wherein,,

Indicating the scheduling spectral efficiency of the kth link in time slot t for selecting the nth sub-channel, phi being the weighting coefficient of the interference, k' indicating other links than k, < >>

Indicating the spectral efficiency corresponding to the kth' link selecting the nth sub-channel in time slot t. It should be noted that by defining the weight coefficient of the interference, the variance of the reward function can be reduced, and preferably, Φ=1.

In order to better explain the parameter updating process of the initial resource allocation system of the present invention, the following description will be given by taking an example of the process of performing parameter updating in a manner of randomly sampling from the channel selection playback pool and the power selection playback pool in the initial resource allocation system composed of the Dueling DQN network and the DDPG network. It should be noted that, selecting the lasting DQN network can distinguish whether the spectrum efficiency gain depends on the sub-channel action taken, or simply because the input state set is better, the channel allocation task can be well completed, so as to further improve the performance of the initial resource allocation system after convergence.

As shown in fig. 2, the initial resource allocation system is composed of a Dueling DQN Network, a DDPG (including an Actor Network) and a Critic Network (Critic Network), wherein the Actor Network and the Critic Network constitute the DDPG Network. It should be noted that, the above-mentioned channel allocation policy corresponds to a channel selection action of each link to a subchannel in each time slot, and the power allocation policy corresponds to a power allocation action of each link to the selected subchannel in each time slot. The process of generating the channel allocation experience and the power allocation experience will be schematically explained in the following in the initial resource allocation system: state set of kth link in time slot t

Inputting a Dueling DQN network, the Dueling DQN network predicts a channel selection action of the sub-channel +.>

Channel selection action in predicted sub-channel>

After that, based on->

Update->

Get updated state set +.>

(i.e.)>

Middle->

Corresponding state set->

) And use->

As an input to the DDPG network, an Actor network in the DDPG network is based on +.>

Predicting power allocation actions of sub-channels +.>

The base station performs two actions in sequence at the beginning of the time slot t (/ -)>

And->

) To determine its associated sub-channel and the transmit power of this sub-channel, the base station performs in sequence two actions (/ -) >

And->

) Then, after interacting with the wireless network environment (i.e. ultra-dense network environment under imperfect CSI), a state set of next time slot t+1 is generated>

And channel selection actions of sub-channels predicted based on channel allocation model->

And channel allocation actions of sub-channels predicted by the power allocation model +.>

Calculating spectral efficiency rewards->

The input lasting DQN network predicts the selection of sub-channels>

Channel selection action in predicted sub-channel>

After that, based on->

Update->

Get updated state set +.>

(i.e.)>

In (a)

Corresponding state set->

). Wherein, will->

Storing in a channel selection playback pool (shown in FIG. 2) as a channel allocation experience and storing +.>

The power selection playback pool (shown in fig. 2) is stored as a power allocation experience. Whereby channel allocation experience and power allocation experience are continuously generated during training of the initial resource allocation system.

In the parameter updating process, parameters of the reducing DQN are updated to converge based on experience in the channel selection playback pool and parameters of the DDPG network are updated to converge based on experience in the power selection playback pool. It should be noted that, the Critic network in the DDPG is only used during training, and only the Actor network is needed to perform power allocation during actual deployment, that is, taking an initial resource allocation system formed by the lasting DQN network and the DDPG network as an example, the final resource allocation system is as follows: training to converge on the Dueling DQN network and the Actor network.

Since training the Dueling DQN and DDPG network to converge is a process known to those skilled in the art, specific conditions for convergence are not described herein, and the initial resource allocation system composed of the Dueling DQN network and the DDPG network continues to explain a parameter update process of the initial resource allocation system. Preferably, a parameter updating process of the initial resource allocation system is explained by adopting a mode of randomly sampling a channel selection playback pool to obtain a plurality of channel allocation experiences and calculating a gradient to update parameters of the lasting DQN network based on the sampled channel allocation experiences, and a mode of randomly sampling a power selection playback pool to obtain a plurality of power allocation experiences and calculating a gradient to update parameters of the DDPG network based on the sampled power allocation experiences.

Preferably, the channel allocation experience set B is derived from randomly sampling channel allocation experiences of a channel selection playback pool ₁ And calculates the gradient to update the parameters of the lasting DQN network using the following rules:

wherein,,

wherein θ _c Representing trainable parameters of hidden layer in a Dueling DQN network, β representing trainable parameters of fully connected layer of value function V, χ representing trainable parameters of fully connected layer of dominance function a, B ₁ Represents a set of channel allocation experiences for random sampling, wherein,

representation B ₁ One of the channel allocation experiences, |b ₁ I represents the number of experience bars of the channel allocation experience set, < +.>

Representing the target value of the lasting DQN, < >>

A set of state information representing time slot t+1, Q'>

The Q function value representing time slot t, gamma' represents the discount coefficient in the lasting DQN network,

respectively represent Dueling DParameters of the target network in the QN network, +.>

Sub-channel selection action representing time slot t+1, < >>

The representation is at->

Cost function of time->

Is indicated in the state->

Time-selective action->

Dominance function value->

Is indicated in the state->

Time-selective action->

Dominance function value, |A| represents +.>

Is a sum of the number of times of calculation.

For updates of the DDPG network, the DDPG network approximates a behavior value function Q (s, a) (Critic network) and an action function u using a neural network _θ (s) (Actor network).

Preferably, in order to update the network parameter θ of the Critic network _Q A power distribution experience set B is obtained from random sampling power distribution experiences of a power selection playback pool by adopting a time difference (TD, temporal difference) error method ₂ And the parameters of the Critic network are updated by calculating the minimum mean square error under the following rules:

wherein,,

wherein B is ₂ Represents a set of randomly sampled power allocation experiences, wherein,

represents one power allocation experience in the power allocation experience set,/->

The Q function representing Critic network is at input +.>

Function value of time->

Channel selection action (determined by the Actor network activation function) indicating that the kth link of the Actor network outputs a fixed subchannel over the aging DQN at time slot t>

Power allocation action of determination of power of rear output, |b ₂ I represents the empirical number of the randomly sampled power allocation empirical set, y' represents the target value of the DDPG network, y represents the discount coefficient in the DDPG network, +.>

The Q function representing Critic network is at input +.>

Function value at that time.

Based on collected power allocation experience set B ₂ And the following rules are used to calculate gradients to update the parameters of the Actor network:

wherein,,

the Q function representing the Critic network is at the input +.>

The function value of the time is calculated,

the Q function representing the Critic network is at the input +.>

Function value at that time.

It should be noted that, when the initial distribution network is formed by the DQN network or the DDQN network and the DDPG network, the Q function and the update process of the DQN network or the DDQN network are known to those skilled in the art, and the parameter update process of the DQN network or the DDQN network is not described here.

In addition, on the basis of the above-mentioned construction method of the radio network resource allocation system, as shown in fig. 3, the present invention further provides a radio network resource management method, which includes mathematically describing the whole radio network environment by adopting the construction process of the non-convex optimization target with the outage probability constraint in the method for constructing the radio network resource allocation system, forming the non-convex optimization target with the outage probability constraint based on the mathematical description (i.e. the model construction in fig. 3), then converting the non-convex optimization target with the outage probability constraint into the non-convex optimization target without the outage probability constraint (i.e. the parameter transformation in fig. 3), adopting the initial resource allocation system to perform solution training (i.e. the dual-layer network architecture in fig. 3), and when the initial resource allocation system converges, obtaining the resource allocation scheme by using the resource allocation system to allocate the radio network resources in the radio communication system. According to one embodiment of the present invention, a wireless network resource management method includes: t1, acquiring a wireless network state of a wireless communication system in a last time slot; t2, based on the wireless network state of the last time slot obtained in the step T1, predicting a resource allocation strategy of the wireless resource allocation system at the next time by adopting the method for constructing the wireless network resource allocation system; and T3, distributing wireless network resources in the wireless communication system based on the resource distribution strategy of the next moment obtained in the step T2.

The present invention also provides a radio communication system including a plurality of base stations each including a radio resource management unit configured to allocate radio network resources in the base station using the above-described radio network resource management method. The wireless network resource distribution system configured in the wireless resource management unit adopts a centralized training mode in a training stage, namely, a state set in a multi-cell multi-user network scene is selected for training; and in the deployment stage, the trained wireless network resource distribution system obtained by centralized training is distributed to each base station. Thereby improving the effect of wireless communication system for allocating wireless network resources.

3. Experiment verification

In order to better illustrate the technical effects of the invention, the following simulation experiments are used for verification.

First, the simulation parameter setting is introduced: setting the wireless network scene as a downlink with multi-cell multi-user network scene, wherein K links are distributed in M cells and share N orthogonal sub-channels, namely M/K users are arranged in each cell; for cell i, the base station BS is located in the center of cell i and serves M/K users randomly distributed within the cell; the large-scale path loss is 128.1+37.6log ₁₀ (d) Calculation, where d is the distance from the transmitter to the receiverIn kilometers; the upper limit of the SINR (signal to interference plus noise ratio) received by the user is set to be 30dB, and the noise power sigma ² Setting to-114 dBm, the optimization problem is required to select the maximum spectral efficiency, and the initial distribution network of the present invention employs a Dueling DQN network (i.e., a Dueling DQN network) and a DDPG network, each having three hidden layers and 200, and 100 neurons, respectively. Simulation parameters in addition to the above settings, the detailed setting parameters of the remaining simulation parameters are shown in table 1.

TABLE 1

Simulation parameters	Value of	Simulation parameters	Value of
				Cell radius	200m	Power threshold of link	38dB
Interrupt probability	0.1	Channel estimation error variance	0.1
				Time slot spacing	20ms	Weighting coefficients of interference	1

In the simulation experiment, the method for constructing the wireless network resource allocation system (hereinafter referred to as the algorithm) is adopted to carry out three groups of comparison experiments (testing the convergence, generalization capability and spectrum efficiency performance of a training stage and the performance of the spectrum efficiency of the algorithm under different subchannel numbers) together with other baseline algorithms, wherein the baseline algorithms comprise a random algorithm, a random algorithm and an FP algorithm (split planning algorithm) described in a reference document [5] in the background art, a joint learning algorithm (joint learning algorithm) described in a reference document [7] and a distribution learning algorithm described in a reference document [9 ]; specifically, the partial planning algorithm comprises randomly allocated sub-channels and power values and a traditional model-driven algorithm, and has higher computational complexity; the joint learning algorithm adopts a DQN network to optimize two variables, namely a subchannel and power; the distribution learning algorithm optimizes the sub-channels with DQN and optimizes the power with DDPG under perfect CSI.

In the training stage, the training process of the algorithm and the four baseline algorithms comprises 20 epodes, each epode comprises 2000 time slots, namely, the algorithm stops training and parameter updating after a fixed 2000 time steps are carried out in one epode, and new user distribution is set and learning rate and other parameters are reset at the beginning of each epode, so that the algorithm and the four baseline algorithms are converged. The algorithm proposed in this patent and the data corresponding to the above four baseline algorithms will be described in comparison.

In order to test the convergence of the algorithm proposed by the patent and the four baseline algorithms in the training phase, only parameters such as learning rate and the like are reset in each epoode, and the distribution of users is not updated. When the number of users is 25 and the number of base stations and the number of sub-channels is 5, the convergence performance of the algorithm proposed by the present patent and the four baseline algorithms is shown in fig. 4, and it can be seen that the algorithm proposed by the present patent and the four baseline algorithms converge to about 4.0bps/Hz at each epicode base (except for the random algorithm and the split planning algorithm). In the converged algorithm, the convergence speed of the algorithm provided by the patent is faster than that of a reference algorithm (a joint learning algorithm and a distributed learning algorithm) based on learning, in addition, the frequency spectrum efficiency of the four baseline algorithms is lower than that of the algorithm provided by the patent, and when the iteration times are fewer, namely 5000-6000 iterations, the frequency spectrum efficiency of the four baseline algorithms is far lower than that of the scheme. Thus, the algorithm provided by the patent has significant advantages in terms of convergence rate.

In generalization capability and spectral efficiency performance tests, it is not practical to track the environment through frequent online training, since the channel estimation errors in a real dynamic wireless communication scenario are time-varying. Thus, the generalization ability of the algorithm is very important in a constantly changing environment. The training model with the best result when the channel estimation error variance in the above simulation parameter setting is set to 0.01. Tests are carried out under different conditions, and the relation between the frequency spectrum efficiency and the channel estimation error variance, which can be realized by the algorithm provided by the patent and the four baseline algorithms, is shown in fig. 5: the performance of the stochastic algorithm and the fractional programming algorithm deteriorates with increasing channel estimation error; the spectral efficiency of the joint learning algorithm, the distribution learning algorithm and the algorithms proposed by this patent remains almost unchanged with the variation of the channel estimation error. Therefore, the algorithm provided by the patent has stronger generalization capability under different channel estimation errors and can realize higher spectrum efficiency performance.

In the performance test of the spectrum efficiency of the algorithm under different numbers of sub-channels, when the variance of the channel estimation error is 0.1, the performance of the spectrum efficiency of the algorithm provided by the patent and the four baseline algorithms under different numbers of sub-channels is shown in fig. 6: the average spectrum efficiency of each link of the algorithm proposed by the patent and the four baseline algorithms can be gradually increased along with the increase of the number of the sub-channels, however, the algorithm proposed by the patent can obtain higher spectrum efficiency than the four baseline algorithms. That is, the algorithm proposed by the present patent is easier to expand in a multi-cell network and performs better as the number of subchannels increases.

In summary, since the existing learning-based method is not designed for the imperfect CSI, and the channel estimation error in the actual communication environment exists objectively and cannot be completely eliminated, the optimization objective effect (i.e., communication performance) achieved by directly adopting the existing learning-based algorithm in the imperfect CSI environment is poor and the convergence rate of the algorithm is low. The estimated channel gain and the corresponding error (represented by the variance of the channel estimation error) are designed to the state set of the initial resource allocation system, and the corresponding reward (for example, spectrum efficiency reward) is designed in a targeted manner, so that the initial resource allocation system can obtain better gain under the channel estimation error, the convergence rate of the initial resource allocation system is better than that of the baseline algorithm, and the spectrum efficiency performance can be obviously improved, that is, the provided method for constructing the wireless network resource allocation system is more suitable for obtaining the final resource allocation system in the actual dynamic wireless communication environment (imperfect CSI).

It should be noted that, although the steps are described above in a specific order, it is not meant to necessarily be performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order, as long as the required functions are achieved.

The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.

The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical coding devices, punch cards or in-groove structures such as punch cards or grooves having instructions stored thereon, and any suitable combination of the foregoing.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for constructing a wireless network resource allocation system for obtaining a wireless network resource allocation policy according to a wireless network state, the method comprising:

s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment;

s2, converting the non-convex optimization target obtained in the step S1 to obtain a non-convex optimization target without interruption probability constraint;

s3, acquiring imperfect global channel state information of the wireless network;

s4, taking the non-convex optimization target in the step S2 as a training target, taking the imperfect global channel state information in the step S3 as input, and training an initial resource allocation system to be converged in a reinforcement learning mode, wherein the initial resource allocation system is a system which is constructed based on an intelligent agent and is used for generating an action set based on a wireless network state, and the action set comprises a channel allocation strategy and a power allocation strategy.

2. The method of claim 1, wherein the wireless communication requirement is to maximize spectral efficiency of a wireless network, the non-convex optimization objective with outage probability constraints:

Wherein,,

wherein,,

represents the average spectral efficiency of the wireless network at time slot t, K represents the total number of links N represents the total number of subchannels,

representing a set of sub-channel indices,/->

Indicating the maximum spectral efficiency of the kth link selecting the nth sub-channel in time slot y,/for the kth link>

Representing the estimated small-scale fading component +.>

Is->

Probability of->

Power set, +.>

3. The method according to claim 2, characterized in that in said step S2, the non-convex optimization objective is transformed by means of a parametric transformation to obtain a non-convex optimization objective free of outage probability constraints:

wherein,,

4. A method according to claim 3, wherein the initial resource allocation system comprises:

a channel allocation model for predicting a channel allocation policy of a slot based on imperfect global channel state information of the slot, which is configured as a DQN network, a DDQN network, or a lasting DQN network;

a power allocation model for predicting a power allocation policy for a slot based on imperfect global channel state information for the slot, which is configured as a DDPG network.

5. The method according to claim 4, wherein the step S4 includes:

s41, acquiring imperfect global channel state information of an input time slot and executing the following steps:

s411, predicting a channel allocation strategy of an input time slot according to the imperfect global channel state information of the input time slot by a channel allocation model, updating the imperfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot by a power allocation model according to the updated imperfect global channel state information of the input time slot; the channel allocation strategy and the power allocation strategy of the predicted input time slot are interacted with the wireless network to obtain the imperfect global channel state information of the next time slot of the input time slot,

Predicting a channel allocation strategy of the next time slot of the input time slot according to the imperfect global channel state information of the next time slot of the input time slot by a channel allocation model, and updating the channel allocation strategy of the next time slot of the input time slot based on the channel allocation strategy of the next time slot of the input time slot;

s412, calculating the frequency spectrum efficiency rewards of the input time slot based on the channel allocation strategy and the power allocation strategy of the input time slot;

s413, storing a piece of channel allocation experience into a channel selection playback pool according to the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the frequency spectrum efficiency rewards of the input time slot and the non-perfect global channel state information of the next time slot of the input time slot; storing a power allocation experience into a power selection playback pool by using updated imperfect global channel state information of an input time slot, a power allocation strategy of the input time slot, a spectrum efficiency reward of the input time slot and updated imperfect global channel state information of a next time slot of the input time slot;

s42, the imperfect global channel state information of the next time slot of the last input time slot is the imperfect global channel state information of the new input time slot;

S43, updating the initial resource allocation system parameters based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence.

6. The method according to claim 5, wherein in said step S43, updating parameters of the channel allocation model is started when there is a channel allocation experience in the channel selection playback pool; updating parameters of the power allocation model begins when there is a power allocation experience in the power selection playback pool.

7. The method according to claim 5, characterized in that in said step S43;

updating parameters of the channel allocation model for a plurality of times until convergence is achieved after the channel allocation experience in the channel selection playback pool reaches the experience of a preset number, wherein the channel allocation experience is obtained by randomly sampling from the channel selection playback pool during each updating, and the parameters of the channel allocation model are updated in a gradient descent mode based on the sampled channel allocation experience;

and updating the parameters of the power distribution model for a plurality of times until convergence is achieved after the power distribution experience in the power selection playback pool reaches the preset number of experiences, wherein a plurality of power distribution experiences are obtained by randomly sampling from the power selection playback pool when each update is performed, and the parameters of the power distribution model are updated in a gradient descent mode based on the sampled power distribution experiences.

8. The method according to any of claims 4-7, wherein the non-perfect global channel state information of the input slot in step S41 comprises non-perfect global channel state information of a plurality of links selecting different sub-channels in the input slot:

wherein,,

wherein,,

The ratio to total interference power is the rank value over all channels,

indicating the co-channel interference under the sub-channel allocation scheme and the power allocation scheme of the last time slot when the kth link selects the nth sub-channel in the time slot t, k ^′ Representing a link other than k +.>

Representing the variance of the channel estimation error,/and >

Mean value 0, variance ++>

Is a complex gaussian distribution of (c).

9. The method of claim 8, wherein the spectral efficiency prize is calculated by:

wherein,,

wherein,,

Representing the scheduling spectral efficiency of the kth link in time slot t for selecting the nth sub-channel, phi being the weighting coefficient of the interference, k ^′ Representing a link other than k +.>

Link k indicating no kth link interference in the nth sub-channel of time slot t ^′ Is used for the spectrum efficiency of the (c) spectrum,

represents the kth ^′ The link selects the spectral efficiency corresponding to the nth sub-channel in time slot t.

10. The method of claim 9, wherein the DDPG network comprises an Actor network and a Critic network,

the final resource allocation system is as follows: training to a converged DQN network, DDQN network, or a sustaining DQN network and an Actor network.

11. A method of wireless network resource management, the method comprising:

T1, acquiring a wireless network state of a wireless communication system in a last time slot;

t2, based on the wireless network state of the last time slot obtained in the step T1, predicting a resource allocation policy at the next time by using the resource allocation system obtained by the method according to any one of claims 1 to 10;

and T3, distributing wireless network resources in the wireless communication system based on the resource distribution strategy of the next moment obtained in the step T2.

12. A wireless communication system comprising a plurality of base stations, wherein each base station comprises a radio resource management unit configured to allocate radio network resources in the base station using the method as claimed in claim 11.

13. A computer readable storage medium, having stored thereon a computer program executable by a processor to implement the steps of the method of any one of claims 1 to 11.

14. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to perform the steps of the method of any of claims 1-11.