CN114051252A

CN114051252A - Multi-user intelligent transmitting power control method in wireless access network

Info

Publication number: CN114051252A
Application number: CN202111145720.3A
Authority: CN
Inventors: 张先超; 赵耀; 张庆华
Original assignee: Jiaxing University
Current assignee: Jiaxing University
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2022-02-15
Anticipated expiration: 2041-09-28
Also published as: CN114051252B

Abstract

The invention relates to a multi-user intelligent transmitting power control method in a wireless access network, which comprises the following steps: modeling and analyzing a communication system of each wireless access device which is accessed to the network to obtain a global channel state and a global sequence state of the wireless access device; determining a power control strategy of each wireless access device based on a Markov decision process of multiple individuals; determining an optimization target model of the power control strategy according to the average uplink transmission power consumption and the average uplink communication time delay of the wireless access equipment under the power control strategy; training the power control strategy by using a multi-agent deep reinforcement learning method to obtain a trained strategy network; and each wireless access device carries out intelligent transmission power control according to the trained strategy network. The invention reduces the time delay and power consumption of the whole uplink communication system, provides high-quality communication service by using limited resources, and has good realizability and expandability due to low complexity and distributed decision.

Description

Multi-user intelligent transmitting power control method in wireless access network

Technical Field

The invention relates to the technical field of communication, in particular to a multi-user intelligent transmission power control method in a wireless access network.

Background

In recent years, with the rapid development of mobile internet and artificial intelligence technology, smart wireless access devices such as smart phones, Augmented Reality (AR), Virtual Reality (VR), and the like, and smart applications such as telemedicine, industrial 4.0, autonomous driving, and the like, have entered a explosive growth stage, which means that a large number of wireless access devices will be accessed to a communication network, and the requirements of the smart wireless access devices on communication performance are more stringent and diversified than those of the prior mobile phones. In order to guarantee the communication service quality and experience of the access user, the limited wireless communication resources must be reasonably configured. The transmission power in these resources exerts direct and crucial influence, the power is low, the natural communication quality is poor, the power is high, the interference problem of multiple users can be caused, the communication quality is reduced, meanwhile, the problem of high power consumption of the wireless access equipment is concerned, and therefore, the control of the transmission power of multiple users in the future wireless access network is a key problem in the field of current wireless communication.

However, the current power control method based on the model and the numerical optimization algorithm faces the problems of difficult modeling, high algorithm complexity, too long solving time and the like when facing a future complex access network, and the method always needs to be re-optimized to adapt to new parameters when the environment changes, so that the method is difficult to be used for power control in practice. Therefore, the method considers a complex channel environment and a user demand queue, and performs distributed intelligent control on the transmitting power of multiple users in a wireless access network based on a multi-agent deep reinforcement learning technology so as to realize high-quality communication service with low power consumption and low time delay.

Disclosure of Invention

In view of the above analysis, the present invention aims to provide a method for controlling multi-user intelligent transmission power in a radio access network, which solves the problem that the prior art is difficult to be applied to the future radio access network.

The technical scheme provided by the invention is as follows:

the invention discloses a multi-user intelligent transmitting power control method in a wireless access network, which comprises the following steps:

modeling and analyzing a communication system of each wireless access device which is accessed to the network to obtain a global channel state and a global sequence state of the wireless access device;

determining a power control strategy of each wireless access device based on a Markov decision process of multiple individuals; determining an optimization target model of the power control strategy according to the average uplink transmission power consumption and the average uplink communication time delay of the wireless access equipment under the power control strategy;

training the power control strategy by using a multi-agent deep reinforcement learning method to obtain a trained strategy network;

and each wireless access device carries out intelligent transmission power control according to the trained strategy network.

Further, each wireless access device accessing the network performs uplink communication with a single base station in an OFDMA access manner, where the number of allocable subcarriers of the OFDMA is less than the number of wireless access devices; the OFDMA is non-orthogonal multiplexing of carriers, and carries information of more than one radio access device on the same subcarrier.

Further, in the non-orthogonal multiplexing, the achievable data rate of the base station receiving the wireless access device k on the subcarrier m is:

wherein H_k,m(t) the channel state information of the wireless access equipment k on the subcarrier m at the time t; p_k,m(t) transmitting power information of the wireless access equipment k on the subcarrier m at the moment t; h_j,m(t) channel state information of the wireless access equipment j in the subcarrier m at the time t; p_j,m(t) transmitting power information of the wireless access equipment j on the subcarrier m at the time t; Γ is the SINR gap due to the signal modulation multiplexing mode; n is a radical of₀Is the noise power.

Further, the queue of the wireless access device k on the subcarrier m is dynamic:

I_k(t) is the waiting transmission of wireless access equipment k at time tThe length of the sequence; c_k,m(t) is the achievable data rate for the base station to receive wireless access device k on subcarrier M, where M is the number of subcarriers.

Further, in step S2, based on the markov decision process, the wireless access device k controls the policy pi according to the corresponding power_kSelection action a_k(ii) a Entering a next state S (t +1) according to the current state S (t) of the wireless access equipment and the actions of all the wireless access equipment; and, each wireless access device gets a corresponding reward function r during state transition_k(t)＝r(S(t),a_k(t), S (t +1)), and obtains the observed quantity o of the new state by itself_k(t + 1); in the power control strategy, each wireless access device strives to maximize its own long-term return as

Where γ is the discount factor and T is the length of time.

Further, the optimization objective model of the power control strategy establishes the transmission power control problem of the multiple radio access devices in the radio access network according to the low power consumption and low time delay objectives as follows:

α_kand beta_kRespectively weighting positive values corresponding to the power consumption and the time delay of the wireless access equipment;

for control strategy pi_kNext, average uplink transmission power consumption and average uplink communication time delay of the wireless access equipment k; p_maxMaximum transmitting power for the wireless access equipment; p_k,m(t) transmitting power information of the wireless access equipment k on the subcarrier m at the moment t; m is the number of subcarriers;

the reward for each wireless access device in the optimization objective model is:

k is the number of wireless access devices; l is_k(t) queue dynamics for wireless access device k on subcarrier m; lambda [ alpha ]_kThe average arrival rate of packets for wireless access device k.

Further, the process of training the power control strategy by using a multi-agent deep reinforcement learning method comprises:

step S301, operating the power control strategy of each wireless access device in each iteration within the time length T; the central node of the wireless access network collects the action, the state and the reward of each wireless access device;

step S302, calculating estimated advantage values of all wireless access devices;

step S303, traversing all wireless access devices, wherein each wireless access device acquires channel state information in the reward and observation values of the wireless access device from the central node, acquires queue state information from the wireless access device, and combines the queue state information to obtain a final observation value of the wireless access device;

step S304, according to the final observation value, each wireless access device locally updates the corresponding strategy parameters by using a gradient descent method;

step S305, the central node updates the network parameters of the dominance function corresponding to each wireless access device by using a gradient descent method;

step S306, adding 1 to the number of rounds, and starting to iteratively execute the training process from step S301 again;

after iteration is carried out for the maximum round times, the algorithm is converged, and the trained strategy network is output.

Further, in step S302, the merit function for calculating the estimated merit value of the wireless access device is:

wherein the time parameter N is 0,1,2, …, N-1; n-1 is the number of time points corresponding to the time length T; gamma, lambda belongs to [0,1 ]]A discounting factor that trades off estimation bias against variance; v_k(S(t)；φ_k) State S (t) of wireless access equipment and neural network parameter phi at time t for wireless access equipment k_kA centralized cost function of; r is_k(t) is a prize for the wireless access device k.

Further, in step S305, the central node updates the minimum loss function of the dominance function network parameter corresponding to each wireless access device by using a gradient descent method;

further, in step S306, the objective function of each wireless access device locally updating the corresponding policy parameter by using the gradient descent method is as follows:

wherein l_k(t；θ_k) Indicating an adjustment control strategy pi_kParameter theta_kLikelihood ratios between the old and new policies; clip (l)_k(t；θ_k) 1- ε,1+ ε) means that_k(t；θ_k) Clipping at 1-epsilon, 1+ epsilon]An interval; epsilon is the error;

is an estimate of the merit function.

The invention has the beneficial effects that:

the invention takes the requirement of the future wireless access network as a starting point, and takes the environment variability and complexity of the future wireless access network into consideration, provides a multi-user intelligent power control method, reduces the time delay and the power consumption of the whole uplink communication system, provides high-quality communication service by using limited resources, and has good realizability and expandability due to low complexity and distributed decision.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

FIG. 1 is a flow chart of a multi-user intelligent transmission power control method in an embodiment of the present invention;

FIG. 2 is a block diagram of multi-agent deep reinforcement learning in an embodiment of the present invention;

FIG. 3 is a flow chart of a multi-agent proximity policy optimization method in an embodiment of the present invention;

FIG. 4 is a pseudo-code example diagram of a multi-agent proximity policy optimization algorithm in an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.

In the communication system in this embodiment, taking uplink communication between the base station and the ground wireless access device as an example, 50 wireless access devices are randomly distributed in an area with a diameter of 1km and perform uplink communication with a single base station, a total available communication bandwidth is 10MHz, the number of OFDMA available subcarriers is 20, and a communication channel path loss is 120.9+37.6log₁₀d (unit is dB), wherein d is the distance between the transmitting end and the receiving end, the Doppler frequency is set to be 10Hz, and the SINR gap is 7.5 dB. The average arrival rate of the data packets is 4Mbps, the maximum transmitting power of the wireless access equipment is 38dBm, the total time step is 1s, the data packets are divided into 1000 time blocks, and the discount coefficients are respectively set to be gamma-0.98 and lambda-0.96. Training was performed for a total of 10000 iterations.

The implementation of the method requires that an environment simulation platform is firstly set up (or in an actual environment) to train and learn the power control strategies of a plurality of wireless access devices. After the algorithm is converged, the trained strategy is applied to an actual wireless access network, and the wireless access equipment is used as an intelligent agent to carry out intelligent power control. Each agent makes intelligent power control decision through collected self user information (queue state information) and partial environment information (self channel state information). Therefore, the long-term low-power-consumption low-time-delay wireless access network multi-user high-quality communication service is realized.

The method for controlling multi-user intelligent transmission power in a radio access network disclosed by the embodiment, as shown in fig. 1, includes the following steps:

step S101, modeling and analyzing a communication system of each wireless access device which is accessed to the network to obtain a global channel state and a global sequence state of the wireless access device;

step S102, determining a power control strategy of each wireless access device based on a Markov decision process of a plurality of individuals; determining an optimization target model of the power control strategy according to the average uplink transmission power consumption and the average uplink communication time delay of the wireless access equipment under the power control strategy;

s103, training the power control strategy by using a multi-agent deep reinforcement learning method to obtain a trained strategy network;

and step S104, each wireless access device carries out intelligent transmission power control according to the trained strategy network.

In this embodiment, the communication service quality of multiple users in the radio access network is optimized, and therefore, in step S101, the modeling analysis performed on the communication system of the radio access device includes:

1) calculating the transmission rate of the wireless access equipment;

each wireless access device which is accessed to the network carries out uplink communication with a single base station in an OFDMA access mode, wherein the number of the allocable subcarriers of the OFDMA is less than the number of the wireless access devices; the OFDMA allows non-orthogonal multiplexing of carriers, piggybacking information of more than one radio access device on the same subcarrier.

Specifically, in the communication system of this embodiment, it is assumed that K intelligent wireless access devices perform uplink communication with a single base station in an OFDMA access manner, where the number of assignable subcarriers of OFDMA is M, in order to better simulate a future large number of wireless accessesIn the case of entering the device, let M < K, and further to reduce the queue latency and improve the spectrum utilization, non-orthogonal multiplexing of carriers is allowed here, which means that information of more than one wireless access device may be carried on the same subcarrier. Suppose that the transmission power of the kth wireless access device on subcarrier m at time t is P_k,m(t) the transmission signal is x_k,m(t) of (d). Then the base station receives at time t and locates at subcarrier m, and the signal from the kth radio access equipment can be represented as:

wherein h is_k,m(t) complex channel coefficient on subcarrier m between wireless access device k and base station at time t, z_k,m(t) is independent and equally distributed complex white Gaussian noise, and the noise power is set to be N₀. Order to

Represents global Channel State Information (CSI), where H_k,_m(t)＝|h_k,m(t)|²Representing the instantaneous channel gain on subcarrier m between wireless access device k and the base station at time t. Here, a rayleigh fading channel model commonly used in a radio access network is adopted, and in order to characterize the dynamic characteristics of a channel, a channel coefficient is expressed as a first-order complex gaussian markov process according to a Jakes fading model:

wherein h is_k,m(t) and channel update procedure e_k,mAnd (t) are all independent and identically distributed unit variance circularly symmetric complex Gaussian random variables. Correlation coefficient ρ ═ J₀(2πf_dT), wherein J₀(.) is a zero-order Bessel function, f_dIs the maximum doppler frequency.

Since multiplexing of subcarriers is allowed here, the base station will receive data from multiple sites on one OFDMA resource blockThe signals of the wireless access devices, for which the signals of one wireless access device and the signals of the other wireless access devices are to be regarded as noise, will also depend on the signal to interference and noise ratio (SINR). Given channel state information H (t) and transmit power

In this case, the achievable data rate of the base station receiving the wireless access device k on the subcarrier m can be represented as:

wherein Γ is the SINR gap due to multiplexing methods such as signal modulation.

2) Carrying out modeling analysis on queue dynamics of the communication wireless access equipment;

in a wireless access network, one of the most intuitive feelings of a wireless access device user to a communication service is in the communication delay, the user requirement is reflected in the communication bottom layer and is the size of a data packet, and the high-quality communication service means that low-delay transmission can be realized and communication resources can be efficiently utilized no matter how the user requirement is. The continuous improvement of the communication rate is finally to meet the requirement of large data volume transmission of users more quickly; if a user needs a small amount of data, power and communication rate are reduced to save power consumption while reducing interference to other users. Therefore, the consideration of the performance index of time delay is added to model and analyze the dynamic information of the data packet queue.

Assuming that the wireless access equipment transmits data packets to enter a sequence to be transmitted randomly in a Poisson distribution process, the average arrival rate of the data packets of the wireless access equipment k is set as lambda_k，I(t)＝(I₁(t),…,I_K(t)) if the amount of packet information arriving at the wireless access device at time t is set to be large, then

Mathematical expectation Ε [ I_k(t)]＝λ_k. Is provided with L_k(t) E [0, ∞) is the length of the sequence to be transmitted by the wireless access device k at time t, and L (t) ═ L₁(t),…,L_K(t))∈[0,∞)^KIs the global sequence state information (QSI). For wireless access device k, its queue dynamics can be expressed as:

after the system environment and state models (i.e., CSI and QSI) are built in step S101, the power control strategy and optimization objective models are designed in step S102, which includes:

1) establishing a power control strategy model

Because the wireless channel environment and the queue dynamics of the wireless access equipment both have Markov properties, and a distributed control strategy is adopted in the process, each wireless access equipment makes an autonomous decision according to partial state information observed by the wireless access equipment, the process of the dynamic decision is modeled into a multi-individual Markov decision process, namely a partially observed Markov game.

Specifically, let S ═ H, L be the global state, and the action set of the radio access device k be

o_kIs an observation set of the wireless access device k, it is assumed here that the wireless access device can observe the self-channel state information H_k,m(t) and queue status information L_k(t) of (d). The wireless access equipment k selects the action according to the random strategy: a is_k(t)～π_k(a_k(t)|o_k(t)), then entering the next state according to the state transition function: s (t +1) -P (S (t +1) | S (t), a)₁(t),…,a_K(t)). Each wireless access device will get a corresponding reward function r_k(t)＝r(S(t),a_k(t), S (t +1)), and obtains the observed quantity o of the new state by itself_k(t + 1). Each wireless access device strives to maximize its own long-term return

Where γ is the discount coefficient and T is the time range.

2) Determining an optimization target model of the power control strategy according to the average uplink transmission power consumption and the average uplink communication time delay of the wireless access equipment under the power control strategy;

from the above modeling we can further set up specific goals and problems faced. First of all, the object of the invention is to reduce the communication power consumption of a wireless access device in a control strategy of pi_kThe average uplink transmission power consumption of the wireless access device k can be expressed as

In addition, the communication time delay of the wireless access equipment is also reduced, and the control strategy is pi_kNext, according to the litter's law, the average uplink communication delay of the wireless access device k can be expressed as

Where T is the time range. According to the mathematical expression and the established low power consumption and low time delay target, the problem of controlling the multi-user intelligent transmitting power in the wireless access network is established as follows:

the problem objective is to minimize the weighted power consumption and the time delay, α_kAnd beta_kRespectively, positive weights corresponding to the power consumption and the time delay of the wireless access device. According to this goal, defining a reward per wireless access device as

Coordination must be formed between wireless access devices to achieve such a cliqueTeam type objects.

Specifically, in step S103, a multi-agent deep reinforcement learning method is applied to obtain an optimal power control strategy for each wireless access device;

the multi-agent deep reinforcement learning technology applied in this embodiment is specifically a multi-agent proximity strategy optimization method, and the overall framework thereof is centralized training and distributed execution, as shown in fig. 2, and an optimal power control strategy is obtained by performing multi-agent deep reinforcement learning based on an actor-decider algorithm.

In order to obtain the optimal power control strategy, the strategy evaluation and the strategy improvement need to be continuously iterated. In a Markov game with multiple agents, the value of a strategy is determined by the global state values and the actions of each agent, so that the strategy is pi_kA centralized evaluation is performed. To reduce the evaluation variance, a general dominance function evaluation strategy is used here, specifically, a centralized cost function defining the strategy adopted by agent k is V^πk(S(t))＝E[R_k|S(t)]The action-value function is

The merit function may be expressed as

In reality, the exact value of the merit function cannot be obtained, and the merit function needs to be estimated by adopting a deep neural network, and the parameters of the merit function network are set as phi ═ phi { (phi ═ phi {)₁,…,φ_KThen the estimate of the merit function can be written as:

wherein, gamma, lambda belongs to [0,1 ]]Discounting factors, δ, for weighing the estimation bias and variance_k(t+n)＝r_k(t+n)+γV_k(S(t+n+1)；φ_k)-V_k(S(t+n)；φ_k) N is a time parameter, representing the point in time to which the policy operates,expansion (8) to obtain:

network parameter phi ═ phi { [ phi ]₁,…,φ_KGet by minimizing the loss function:

the evaluation process of the merit function described above is implemented at a central node (e.g., a wireless access point such as a base station).

Distributed policy improvement can be performed by providing the advantage function required by the evaluation policy and transmitting the advantage function value back to each wireless access device, and the basic idea of the improvement is to adjust the policy parameter theta to be { theta ═ theta₁,…,θ_KTo maximize the objective function J (θ)_k)＝E[R_k]In order to improve the training stability and prevent overlarge change in the strategy training process, the adjacent gradient optimization algorithm changes the objective function into:

wherein the likelihood ratio between old and new policies

clip(l_k(t；θ_k) 1- ε,1+ ε) is_k(t；θ_k) Clipping at 1-epsilon, 1+ epsilon]The interval, ε is the error. The strategy improvement only needs part of the observed value of each wireless access device, so that the strategy improvement can be carried out on the wireless access devices.

More specifically, the present embodiment is a multi-agent proximity policy optimization method implemented based on an actor-decider in a communication system, as shown in fig. 3; the method specifically comprises the following steps:

step S301, operating each wireless access device in each iteration of iteration within time length TPower control strategy of

The central node collects the action, state and reward of each wireless access device to obtain { S (t) }, a₁(t),…,a_K(t), r (t); wherein the initial power control strategy is a random strategy;

the central node is a base station or other wireless access equipment serving as the central node;

the advantage function for calculating the estimated advantage value of the wireless access device is formula (9);

step S303, traversing all wireless access devices, wherein each wireless access device acquires channel state information in the reward and observation values of the wireless access device from a central node, acquires queue state information from the wireless access device, and combines the queue state information to obtain a final observation value of the wireless access device;

step S304, according to the final observation value, each wireless access device locally updates the corresponding strategy parameter theta by using a gradient descent method;

wherein the gradient descent method used locally by each wireless access device is performed according to the objective function of formula (11);

s305, updating the advantage function network parameter phi corresponding to each wireless access device by using a gradient descent method at the central node;

wherein the gradient descent method used by the central node is performed according to the minimum loss function of formula (10);

after iteration is carried out for the maximum round times, the algorithm is converged, the training process is finished, and the trained strategy network is output.

Specifically, in step S104, when each wireless access device performs intelligent transmission power control according to the trained policy network,

and each wireless access device selects the optimal transmitting power to access the wireless communication network according to the trained strategy network and the respective pi (a (t) o (t)) in the complex change environment. At this time, centralized training is not performed any more, and intelligent decision making is performed in a fully distributed manner.

As shown in fig. 4, this embodiment further provides a pseudo code example of the whole multi-agent proximity policy optimization algorithm, and the optimization of the power control policy of the wireless access device for network access is implemented by using the two-layer nesting of for statements.

In summary, the method for controlling multi-user intelligent transmission power in a radio access network of the present embodiment reduces the time delay and power consumption of the entire uplink communication system, provides high-quality communication service using limited resources, and has good realizability and expandability due to low complexity and distributed decision.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A multi-user intelligent transmission power control method in a wireless access network is characterized by comprising the following steps:

2. The transmission power control method according to claim 1, wherein each of the wireless access devices accessing the network performs uplink communication with a single base station in an OFDMA access manner, and the number of allocable subcarriers of the OFDMA is smaller than the number of the wireless access devices; the OFDMA is non-orthogonal multiplexing of carriers, and carries information of more than one radio access device on the same subcarrier.

3. The transmission power control method of claim 2, wherein in the non-orthogonal multiplexing, the achievable data rate of the base station receiving the wireless access device k on the subcarrier m is as follows:

4. The transmission power control method of claim 2, wherein the queue of the wireless access device k on the subcarrier m is dynamic:

I_k(t) is the length of the sequence to be transmitted of the wireless access equipment k at the moment t; c_k,m(t) is the achievable data rate for the base station to receive wireless access device k on subcarrier M, where M is the number of subcarriers.

5. The transmission power control method of claim 1, wherein in step S2, the method is based on a mareThe decision making process is that the wireless access equipment k controls the strategy pi according to the corresponding power_kSelection action a_k(ii) a Entering a next state S (t +1) according to the current state S (t) of the wireless access equipment and the actions of all the wireless access equipment; and, each wireless access device gets a corresponding reward function r during state transition_k(t)＝r(S(t),a_k(t), S (t +1)), and obtains the observed quantity o of the new state by itself_k(t + 1); in the power control strategy, each wireless access device strives to maximize its own long-term return as

Where γ is the discount factor and T is the length of time.

6. The method of claim 5, wherein the optimization objective model of the power control strategy establishes the transmission power control problem of the multiple radio access devices in the radio access network according to the low power consumption and low delay objective as follows:

7. The method of claim 1, wherein the process of training the power control strategy using a multi-agent deep reinforcement learning method comprises:

8. The transmission power control method according to claim 7,

in step S302, the merit function for calculating the estimated merit value of the wireless access device is:

9. The transmission power control method of claim 8, wherein in step S305, the central node updates the minimum loss function of the dominance function network parameter corresponding to each radio access device by using a gradient descent method;

10. the method of claim 9, wherein in step S306, the objective function of each wireless access device locally updating the corresponding policy parameter using a gradient descent method is as follows:

is an estimate of the merit function.