CN114978278B - Multi-beam giant-constellation satellite frequency and power joint distribution method and device

Info

Publication number: CN114978278B
Authority
CN
China
Prior art keywords
network
neural network
layer
satellite
deep neural
Prior art date
Legal status: Active
Application number: CN202210468483.2A
Other languages: Chinese (zh)
Other versions: CN114978278A
Inventors: Zhang Haijun, Song Wei, Liu Xiangnan, Qiu Yu, Sheng Min, Li Wei, Long Keping
Current Assignee: University of Science and Technology Beijing (USTB)
Original Assignee: University of Science and Technology Beijing (USTB)
Application filed by University of Science and Technology Beijing (USTB)
Priority to CN202210468483.2A
Publication of CN114978278A
Application granted
Publication of CN114978278B
Status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04B: TRANSMISSION
    • H04B 7/00: Radio transmission systems, i.e. using radiation field
    • H04B 7/14: Relay systems
    • H04B 7/15: Active relay systems
    • H04B 7/185: Space-based or airborne stations; stations for satellite systems
    • H04B 7/1851: Systems using a satellite or space-based relay
    • H04B 7/18519: Operations control, administration or maintenance
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 52/00: Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W 52/04: TPC
    • H04W 52/30: TPC using constraints in the total amount of available transmission power
    • H04W 52/34: TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W 72/00: Local resource management
    • H04W 72/04: Wireless resource allocation
    • H04W 72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0453: Resources in frequency domain, e.g. a carrier in FDMA
    • H04W 72/046: Wireless resource allocation based on the type of the allocated resource, the resource being in the space domain, e.g. beams
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention provides a method and device for joint frequency and power allocation in a multi-beam mega-constellation satellite system, in the technical field of wireless communication. The method comprises the following steps: establishing a multi-beam network model of a mega-constellation satellite; formulating a satellite network resource allocation problem that maximizes frequency band utilization, establishing a Markov decision model, and posing a resource optimization problem for the spot beams; constructing a deep neural network model from the spot-beam resource optimization problem; initializing the deep neural network parameters, the current value function, and the current network Q value; and training the deep neural network model offline with a multi-agent deep reinforcement learning method until the reward converges to a stable value, then jointly allocating frequency and power in the satellite multi-beam network based on the training result. The method treats the resource allocation optimization of each satellite spot beam as a Markov decision process and allocates resources online based on the offline training result, so that the satellite multi-beam network converges more stably and quickly.

Description

Multi-beam giant-constellation satellite frequency and power joint distribution method and device
Technical Field
The present invention relates to the field of wireless communication technologies, and in particular to a method and an apparatus for jointly allocating frequency and power of a multi-beam mega-constellation satellite.
Background
The space-ground integrated information network, of which satellite constellation systems are a key component, is important network infrastructure supporting economic development and national defense informatization, and has become a key direction of China's new-infrastructure development. To meet growing service demands such as high-speed connectivity for massive users anywhere in the world and low-latency services for very remote users, a space-ground integrated, large-scale, ultra-dense constellation system must be designed and operated efficiently and reliably. The low-orbit satellite payloads of such a constellation must be highly flexible, using multi-beam transmit antennas to deliver high-quality, low-cost services. To meet the ever-increasing demand for communication capacity, a mega-constellation multi-beam satellite system must overcome its main bottleneck, limited bandwidth, and usually does so through frequency reuse on the user link, i.e., letting different spot beams reuse the same subchannel multiple times. However, co-channel beams interfere with one another, and in a mega-constellation scenario with densely distributed spot beams this co-channel interference can be very severe, greatly reducing the band utilization of the multi-beam satellite system. For such a typical resource-constrained system, allocating on-board resources reasonably and efficiently so as to reduce co-channel interference is the key to improving the band utilization of the communication system.
Allocating subchannel resources and spot-beam power on low-orbit satellites is an important means of mitigating co-channel interference in a large-scale constellation system. Because service scenarios in a low-orbit multi-beam satellite keep changing and grow ever more diverse and complex, a highly flexible resource allocation algorithm must be introduced. Deep reinforcement learning is a reliable approach to such highly dynamic problems and is widely used for resource allocation in wireless communication. However, the classical deep Q-learning and DDPG algorithms cannot handle scenarios containing both discrete and continuous variables, and therefore cannot perform discrete subchannel allocation and continuous spot-beam power allocation at the same time. Moreover, with a conventional single-agent method the dimensionality of the action space grows exponentially with the number of spot beams, so the communication complexity and computation cost suffer from the same problems as traditional centralized optimization. A multi-agent algorithm instead treats the remaining agents as part of the environment to interact with, rather than controlling the whole action space with one agent, and therefore scales more easily to a mega-constellation network system with a large number of spot beams.
In summary, in the prior art, the dense deployment of spot beams in a mega-constellation satellite system causes severe co-channel inter-cell interference and low utilization of system frequency band resources.
Disclosure of Invention
The invention provides a multi-beam mega-constellation satellite frequency and power joint allocation method and device, aiming to solve the prior-art problems that dense spot-beam deployment in a mega-constellation satellite system causes severe co-channel inter-cell interference and low utilization of system frequency band resources.
In order to solve the technical problems, the invention provides the following technical scheme:

In one aspect, a multi-beam mega-constellation satellite frequency and power joint allocation method is provided, comprising the following steps:

S1: establishing a multi-beam network model of a mega-constellation satellite;

S2: formulating a satellite network resource allocation problem that maximizes frequency band utilization, and establishing a Markov decision model;

S3: initializing the deep neural network parameters, and initializing a current value function and a current network Q value;

S4: training the neural network model offline by a multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result.
Optionally, k satellites are set in a constellation system, each satellite uses n beams, and the whole system has m subchannels in total; the channel gain is denoted $h^m_{k\to n}(t)$, the gain from satellite k to spot beam n on the m-th subchannel.
Optionally, in step S2, formulating the satellite network resource allocation problem that maximizes frequency band utilization and establishing the Markov decision model includes:

S21: characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the band utilization of spot beam n on subchannel m in time slot t;

S22: defining, respectively, the action space and state space of the deep Q network of the first-layer subchannel allocation layer and of the actor-critic network of the second-layer power allocation layer, and establishing the Markov decision model;

S23: establishing, according to the Markov decision model, the resource optimization problem of the spot beams and the objective function.
Optionally, in step S21, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t in combination with the interference generated between co-frequency beams, and calculating the band utilization of spot beam n on subchannel m in time slot t, includes:

characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio $\gamma^m_n(t)$ of spot beam n at time slot t according to the following formula (1):

$$\gamma^m_n(t)=\frac{c^m_n(t)\,p_n(t)\,G_{\max}\,h^m_{k_n\to n}(t)}{\sum_{l\in N,\ l\neq n} c^m_l(t)\,p_l(t)\,G(\theta_{l,n})\,h^m_{k_l\to n}(t)+\delta^2}\qquad(1)$$

where $c^m_n(t)$ represents the channel selection, 1 meaning subchannel m is selected and 0 meaning it is not; $k_n$ denotes the serving satellite selected by spot beam n; $G(\theta)$ denotes the antenna pattern; $G_{\max}$ is the maximum antenna gain; $\theta_{l,n}$ denotes the angular separation between spot beams l and n as seen from satellite $k_l$; and $\delta^2$ denotes the receiver noise at the user of spot beam n;

expressing the band utilization $C^m_n(t)$ of spot beam n on subchannel m in time slot t according to the following formula (2):

$$C^m_n(t)=\log_2\!\left(1+\gamma^m_n(t)\right)\qquad(2)$$
optionally, in S22, defining an action space and a state space of the deep Q network of the first layer of subchannel allocation layer and an action-critical network of the second layer of power allocation layer, respectively, and establishing a markov decision model, including:
s221: defining action space and state space of a deep Q network of a first layer of sub-channel distribution layer, initializing a current value function and a current network Q value, and performing distributed execution part of the system, wherein one spot beam is selected to represent an agent, and the state vector of the agent corresponding to the spot beam n in a first layer of time slot t is
Figure BDA0003625555270000035
Q-target network function of
Figure BDA0003625555270000036
Wherein psi is the depth Q network super parameter, the action which obtains the maximum Q value is selected>
Figure BDA0003625555270000037
Represents that the selected channel at time t is pick>
Figure BDA0003625555270000038
/>
S222: defining action space and state space of operator-critical network of second layer power distribution layer, policy network
Figure BDA0003625555270000039
Get an action>
Figure BDA00036255552700000310
The corresponding spot beam transmit power is pick-up>
Figure BDA00036255552700000311
Figure BDA00036255552700000312
Using the criticic network of the agent corresponding to the spot beam n to output the value function q (s, mu (s, theta); phi) of the current power selection action;
wherein theta is an actor network hyper-parameter, P max Achievable for spot beamsAnd the maximum power phi is critical network super parameter.
Optionally, in step S23, establishing the resource optimization problem of the spot beams according to the Markov decision model and establishing the objective function includes:

S231: establishing the resource optimization problem of the spot beams, with the objective function of the following formula (3):

$$\max_{c,\,p}\ \sum_{n\in N}\sum_{m\in M} C^m_n(t)\qquad(3)$$

$$\text{s.t.}\quad C1:\ 0<p_n(t)\le P_{\max},\ \forall n\in N;\quad C2:\ c^m_n(t)\in\{0,1\};\quad C3:\ \sum_{m\in M}c^m_n(t)=1,\ \forall n\in N;\quad C4:\ \sum_{n\in N}c^m_n(t)\le T_{reuse},\ \forall m\in M$$

where $T_{reuse}$ denotes the maximum number of times a single subchannel may be reused; N denotes the spot-beam set and M the subchannel set. Constraint C1 states that the maximum transmit power of each spot beam n is $P_{\max}$; constraint C2 that at time slot t each subchannel has only two states, selected or not selected; constraint C3 that each spot beam n must select one subchannel as its transmission channel at time slot t; and constraint C4 that each subchannel m can be reused at most $T_{reuse}$ times at time slot t, i.e., at most $T_{reuse}$ spot beams may share subchannel m in time slot t;

S232: the first-layer subchannel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward, and accumulate the reward values over the loop; the subchannel allocation layer and the power allocation layer share the same reward function $r_n(t)$, composed of the band utilization of the agent's own spot beam in time slot t and a penalty term for the interference caused to the other co-frequency beams.
Optionally, in step S3, initializing the deep neural network parameters and initializing the current value function and current network Q value includes:

the deep Q network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the deep Q neural network, together with the deep Q (eval) neural network hyperparameter $\psi$ and the deep Q-target neural network hyperparameter $\psi_{target}$;

the actor network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the actor-eval and actor-target deep neural networks, together with the actor-eval deep neural network hyperparameter $\theta$ and the actor-target deep neural network hyperparameter $\theta_{target}$;

the critic network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the critic-eval and critic-target deep neural networks, together with the critic-eval deep neural network hyperparameter $\phi$ and the critic-target deep neural network hyperparameter $\phi_{target}$.
Optionally, in step S4, training the neural network model offline by the multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result, includes:

obtaining the instant reward through the Bellman equation and running the deep learning algorithm, which comprises a distributed execution part and a centralized training part; the distributed execution part selects new actions according to the Q value, generates new state values by interacting with the environment, and stores them in the top-layer memory pool $D_{subband}$ and the bottom-layer memory pool $D_{power}$; the centralized training part keeps updating the parameters according to the instant reward until the reward value converges, whereupon training and execution terminate, the optimal resource allocation scheme is obtained, and learning stops.
In one aspect, a multi-beam mega-constellation satellite frequency and power joint distribution apparatus is provided for use in an electronic device and adapted to the method of any one of the above, the apparatus comprising:
the network model establishing module is used for establishing a multi-beam network model of the giant constellation satellite;
the distribution problem establishing module is used for establishing a satellite network resource distribution problem with the maximized frequency band utilization rate and establishing a Markov decision model;
the parameter initialization module is used for initializing the deep neural network parameters and initializing a current value function and a current network Q value;
and the joint distribution module is used for training the neural network model offline through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing joint allocation of subchannels and power on the satellite multi-beam network based on the training result.
Optionally, the network model establishing module is configured to set k satellites in a constellation system, each satellite using n beams and the whole system having m subchannels in total, with the channel gain denoted $h^m_{k\to n}(t)$, the gain from satellite k to spot beam n on the m-th subchannel.
In one aspect, an electronic device is provided, comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the above multi-beam mega-constellation satellite frequency and power joint allocation method.

In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the above multi-beam mega-constellation satellite frequency and power joint allocation method.
The technical scheme of the embodiment of the invention has at least the following beneficial effects:

In the above solution, the invention provides multi-beam mega-constellation satellite frequency and power joint allocation for large-scale ultra-dense constellation networking. The method consists of a two-layer architecture, a subchannel allocation layer and a spot-beam power allocation layer: in the subchannel allocation layer, because frequency band allocation is discrete, a deep Q-learning algorithm selects the subchannel of each spot beam; in the power allocation layer, the DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers carry out centralized training and distributed execution together, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system band utilization. The resource allocation optimization of each satellite spot beam is treated as a Markov decision process, and resources are allocated online to the satellite multi-beam network based on the offline training result, so that it converges more stably and quickly.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a multi-beam satellite frequency and power joint allocation according to an embodiment of the present invention;
fig. 2 is a flow chart of a multi-beam satellite frequency and power joint assignment according to an embodiment of the present invention;
fig. 3 is a flowchart of a deep reinforcement learning algorithm for jointly allocating multi-beam satellite frequencies and powers according to an embodiment of the present invention;
fig. 4 is a block diagram of a multi-beam mega-constellation satellite frequency and power joint distribution apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a multi-beam mega-constellation satellite frequency and power joint allocation method, which can be implemented by an electronic device; the electronic device may be a terminal or a server. As shown in fig. 1, the processing flow of the method may include the following steps:

S101: establishing a multi-beam network model of a mega-constellation satellite;

S102: formulating a satellite network resource allocation problem that maximizes frequency band utilization, and establishing a Markov decision model;

S103: initializing the deep neural network parameters, and initializing a current value function and a current network Q value;

S104: training the neural network model offline by a multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result.
Optionally, k satellites are set in a constellation system, each satellite uses n beams, and the whole system has m subchannels in total; the channel gain is denoted $h^m_{k\to n}(t)$, the gain from satellite k to spot beam n on the m-th subchannel.
Optionally, in step S102, formulating the satellite network resource allocation problem that maximizes frequency band utilization and establishing the Markov decision model includes:

S121: characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the band utilization of spot beam n on subchannel m in time slot t;

S122: defining, respectively, the action space and state space of the deep Q network of the first-layer subchannel allocation layer and of the actor-critic network of the second-layer power allocation layer, and establishing the Markov decision model;

S123: establishing, according to the Markov decision model, the resource optimization problem of the spot beams and the objective function.
Optionally, in step S121, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t in combination with the interference generated between co-frequency beams, and calculating the band utilization of spot beam n on subchannel m in time slot t, includes:

characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio $\gamma^m_n(t)$ of spot beam n at time slot t according to the following formula (1):

$$\gamma^m_n(t)=\frac{c^m_n(t)\,p_n(t)\,G_{\max}\,h^m_{k_n\to n}(t)}{\sum_{l\in N,\ l\neq n} c^m_l(t)\,p_l(t)\,G(\theta_{l,n})\,h^m_{k_l\to n}(t)+\delta^2}\qquad(1)$$

where $c^m_n(t)$ represents the channel selection, 1 meaning subchannel m is selected and 0 meaning it is not; $k_n$ denotes the serving satellite selected by spot beam n; $G(\theta)$ denotes the antenna pattern; $G_{\max}$ is the maximum antenna gain; $\theta_{l,n}$ denotes the angular separation between spot beams l and n as seen from satellite $k_l$; and $\delta^2$ denotes the receiver noise at the user of spot beam n;

expressing the band utilization $C^m_n(t)$ of spot beam n on subchannel m in time slot t according to the following formula (2):

$$C^m_n(t)=\log_2\!\left(1+\gamma^m_n(t)\right)\qquad(2)$$
optionally, in S122, an action space and a state space of the deep Q network of the first layer of subchannel allocation layer and an action-critical network of the second layer of power allocation layer are respectively defined, and the establishing the markov decision model includes:
s1221: defining action space and state space of a deep Q network of a first layer of sub-channel distribution layer, initializing a current value function and a current network Q value, and performing distributed execution part of the system, wherein one spot beam is selected to represent an agent, and the state vector of the agent corresponding to the spot beam n in a first layer of time slot t is
Figure BDA0003625555270000092
Q-target network function of
Figure BDA0003625555270000093
Wherein psi is the depth Q network super parameter, the action which obtains the maximum Q value is selected>
Figure BDA0003625555270000094
Represents that the selected channel at time t is pick>
Figure BDA0003625555270000095
S1222: defining action space and state space of operator-critical network of second layer power distribution layer, policy network
Figure BDA0003625555270000096
Get the action->
Figure BDA0003625555270000097
Then the corresponding spot beam transmit power is &>
Figure BDA0003625555270000098
Figure BDA0003625555270000099
Using the critic network of the agent corresponding to the spot beam n to output the value function q (s, mu (s, theta); phi) of the current power selection action;
wherein theta is an actor network hyper-parameter, P max Phi is the critical network super-parameter for the maximum power that the spot beam can reach.
Optionally, in step S123, establishing the resource optimization problem of the spot beams according to the Markov decision model and establishing the objective function includes:

S1231: establishing the resource optimization problem of the spot beams, with the objective function of the following formula (3):

$$\max_{c,\,p}\ \sum_{n\in N}\sum_{m\in M} C^m_n(t)\qquad(3)$$

$$\text{s.t.}\quad C1:\ 0<p_n(t)\le P_{\max},\ \forall n\in N;\quad C2:\ c^m_n(t)\in\{0,1\};\quad C3:\ \sum_{m\in M}c^m_n(t)=1,\ \forall n\in N;\quad C4:\ \sum_{n\in N}c^m_n(t)\le T_{reuse},\ \forall m\in M$$

where $T_{reuse}$ denotes the maximum number of times a single subchannel may be reused; N denotes the spot-beam set and M the subchannel set. Constraint C1 states that the maximum transmit power of each spot beam n is $P_{\max}$; constraint C2 that at time slot t each subchannel has only two states, selected or not selected; constraint C3 that each spot beam n must select one subchannel as its transmission channel at time slot t; and constraint C4 that each subchannel m can be reused at most $T_{reuse}$ times at time slot t, i.e., at most $T_{reuse}$ spot beams may share subchannel m in time slot t;

S1232: the first-layer subchannel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward, and accumulate the reward values over the loop; the subchannel allocation layer and the power allocation layer share the same reward function $r_n(t)$, composed of the band utilization of the agent's own spot beam in time slot t and a penalty term for the interference caused to the other co-frequency beams.
Optionally, in step S103, initializing the deep neural network parameters and initializing the current value function and current network Q value includes:

the deep Q network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the deep Q neural network, together with the deep Q (eval) neural network hyperparameter $\psi$ and the deep Q-target neural network hyperparameter $\psi_{target}$;

the actor network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the actor-eval and actor-target deep neural networks, together with the actor-eval deep neural network hyperparameter $\theta$ and the actor-target deep neural network hyperparameter $\theta_{target}$;

the critic network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the critic-eval and critic-target deep neural networks, together with the critic-eval deep neural network hyperparameter $\phi$ and the critic-target deep neural network hyperparameter $\phi_{target}$.
Optionally, in step S104, training the neural network model offline by the multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result, includes:

obtaining the instant reward through the Bellman equation and running the deep learning algorithm, which comprises a distributed execution part and a centralized training part; the distributed execution part selects new actions according to the Q value, generates new state values by interacting with the environment, and stores them in the top-layer memory pool $D_{subband}$ and the bottom-layer memory pool $D_{power}$; the centralized training part keeps updating the parameters according to the instant reward until the reward value converges, whereupon training and execution terminate, the optimal resource allocation scheme is obtained, and learning stops.
In the embodiment of the invention, multi-beam mega-constellation satellite frequency and power joint allocation is provided for large-scale ultra-dense constellation networking. The method consists of a two-layer architecture, a subchannel allocation layer and a spot-beam power allocation layer: in the subchannel allocation layer, because frequency band allocation is discrete, a deep Q-learning algorithm selects the subchannel of each spot beam; in the power allocation layer, the DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers carry out centralized training and distributed execution together, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system band utilization. The resource allocation optimization of each satellite spot beam is treated as a Markov decision process, and resources are allocated online to the satellite multi-beam network based on the offline training result, so that it converges more stably and quickly.
The embodiment of the invention provides a multi-beam mega-constellation satellite frequency and power joint allocation method, which can be implemented by an electronic device; the electronic device may be a terminal or a server. As shown in fig. 2, the processing flow of the method may include the following steps:

S201: establishing a multi-beam network model of the mega-constellation satellite.
In a possible embodiment, the invention first constructs a mega-constellation network scenario with densely deployed spot beams. K satellites are set in the constellation system, with the satellite set denoted $\mathcal{K}=\{1,2,3,\dots,K\}$; each satellite uses N spot beams, with the spot-beam set denoted $\mathcal{N}=\{1,2,3,\dots,N\}$; the whole system has M subchannels in total, with the subchannel set denoted $\mathcal{M}=\{1,2,3,\dots,M\}$; typically $N>M$. L is the number of times each channel can be reused. The channel gain from satellite k to spot beam n on the m-th subchannel is modeled as

$$h^m_{k\to n}(t)=\beta_{k\to n}\,g^m_{k\to n}(t),\qquad t=1,2,3,\dots$$

Here $\beta_{k\to n}$ denotes the large-scale fading from satellite k to ground spot beam n, which by the Friis formula is

$$\beta_{k\to n}=\frac{G_t\,G_r\,\lambda^2}{(4\pi d)^2\,L},$$

where $G_t$ denotes the transmit gain, $G_r$ the receive gain, $L$ a system loss factor independent of the propagation environment, $d$ the antenna-to-ground-terminal distance, and $\lambda$ the carrier wavelength; $g^m_{k\to n}(t)$ represents small-scale Rayleigh fading.
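By way of illustration, this channel model can be sketched in Python as follows; it is a minimal sketch, and the carrier frequency, antenna gains, and slant range are assumed example values, not parameters taken from the patent.

```python
import numpy as np

def friis_large_scale_gain(g_t, g_r, wavelength, d, loss=1.0):
    """Large-scale fading beta_{k->n} from the Friis transmission equation."""
    return (g_t * g_r * wavelength ** 2) / ((4.0 * np.pi * d) ** 2 * loss)

def channel_gain(g_t, g_r, wavelength, d, rng, loss=1.0):
    """Channel gain h^m_{k->n}(t): Friis large-scale term times a
    small-scale Rayleigh fading power term with unit mean power."""
    beta = friis_large_scale_gain(g_t, g_r, wavelength, d, loss)
    amplitude = rng.rayleigh(scale=np.sqrt(0.5))  # E[amplitude^2] = 1
    return beta * amplitude ** 2

rng = np.random.default_rng(0)
wavelength = 3e8 / 20e9                    # assumed 20 GHz Ka-band carrier
h = channel_gain(g_t=10 ** (30 / 10),      # assumed 30 dBi transmit gain
                 g_r=10 ** (35 / 10),      # assumed 35 dBi receive gain
                 wavelength=wavelength,
                 d=1200e3,                 # assumed 1200 km slant range
                 rng=rng)
print(f"h = {h:.3e}")
```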
S202: characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the band utilization of spot beam n on subchannel m in time slot t.

In one possible implementation, the carrier-to-interference-plus-noise ratio $\gamma^m_n(t)$ of spot beam n at time slot t is characterized, in combination with the interference generated between co-frequency beams, according to the following formula (1):

$$\gamma^m_n(t)=\frac{c^m_n(t)\,p_n(t)\,G_{\max}\,h^m_{k_n\to n}(t)}{\sum_{l\in N,\ l\neq n} c^m_l(t)\,p_l(t)\,G(\theta_{l,n})\,h^m_{k_l\to n}(t)+\delta^2}\qquad(1)$$

where $c^m_n(t)$ represents the channel selection, 1 meaning subchannel m is selected and 0 meaning it is not; $k_n\in\mathcal{K}$ denotes the serving satellite selected by spot beam n; $G(\theta)$ denotes the antenna pattern; $G_{\max}$ is the maximum antenna gain; $\theta_{l,n}$ denotes the angular separation between spot beams l and n as seen from satellite $k_l$; and $\delta^2$ denotes the receiver noise at spot beam n.

The band utilization $C^m_n(t)$ of spot beam n on subchannel m in time slot t is expressed according to the following formula (2):

$$C^m_n(t)=\log_2\!\left(1+\gamma^m_n(t)\right)\qquad(2)$$
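A minimal Python sketch of formulas (1) and (2) follows; the array shapes, names, and the random numbers in the demo are illustrative assumptions only.

```python
import numpy as np

def band_utilization(n, m, c, p, h_serv, h_intf, g_pattern, g_max, noise):
    """Spectral efficiency C^m_n(t) = log2(1 + CINR) of spot beam n on
    subchannel m, per formulas (1) and (2).

    c        : (N, M) 0/1 subchannel selections
    p        : (N,)   transmit powers
    h_serv   : (N, M) gain from each beam's serving satellite to its user
    h_intf   : (N, N, M) gain from beam l's satellite to beam n's user
    g_pattern: (N, N) antenna pattern G(theta_{l,n}) toward beam n
    """
    signal = c[n, m] * p[n] * g_max * h_serv[n, m]
    interferers = [l for l in range(c.shape[0]) if l != n and c[l, m] == 1]
    interference = sum(p[l] * g_pattern[l, n] * h_intf[l, n, m]
                       for l in interferers)
    cinr = signal / (interference + noise)
    return np.log2(1.0 + cinr)

rng = np.random.default_rng(1)
N, M = 3, 2
c = np.array([[1, 0], [1, 0], [0, 1]])  # beams 0 and 1 share subchannel 0
print(band_utilization(n=0, m=0, c=c,
                       p=rng.uniform(1, 5, N),
                       h_serv=rng.uniform(1e-12, 1e-10, (N, M)),
                       h_intf=rng.uniform(1e-13, 1e-11, (N, N, M)),
                       g_pattern=rng.uniform(0.01, 0.1, (N, N)),
                       g_max=1.0, noise=1e-13))
```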
s203: and respectively defining the action space and the state space of a deep Q network of a first layer of sub-channel distribution layer and the action space and the state space of an operator-critical network of a second layer of power distribution layer, and establishing a Markov decision model.
In a possible implementation manner, an action space and a state space of a deep Q network of a first layer of sub-channel allocation layer are defined, a current cost function and a current network Q value are initialized, and a distributed execution part of the system is performed, wherein one spot beam is selected to represent an agent, and a state vector of the agent corresponding to the spot beam n in a first layer of time slot t is
Figure BDA0003625555270000131
The Q-target network function is ^ greater than>
Figure BDA0003625555270000132
Wherein psi is the depth Q network super parameter, and the action with the maximum Q value is selected
Figure BDA0003625555270000133
Represents at tThe selected channel is->
Figure BDA0003625555270000134
Defining action space and state space of operator-critical network of second layer power distribution layer, policy network
Figure BDA0003625555270000135
Get the action->
Figure BDA0003625555270000136
The corresponding spot beam transmit power is pick-up>
Figure BDA0003625555270000137
Using the criticic network of the agent corresponding to the spot beam n to output the value function q (s, mu (s, theta); phi) of the current power selection action;
where θ is an actor network hyper-parameter, P max Phi is the critical network super-parameter for the maximum power that the spot beam can reach.
In one possible implementation, the subchannel allocation of spot beam n in time slot t is the vector $c_n(t)=[c^1_n(t),\dots,c^M_n(t)]$; the subchannel allocation of all spot beams within time slot t is then $C(t)=[c_1(t),\dots,c_N(t)]$; and the power allocation of the spot beams in time slot t is $P(t)=[p_1(t),\dots,p_N(t)]$.
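For illustration, these allocation variables can be held as plain arrays, as in the following sketch; the beam and channel counts are assumed toy values.

```python
import numpy as np

N_BEAMS, M_CHANNELS = 4, 2    # assumed toy sizes (N > M, as in the model)

# c[n, m] = 1 iff spot beam n uses subchannel m in the current slot;
# constraint C3 requires exactly one 1 per row.
c = np.zeros((N_BEAMS, M_CHANNELS), dtype=int)
c[np.arange(N_BEAMS), [0, 1, 0, 1]] = 1

# p[n] is the transmit power of spot beam n, bounded by P_max (C1).
P_MAX = 10.0                  # assumed watts
p = np.full(N_BEAMS, 0.5 * P_MAX)
```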
s204: and according to the Markov decision model, establishing a resource optimization problem of spot beams and establishing an objective function.
In one possible embodiment, the resource optimization problem for spot beams is established by establishing an objective function as shown in equation (3) below
Figure BDA00036255552700001313
Wherein, T reuse Representing the maximum reuse times of a single sub-channel; n represents a spot beam set, and M represents a subchannel set; the constraint C1 is expressed as the maximum transmit power P of each spot beam n max (ii) a The constraint condition C2 indicates that there are only two states of selection and non-selection for each sub-channel at the time slot t; the constraint condition C3 indicates that each point beam n must select a sub-channel as a transmission channel at time slot t; the constraint C4 indicates that each subchannel m can be reused at most by T at time slot T reuse Second, i.e. sharing a maximum T reuse The spot beams may share a subchannel m in time slot t.
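The feasibility conditions C1 to C4 can be checked mechanically; the following sketch assumes the array representation used above, and all names are illustrative.

```python
import numpy as np

def feasible(c, p, p_max, t_reuse):
    """Check constraints C1-C4 of objective (3) for one time slot.
    c: (N, M) 0/1 selection matrix, p: (N,) powers."""
    c1 = np.all((p > 0) & (p <= p_max))          # power bound
    c2 = np.isin(c, (0, 1)).all()                # binary selections
    c3 = np.all(c.sum(axis=1) == 1)              # one subchannel per beam
    c4 = np.all(c.sum(axis=0) <= t_reuse)        # reuse limit per channel
    return bool(c1 and c2 and c3 and c4)

c = np.array([[1, 0], [0, 1], [1, 0]])
print(feasible(c, p=np.array([2.0, 3.0, 1.0]), p_max=10.0, t_reuse=2))  # True
```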
In a possible embodiment, in order to characterize the influence of co-channel interference on the multi-satellite multi-beam system after subchannel allocation, two further sets of parameters are defined:

Parameter 1: the set of all remaining spot beams that use the same subchannel as spot beam n in time slot t, sorted by the strength of the interference they cause to spot beam n.

Parameter 2: the set of spot beams using subchannel m in the vicinity of spot beam n within time slot t; this set is sorted according to the significance, in time slot t-1, of the interference that each interfering spot beam caused to its neighboring spot beams.
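A sketch of how Parameter 1 might be computed follows; the sorting key mirrors the description above, and all names and signatures are assumptions for illustration.

```python
def cochannel_interferers_sorted(n, m, c, p, g_pattern, h_intf):
    """Parameter 1: the other spot beams sharing subchannel m with beam n,
    sorted (strongest first) by the interference power they inject at
    beam n's user. c: (N, M) selections, p: (N,) powers."""
    others = [l for l in range(c.shape[0]) if l != n and c[l, m] == 1]
    power_at_n = {l: p[l] * g_pattern[l, n] * h_intf[l, n, m] for l in others}
    return sorted(others, key=power_at_n.get, reverse=True)
```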
S205: the first-layer subchannel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward, and accumulate the reward values over the loop; the subchannel allocation layer and the power allocation layer share the same reward function $r_n(t)$, composed of the band utilization of the agent's own spot beam in time slot t and a penalty term for the interference caused to the other co-frequency beams.
In one possible implementation, the invention employs a multi-agent deep reinforcement learning network for resource allocation so as to maximize the objective function. Each satellite spot beam serves as an agent, and the remaining agents are treated as part of the environment for centralized training. To describe the influence of each agent's action selection on the remaining agents, every agent adopts the same reward function, defined over the set of all spot beams that are interfered with by spot beam n.
In a possible implementation, the action space and state space of the deep learning network are required. The algorithm sets two actions, allocating a subchannel and allocating power to the subchannel. The subchannel allocation action is the index of the selected subchannel, i.e. the action selects the subchannel m for which $c^m_n(t)=1$; the power allocation action is a continuous value $a^{pow}_n(t)\in[0,1]$, and the spot-beam transmit power is then $p_n(t)=a^{pow}_n(t)\,P_{\max}$, where $P_{\max}$ is a constant, the maximum transmit power that a spot beam can reach.

The state observed by the agent of spot beam n contains the current subchannel selections and the transmit power p of spot beam n, since the network must consider subchannel allocation and power control as a whole.

While considering the band utilization of spot beam n, a penalty term must also be set to express the influence of spot beam n on the co-channel interference suffered by the remaining beams. The reward function is therefore set as the band utilization of spot beam n in time slot t minus a penalty term for the interference spot beam n causes to all the other co-frequency spot beams. The penalty principle is as follows: the larger the share of the co-channel interference generated by this spot beam in the total interference power borne by an interfered beam, the larger the penalty term, and the smaller the corresponding reward.
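Because the exact reward expression appears only as an equation image in the source, the following Python sketch implements the stated penalty principle, with the penalty weight an assumed free parameter.

```python
import numpy as np

def reward(n, m, c, p, h_serv, h_intf, g_pattern, g_max, noise, weight=1.0):
    """Reward of agent n: own band utilization minus a penalty proportional
    to beam n's share of the interference suffered by each co-channel beam.
    The weight is an assumed scaling; the patent shows the exact form only
    as an equation image."""
    N = c.shape[0]

    def cinr(j):
        sig = c[j, m] * p[j] * g_max * h_serv[j, m]
        itf = sum(p[l] * g_pattern[l, j] * h_intf[l, j, m]
                  for l in range(N) if l != j and c[l, m] == 1)
        return sig / (itf + noise)

    utilization = np.log2(1.0 + cinr(n))

    penalty = 0.0
    for j in range(N):                      # beams interfered with by n
        if j == n or c[j, m] == 0:
            continue
        total_itf = sum(p[l] * g_pattern[l, j] * h_intf[l, j, m]
                        for l in range(N) if l != j and c[l, m] == 1)
        mine = p[n] * g_pattern[n, j] * h_intf[n, j, m]
        if total_itf > 0:
            penalty += mine / total_itf     # n's share of j's interference
    return utilization - weight * penalty
```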
The deep reinforcement learning algorithm firstly needs to initialize the following parameters:
s206: initializing deep neural network parameters, and initializing a current cost function and a current network Q value.
In a possible implementation, the deep Q network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the deep Q neural network, together with the deep Q (eval) neural network hyperparameter $\psi$ and the deep Q-target neural network hyperparameter $\psi_{target}$;

the actor network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the actor-eval and actor-target deep neural networks, together with the actor-eval deep neural network hyperparameter $\theta$ and the actor-target deep neural network hyperparameter $\theta_{target}$;

the critic network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the critic-eval and critic-target deep neural networks, together with the critic-eval deep neural network hyperparameter $\phi$ and the critic-target deep neural network hyperparameter $\phi_{target}$.
In one possible implementation, each satellite spot beam acts as an agent consisting of a deep Q network at the top layer, which configures the subchannel-selection action and outputs its Q value, and an actor-critic structure at the bottom layer, which selects the policy configuring the spot-beam power and generates the evaluation function.
S207: training the neural network model offline by the multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result.
In a feasible implementation, the instant reward is obtained through the Bellman equation and the deep learning algorithm is run; the deep learning algorithm comprises a distributed execution part and a centralized training part; the distributed execution part selects new actions according to the Q value, generates new state values by interacting with the environment, and stores them in the top-layer memory pool $D_{subband}$ and the bottom-layer memory pool $D_{power}$; the centralized training part keeps updating the parameters according to the instant reward until the reward value converges, whereupon training and execution terminate, the optimal resource allocation scheme is obtained, and learning stops.
In one possible embodiment, the neural network model is trained offline until the reward converges to a stable value, as shown in fig. 3. The specific steps divide into two parts, distributed execution and centralized training. The distributed execution proceeds as follows:

(1) From the first-layer subchannel allocation layer, obtain the state vector $s_n(t)$ of the agent of spot beam n at time slot t.

(2) Input the state information into the Q-target network and, according to $q(s,a;\psi_{target})$, obtain the action $a_n(t)$ with the maximum q value, while setting the corresponding subchannel selection $c^{a_n(t)}_n(t)=1$.

(3) Store the transition $(s_n(t),a_n(t),r_n(t),s_n(t+1))$ in the memory pool $D_{subband}$ of the subchannel training system.

(4) The second-layer beam power allocation layer obtains from the subchannel allocation layer the state vector of the agent of spot beam n at time slot t (the bottom layer does not need to know all subchannel information, only the decision made at time t).

(5) Input the state information into the second-layer system; according to the policy network $\mu(s;\theta)$, obtain the action $a^{pow}_n(t)$, with corresponding spot-beam transmit power $p_n(t)=a^{pow}_n(t)\,P_{\max}$.

(6) Use the critic-target network of the agent of spot beam n to output the value function $q(s,\mu(s;\theta);\phi_{target})$ of the current power-selection action.

(7) Input the two actions obtained at time t into the wireless satellite system to obtain the environment state $s(t+1)$ of the next round, as sketched below.
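One distributed-execution step might look like the following sketch; the epsilon-greedy exploration and the env_step interface are assumptions for illustration, not details given in the patent.

```python
import random
import torch

def execute_step(state, q_eval, actor_eval, p_max, d_subband, d_power,
                 env_step, epsilon=0.1):
    """One distributed-execution step for the agent of one spot beam:
    pick a subchannel from the Q network, a power from the actor,
    apply both to the environment, and store the transitions."""
    s = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        q_values = q_eval(s)
        subchannel = (random.randrange(q_values.numel())
                      if random.random() < epsilon       # assumed exploration
                      else int(q_values.argmax()))
        power = float(actor_eval(s)) * p_max             # p = mu(s) * P_max

    next_state, reward = env_step(subchannel, power)     # assumed env API
    d_subband.append((state, subchannel, reward, next_state))
    d_power.append((state, power, reward, next_state))
    return next_state
```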
The centralized training is as follows:

a. Define the transmission delay of the system as 1 time slot. In the top-layer centralized training part, at time slot t the agent of spot beam n obtains its latest experience $(s,a,r',s')$ from the memory pool $D_{subband}$. $D_{subband}$ is sampled to obtain a mini-batch $B_s$; the mean-square Bellman error of the target Q network is then expressed as

$$L(\psi,D)=\mathbb{E}_{(s,a,r',s')\sim D}\big[(y(r',s')-q(s,a;\psi))^2\big],$$

where $y(r',s')=r'+\gamma\max_{a'}q'(s',a';\psi_{target})$ and $\gamma$ denotes the discount coefficient.

b. The Bellman error function is minimized by gradient descent, updating the deep Q network parameter $\psi$ of the agent of spot beam n; every $T_u$ time slots the target Q network parameter is updated as $\psi_{target}\leftarrow\psi$.
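Steps a and b correspond to a standard DQN update, which can be sketched as follows; the batch handling and optimizer are assumed implementation details.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(q_eval, q_target, optimizer, d_subband, batch_size=32, gamma=0.9):
    """Minimize the mean-square Bellman error L(psi, D) on a mini-batch B_s
    sampled from the memory pool D_subband; gamma is the discount coefficient."""
    batch = random.sample(d_subband, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    s = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)

    with torch.no_grad():                               # y(r', s') target
        y = r + gamma * q_target(s2).max(dim=1).values
    q = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)  # q(s, a; psi)
    loss = F.mse_loss(q, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every T_u time slots: q_target.load_state_dict(q_eval.state_dict())
```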
c. In the bottom-layer centralized training part, at time slot t the agent of spot beam n obtains its latest experience $(s,a,r',s')$ from the memory pool $D_{power}$. $D_{power}$ is sampled to obtain a mini-batch $B_p$; the mean-square Bellman error of the critic-target network is then expressed as

$$L(\phi,D)=\mathbb{E}_{(s,a,r',s')\sim D}\big[(y_{critic}(r'_{n,m},s'_{n,m})-q(s_{n,m},a_{n,m};\phi))^2\big],$$

where $y_{critic}(r',s')=r'+\zeta\,q'(s',a';\phi_{target})$, the action $a'$ being given by the target policy, and $\zeta$ denotes the discount coefficient.

d. The Bellman error function is minimized by gradient descent, updating the critic-eval network parameter $\phi$ of the agent of spot beam n; every $T_u$ time slots the critic-target network parameter is updated as $\phi_{target}\leftarrow\phi$.

e. The actor network is updated next; since the policy is deterministic, its objective is expressed as $L(\theta,D)=\mathbb{E}_{(s,a,r',s')\sim D}\big[q(s,\mu(s;\theta);\phi)\big]$, which is maximized according to the gradient-ascent strategy, updating the actor-eval network parameter $\theta$ of the agent of spot beam n; every $T_u$ time slots the actor-target network parameter is updated as $\theta_{target}\leftarrow\theta$.
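Steps c through e amount to a DDPG-style critic and actor update, sketched below under the same assumptions as the previous sketch.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_ddpg(actor_eval, actor_target, critic_eval, critic_target,
               actor_opt, critic_opt, d_power, batch_size=32, zeta=0.9):
    """Critic update (steps c-d) and actor update (step e) on a mini-batch
    B_p sampled from the memory pool D_power; zeta is the discount coefficient."""
    batch = random.sample(d_power, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    s = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.float32).unsqueeze(1)
    r = torch.as_tensor(rewards, dtype=torch.float32).unsqueeze(1)
    s2 = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)

    # Steps c-d: minimize (y_critic - q(s, a; phi))^2, a' from the target actor.
    with torch.no_grad():
        y = r + zeta * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
    critic_loss = F.mse_loss(critic_eval(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step e: gradient ascent on q(s, mu(s; theta); phi), i.e. minimize -q.
    actor_loss = -critic_eval(torch.cat([s, actor_eval(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Every T_u time slots: critic_target and actor_target copy the eval
# parameters (phi_target <- phi, theta_target <- theta).
```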
Finally, through the above distributed execution and centralized training, subchannel and power resources are allocated online to the satellite multi-beam network based on the offline training result, so as to maximize the network band utilization.
The embodiment of the invention thus provides multi-beam satellite frequency and power joint allocation for large-scale ultra-dense constellation networking, consisting of a two-layer architecture, a subchannel allocation layer and a spot-beam power allocation layer: in the subchannel allocation layer, because frequency band allocation is discrete, a deep Q-learning algorithm selects the subchannel of each spot beam; in the power allocation layer, the DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers carry out centralized training and distributed execution together, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system band utilization. The resource allocation optimization of each satellite spot beam is treated as a Markov decision process, and resources are allocated online to the satellite multi-beam network based on the offline training result, so that it converges more stably and quickly.
Fig. 4 is a block diagram of a multi-beam mega-constellation satellite frequency and power joint distribution apparatus according to an exemplary embodiment. The apparatus is suitable for any of the above methods. Referring to fig. 4, the apparatus 300 includes:
the network model establishing module 310 is used for establishing a giant-constellation satellite multi-beam network model;
an allocation problem establishing module 320, configured to establish the satellite network resource allocation problem of maximizing the frequency-band utilization rate, and to establish a Markov decision model;
a parameter initialization module 330, configured to initialize deep neural network parameters, and initialize a current cost function and a current network Q value;
and the joint distribution module 340 is configured to train the neural network model offline by using a multi-agent deep reinforcement learning method until the reward converges to a stable value, and perform joint distribution of sub-channels and power to the satellite multi-beam network based on a training result.
Optionally, the network model establishing module 310 is further configured to set k satellites in one constellation system, with n beams per satellite and m sub-channels in the whole system, where L is the number of times each channel can be reused; the channel gain is established as h^m_{k,n}(t), representing the channel gain from satellite k to spot beam n on the m-th sub-channel at time slot t.
Optionally, the allocation problem establishing module 320 is configured to characterize, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and to calculate the frequency-band utilization of spot beam n on sub-channel m in time slot t;
to respectively define the action space and state space of the deep Q network of the first-layer sub-channel allocation layer and of the actor-critic network of the second-layer power allocation layer, and to establish a Markov decision model;
and, according to the Markov decision model, to establish the spot-beam resource optimization problem and its objective function.
Optionally, the allocation problem establishing module 320 is configured to characterize, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t according to the following formula (1):

Γ^m_n(t) = [ α^m_n(t) · p^m_n(t) · G_max · h^m_{k_n,n}(t) ] / [ Σ_{l∈N, l≠n} α^m_l(t) · p^m_l(t) · G(θ_{l,n}) · h^m_{k_l,n}(t) + δ² ]   (1)

wherein α^m_n(t) ∈ {0, 1} represents the channel-selection case, 1 representing that sub-channel m is selected and 0 that it is not; k_n represents the serving satellite selected by spot beam n, and G(θ) represents the antenna pattern; G_max is the maximum antenna gain; θ_{l,n} represents the angular difference between spot beams l and n as seen from satellite k_l; δ² represents the noise of the receiver at spot beam n; p^m_n(t) is the corresponding spot-beam transmit power.

The frequency-band utilization C of spot beam n on sub-channel m in time slot t is expressed according to the following formula (2):

C^m_n(t) = log₂(1 + Γ^m_n(t))   (2)
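As a numerical illustration of formulas (1) and (2) as reconstructed here, a NumPy sketch follows; the array layout (beams × sub-channels) and argument names are assumptions.

```python
import numpy as np

def band_utilization(alpha, p, h, G_theta, g_max, noise, n, m):
    """CINR (formula (1)) and band utilization (formula (2)) of spot beam n
    on sub-channel m at one time slot.

    alpha[l, m] in {0, 1} -- channel-selection indicator of beam l
    p[l, m]              -- transmit power of beam l on sub-channel m
    h[l, m]              -- channel gain toward beam n's receiver
    G_theta[l]           -- antenna pattern value G(theta_{l,n})
    """
    signal = alpha[n, m] * p[n, m] * g_max * h[n, m]
    interference = sum(alpha[l, m] * p[l, m] * G_theta[l] * h[l, m]
                       for l in range(alpha.shape[0]) if l != n)
    gamma = signal / (interference + noise)   # formula (1): CINR of beam n
    return np.log2(1.0 + gamma)               # formula (2): band utilization C
```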
Optionally, the allocation problem establishing module 320 is configured to define the action space and state space of the deep Q network of the first-layer sub-channel allocation layer, to initialize the current cost function and current network Q value, and to carry out the distributed execution part of the system, where each spot beam is taken to represent one agent. The state vector of the agent corresponding to spot beam n at first-layer time slot t is s_n(t), and the Q-target network function is q(s, a; ψ_target), wherein ψ is the deep Q network hyper-parameter; the action obtaining the maximum Q value, a_n(t) = argmax_a q(s_n(t), a; ψ), is selected, indicating that the channel selected at time t is sub-channel m.

The action space and state space of the actor-critic network of the second-layer power allocation layer are defined: the policy network μ(s; θ) yields an action a_{n,m}(t) = μ(s_{n,m}(t); θ), the corresponding spot-beam transmit power is p_{n,m}(t), and the critic network of the agent corresponding to spot beam n is used to output the value function q(s, μ(s, θ); φ) of the current power-selection action;

where θ is the actor network hyper-parameter, P_max is the maximum power the spot beam can reach, and φ is the critic network hyper-parameter.
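The two-layer action selection just defined can be sketched as follows; the ε-greedy exploration and the scaling of the actor output to [0, P_max] are assumptions beyond the text.

```python
import random
import torch

def select_actions(q_net, actor, s_sub, s_pow, p_max, eps=0.1):
    """Top layer: pick the sub-channel with the largest Q value (epsilon-greedy);
    bottom layer: map the actor output mu(s; theta) to a power in [0, P_max]."""
    q_values = q_net(s_sub)
    if random.random() < eps:                 # exploration (assumed; the text fixes only the greedy rule)
        m = random.randrange(q_values.shape[-1])
    else:
        m = int(q_values.argmax())            # a = argmax_a q(s, a; psi)
    a = actor(s_pow)                          # a = mu(s; theta)
    power = float(torch.clamp(a, 0.0, 1.0)) * p_max   # assumed scaling of the action to [0, P_max]
    return m, power
```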
Optionally, the allocation problem establishing module 320 is further configured to establish the spot-beam resource optimization problem, establishing the objective function as the following formula (3):

max Σ_{n∈N} Σ_{m∈M} C^m_n(t)
s.t. C1: 0 ≤ p^m_n(t) ≤ P_max, ∀n∈N
     C2: α^m_n(t) ∈ {0, 1}, ∀n∈N, m∈M
     C3: Σ_{m∈M} α^m_n(t) = 1, ∀n∈N
     C4: Σ_{n∈N} α^m_n(t) ≤ T_reuse, ∀m∈M   (3)

wherein T_reuse represents the maximum number of reuses of a single sub-channel; N represents the spot-beam set and M the sub-channel set; constraint C1 expresses that the maximum transmit power of each spot beam n is P_max; constraint C2 expresses that each sub-channel has only two states at time slot t, selected or not selected; constraint C3 expresses that each spot beam n must select one sub-channel as its transmission channel at time slot t; constraint C4 expresses that each sub-channel m can be reused at most T_reuse times at time slot t, i.e. at most T_reuse spot beams can occupy sub-channel m in time slot t.

The first-layer sub-channel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward and cyclically accumulate reward values; the sub-channel allocation layer and the power allocation layer share the same reward function r_n(t), which consists of the frequency-band utilization of the spot beam itself in time slot t and a penalty term for the interference caused to the other co-frequency beams.
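A sketch of the shared reward, assuming the penalty enters subtractively with an illustrative weight β:

```python
def reward(c_nm, interference_caused, beta=0.1):
    """r_n(t): own band utilization C_n^m(t) minus a penalty proportional to the
    interference this beam inflicts on the other beams sharing sub-channel m.
    The subtractive form and the weight beta are assumptions."""
    return c_nm - beta * interference_caused
```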
Optionally, the parameter initialization module 330 is configured such that the deep Q network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the deep Q neural network, the deep Q neural network hyper-parameter ψ, and the deep-Q-target neural network hyper-parameter ψ_target;
the operator structure network model parameters are
Figure BDA0003625555270000216
Wherein it is present>
Figure BDA0003625555270000217
Expresses the number of neurons in the input layer of the operator-even deep neural network and the operator-target deep neural network, and/or the neuron is selected to be in the neural network>
Figure BDA0003625555270000218
Represents the number of hidden layer neurons in the operator-even deep neural network and the operator-target deep neural network, and/or the number of hidden layer neurons in the operator-even deep neural network>
Figure BDA0003625555270000219
Expressing the neuron number of the output layer of the operator-eval deep neural network and the operator-target deep neural network, theta expressing the ultra parameter of the operator-eval deep neural network, theta target Representing the ultra parameters of the operator-target deep neural network;
the critic structural network model parameters are
Figure BDA00036255552700002110
Wherein it is present>
Figure BDA00036255552700002111
Represents the number of neurons in the input layer of the critic-eval deep neural network and the actor-target deep neural network, and then the neurons are/is>
Figure BDA0003625555270000221
Representing the number of hidden layer neurons in the critical-eval deep neural network and the critical-target deep neural network, and based on the hidden layer neurons in the neural network and the hidden layer neurons in the bright-eval deep neural network>
Figure BDA0003625555270000222
Expressing the number of neurons of an output layer of the critical-eval deep neural network and the operator-target deep neural network, phi expressing the super parameter of the critical-eval deep neural network, phi target The representation represents the critic-target deep neural network hyperparameter.
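A minimal sketch instantiating the three network families with the parameter sets just listed; the concrete layer widths and activations are illustrative assumptions.

```python
import torch.nn as nn

def mlp(n_in, n_hidden, n_out, out_act=None):
    """Input layer -> hidden layer -> output layer, as parameterized above."""
    layers = [nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

n_state, n_channels = 8, 4          # assumed state dimension and sub-channel count
q_eval        = mlp(n_state, 64, n_channels)        # psi
q_target      = mlp(n_state, 64, n_channels)        # psi_target
actor_eval    = mlp(n_state, 64, 1, nn.Sigmoid())   # theta (action in [0, 1])
actor_target  = mlp(n_state, 64, 1, nn.Sigmoid())   # theta_target
critic_eval   = mlp(n_state + 1, 64, 1)             # phi (state-action value)
critic_target = mlp(n_state + 1, 64, 1)             # phi_target

# Target networks start as copies of their eval counterparts.
for src, dst in ((q_eval, q_target), (actor_eval, actor_target), (critic_eval, critic_target)):
    dst.load_state_dict(src.state_dict())
```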
Optionally, the joint allocation module 340 is configured to obtain the instantaneous reward through the Bellman equation and run the deep reinforcement learning algorithm, which comprises a distributed execution part and a centralized training part: the distributed execution part is responsible for selecting a new action according to the Q value, generating a new state value by interacting with the environment, and storing it into the top-layer memory pool D_subband and the bottom-layer memory pool D_power; the centralized training part is responsible for continuously updating the parameters according to the instantaneous reward until the reward value converges, after which training and execution stop and the optimal resource allocation scheme is obtained.
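The memory-pool and execution/training split can be sketched as a skeleton; the environment and agent interfaces (observe, act, apply) are hypothetical names, and the pool sizes are assumptions.

```python
from collections import deque

D_subband = deque(maxlen=10_000)   # top-layer memory pool
D_power   = deque(maxlen=10_000)   # bottom-layer memory pool

def distributed_step(env, agents):
    """Distributed execution: each agent acts and stores its experience."""
    rewards = []
    for agent in agents:
        s = agent.observe(env)                 # current state
        m, power = agent.act(s)                # new action from Q value / actor
        r, s2 = env.apply(agent, m, power)     # interact with the environment
        D_subband.append((s, m, r, s2))        # experience for the deep Q layer
        D_power.append((s, power, r, s2))      # experience for the DDPG layer
        rewards.append(r)
    return rewards                             # used to test reward convergence
```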
In the embodiment of the invention, a multi-beam giant-constellation satellite frequency and power joint allocation device for large-scale-constellation ultra-dense networking is provided. The device consists of a two-layer framework: a sub-channel allocation layer and a spot-beam power allocation layer. In the sub-channel allocation layer, because frequency-band allocation is discrete, a deep Q-learning algorithm is adopted to select the sub-channel of each spot beam; in the power allocation layer, a DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers jointly perform centralized training and distributed execution, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system frequency-band utilization. The resource allocation optimization problem of each spot beam of the satellite is treated as a Markov decision process, and resource allocation is performed online for the satellite multi-beam network based on the offline training result, making it more stable and faster to converge.
Fig. 5 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where the memory 402 stores at least one instruction that is loaded and executed by the processor 401 to implement the following multi-beam giant-constellation satellite frequency and power joint allocation method steps:
S1: establishing a giant-constellation satellite multi-beam network model;
S2: according to the network model, establishing the satellite network resource allocation problem of maximizing the frequency-band utilization rate, establishing a Markov decision model, and establishing the spot-beam resource optimization problem;
S3: constructing a deep neural network model according to the spot-beam resource optimization problem; initializing the deep neural network parameters, and initializing the current cost function and current network Q value;
S4: training the deep neural network model offline through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing joint allocation of frequency and power on the satellite multi-beam network based on the training result.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the above multi-beam giant-constellation satellite frequency and power joint allocation method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A multi-beam giant-constellation satellite frequency and power joint distribution method, characterized by comprising the following steps:
s1: establishing a multi-beam network model of a giant constellation satellite;
s2: according to the network model, a satellite network resource allocation problem with the maximized frequency band utilization rate is established, a Markov decision model is established, and a resource optimization problem of spot beams is established;
s3: constructing a deep neural network model according to the resource optimization problem of the spot beams; initializing a deep neural network parameter, and initializing a current value function and a current network Q value;
s4: and training the deep neural network model in an off-line manner by using a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing combined distribution of frequency and power on the satellite multi-beam network based on a training result.
2. The method according to claim 1, wherein the parameters of the giant-constellation satellite multi-beam network model comprise: k satellites are set in one constellation system, each satellite uses n beams, and the whole system has m sub-channels; the channel gain is established as h^m_{k,n}(t), representing the channel gain from satellite k to spot beam n on the m-th sub-channel at time slot t.
3. The method of claim 2, wherein in step S2, establishing a satellite network resource allocation problem with maximized frequency band utilization, establishing a Markov decision model, and establishing a resource optimization problem for spot beams comprises:
S21: combining the interference generated between co-frequency beams, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the frequency-band utilization of spot beam n on sub-channel m in time slot t;
S22: respectively defining the action space and state space of the deep Q network of the first-layer sub-channel allocation layer and of the actor-critic network of the second-layer power allocation layer, and establishing a Markov decision model;
S23: according to the Markov decision model, establishing the spot-beam resource optimization problem and establishing an objective function.
4. The method according to claim 3, wherein in step S21, in combination with the interference generated between co-frequency beams, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t and calculating the frequency-band utilization of spot beam n on sub-channel m in time slot t comprises:
combining the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t is characterized according to the following formula (1):

Γ^m_n(t) = [ α^m_n(t) · p^m_n(t) · G_max · h^m_{k_n,n}(t) ] / [ Σ_{l∈N, l≠n} α^m_l(t) · p^m_l(t) · G(θ_{l,n}) · h^m_{k_l,n}(t) + δ² ]   (1)

wherein α^m_n(t) ∈ {0, 1} represents the channel-selection case, 1 representing that sub-channel m is selected and 0 that it is not; k_n represents the serving satellite selected by spot beam n, and G(θ) represents the antenna pattern; G_max is the maximum antenna gain; θ_{l,n} represents the angular difference between spot beams l and n as seen from satellite k_l; δ² represents the noise of the receiver at spot-beam user n; p^m_n(t) is the corresponding spot-beam transmit power;
the frequency-band utilization C of spot beam n on sub-channel m in time slot t is expressed according to the following formula (2):

C^m_n(t) = log₂(1 + Γ^m_n(t))   (2)
5. The method of claim 3, wherein step S22, respectively defining the action space and state space of the deep Q network of the first-layer sub-channel allocation layer and of the actor-critic network of the second-layer power allocation layer to build the Markov decision model, comprises:
S221: defining the action space and state space of the deep Q network of the first-layer sub-channel allocation layer, initializing the current cost function and current network Q value, and carrying out the distributed execution part of the system, where each spot beam is taken to represent one agent; the state vector of the agent corresponding to spot beam n at first-layer time slot t is s_n(t), and the Q-target network function is q(s, a; ψ_target), wherein ψ is the deep Q network hyper-parameter; the action obtaining the maximum Q value, a_n(t) = argmax_a q(s_n(t), a; ψ), is selected, indicating that the channel selected at time t is sub-channel m;
S222: defining the action space and state space of the actor-critic network of the second-layer power allocation layer: the policy network μ(s; θ) yields an action a_{n,m}(t) = μ(s_{n,m}(t); θ), the corresponding spot-beam transmit power is p_{n,m}(t), and the critic network of the agent corresponding to spot beam n is used to output the value function q(s, μ(s, θ); φ) of the current power-selection action;
wherein θ is the actor network hyper-parameter, P_max is the maximum power the spot beam can reach, and φ is the critic network hyper-parameter.
6. The method of claim 5, wherein in step S23, establishing the spot-beam resource optimization problem according to the Markov decision model and establishing an objective function comprises:
establishing the spot-beam resource optimization problem, and establishing the objective function as the following formula (3):

max Σ_{n∈N} Σ_{m∈M} C^m_n(t)
s.t. C1: 0 ≤ p^m_n(t) ≤ P_max, ∀n∈N
     C2: α^m_n(t) ∈ {0, 1}, ∀n∈N, m∈M
     C3: Σ_{m∈M} α^m_n(t) = 1, ∀n∈N
     C4: Σ_{n∈N} α^m_n(t) ≤ T_reuse, ∀m∈M   (3)

wherein T_reuse represents the maximum number of reuses of a single sub-channel; N represents the spot-beam set and M the sub-channel set; constraint C1 expresses that the maximum transmit power of each spot beam n is P_max; constraint C2 expresses that each sub-channel has only two states at time slot t, selected or not selected; constraint C3 expresses that each spot beam n must select one sub-channel as its transmission channel at time slot t; constraint C4 expresses that each sub-channel m can be reused at most T_reuse times at time slot t, i.e. at most T_reuse spot beams can occupy sub-channel m in time slot t;
the first-layer sub-channel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward and cyclically accumulate reward values; the sub-channel allocation layer and the power allocation layer share the same reward function r_n(t), which consists of the frequency-band utilization of the spot beam itself in time slot t and a penalty term for the interference caused to the other co-frequency beams.
7. The method according to claim 1, wherein in step S3, initializing the deep neural network parameters and initializing the current cost function and current network Q value comprises:
the deep Q network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the deep Q neural network, the deep Q neural network hyper-parameter ψ, and the deep-Q-target neural network hyper-parameter ψ_target;
the actor network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the actor-eval and actor-target deep neural networks, the actor-eval deep neural network hyper-parameter θ, and the actor-target deep neural network hyper-parameter θ_target;
the critic network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the critic-eval and critic-target deep neural networks, the critic-eval deep neural network hyper-parameter φ, and the critic-target deep neural network hyper-parameter φ_target.
8. The method according to claim 1, wherein in step S4, the neural network model is trained offline through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and the joint allocation of sub-channels and power is performed on the satellite multi-beam network based on the training result, comprising:
obtaining the instantaneous reward through the Bellman equation and running the deep reinforcement learning algorithm, which comprises a distributed execution part and a centralized training part; the distributed execution part is responsible for selecting a new action according to the Q value, generating a new state value by interacting with the environment, and storing it into the top-layer memory pool D_subband and the bottom-layer memory pool D_power; the centralized training part is responsible for continuously updating the parameters according to the instantaneous reward until the reward value converges, after which training and execution stop and the optimal resource allocation scheme is obtained.
9. A multi-beam mega constellation satellite frequency and power joint distribution apparatus, adapted for use in the method of any of the preceding claims 1-8, the apparatus comprising:
the network model building module is used for building a multi-beam network model of a giant constellation satellite;
the distribution problem establishing module is used for establishing a satellite network resource distribution problem with the maximized frequency band utilization rate and establishing a Markov decision model;
the parameter initialization module is used for initializing the deep neural network parameters and initializing the current value function and the current network Q value;
and the joint distribution module is used for off-line training the neural network model through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing joint distribution of sub-channels and power on the satellite multi-beam network based on a training result.
10. The apparatus of claim 9, wherein the network model building module is configured to set k satellites in a constellation system, each satellite having n beams and the entire system having m sub-channels, and wherein the channel gain is expressed as h^m_{k,n}(t), representing the channel gain from satellite k to spot beam n on the m-th sub-channel at time slot t.
CN202210468483.2A 2022-04-29 2022-04-29 Multi-beam giant-constellation satellite frequency and power joint distribution method and device Active CN114978278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210468483.2A CN114978278B (en) 2022-04-29 2022-04-29 Multi-beam giant-constellation satellite frequency and power joint distribution method and device


Publications (2)

Publication Number Publication Date
CN114978278A CN114978278A (en) 2022-08-30
CN114978278B true CN114978278B (en) 2023-04-14

Family

ID=82980189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210468483.2A Active CN114978278B (en) 2022-04-29 2022-04-29 Multi-beam giant-constellation satellite frequency and power joint distribution method and device

Country Status (1)

Country Link
CN (1) CN114978278B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115441939B (en) * 2022-09-20 2024-03-22 深圳泓越信息科技有限公司 MADDPG algorithm-based multi-beam satellite communication system resource allocation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113572517A (en) * 2021-07-30 2021-10-29 哈尔滨工业大学 Beam hopping resource allocation method, system, storage medium and equipment based on deep reinforcement learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9473250B2 (en) * 2014-09-03 2016-10-18 Mitsubishi Electric Research Laboratories, Inc. System and method for recovering carrier phase in optical communications
EP3637324A1 (en) * 2018-10-08 2020-04-15 Nokia Solutions and Networks Oy Geometric constellation shaping for optical data transport
CN111200459B (en) * 2020-01-09 2022-01-28 南京凯瑞得信息科技有限公司 Channel allocation and power control method for uplink multi-beam satellite
WO2021254607A1 (en) * 2020-06-17 2021-12-23 European Space Agency Pragmatic solutions for massive mimo for broadband telecommunication systems
CN112566253B (en) * 2020-11-10 2022-09-06 北京科技大学 Wireless resource allocation joint optimization method and device
CN113691332B (en) * 2021-09-10 2022-05-27 北京科技大学 Co-channel interference characterization method and device of low-earth-orbit satellite communication system


Also Published As

Publication number Publication date
CN114978278A (en) 2022-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant