CN114978278B - Multi-beam giant-constellation satellite frequency and power joint distribution method and device

Info

Publication number: CN114978278B
Authority
CN
China
Prior art keywords
network
neural network
layer
satellite
deep neural
Prior art date
Legal status: Active
Application number: CN202210468483.2A
Other languages: Chinese (zh)
Other versions: CN114978278A
Inventors: Zhang Haijun, Song Wei, Liu Xiangnan, Qiu Yu, Sheng Min, Li Wei, Long Keping
Current Assignee: University of Science and Technology Beijing (USTB)
Original Assignee: University of Science and Technology Beijing (USTB)
Application filed by University of Science and Technology Beijing (USTB)
Priority to CN202210468483.2A
Publication of CN114978278A
Application granted
Publication of CN114978278B
Status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04B: TRANSMISSION
    • H04B 7/00: Radio transmission systems, i.e. using radiation field
    • H04B 7/14: Relay systems
    • H04B 7/15: Active relay systems
    • H04B 7/185: Space-based or airborne stations; stations for satellite systems
    • H04B 7/1851: Systems using a satellite or space-based relay
    • H04B 7/18519: Operations control, administration or maintenance
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 52/00: Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W 52/04: TPC
    • H04W 52/30: TPC using constraints in the total amount of available transmission power
    • H04W 52/34: TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W 72/00: Local resource management
    • H04W 72/04: Wireless resource allocation
    • H04W 72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0453: Resources in frequency domain, e.g. a carrier in FDMA
    • H04W 72/046: Wireless resource allocation based on the type of the allocated resource, the resource being in the space domain, e.g. beams
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention provides a method and device for joint frequency and power allocation in a multi-beam mega-constellation satellite system, in the technical field of wireless communication. The method comprises the following steps: establishing a multi-beam network model of a mega-constellation satellite; formulating a satellite network resource allocation problem that maximizes frequency band utilization, establishing a Markov decision model, and posing a resource optimization problem for the spot beams; constructing a deep neural network model from the spot-beam resource optimization problem; initializing the deep neural network parameters, the current value function, and the current network Q value; and training the deep neural network model offline with a multi-agent deep reinforcement learning method until the reward converges to a stable value, then jointly allocating frequency and power in the satellite multi-beam network based on the training result. The method treats the resource allocation optimization of each satellite spot beam as a Markov decision process and allocates resources online based on the offline training result, so that the satellite multi-beam network converges more stably and quickly.

Description

Multi-beam giant-constellation satellite frequency and power joint distribution method and device
Technical Field
The present invention relates to the field of wireless communication technologies, and in particular to a method and an apparatus for jointly allocating frequency and power of a multi-beam mega-constellation satellite.
Background
The space-ground integrated information network, of which satellite constellation systems are a key component, is important network infrastructure supporting economic development and national defense informatization, and has become a key direction of China's new-infrastructure development. To meet growing service demands such as high-speed connectivity for massive users anywhere in the world and low-latency services for very remote users, a space-ground integrated, large-scale, ultra-dense constellation system must be designed and operated efficiently and reliably. The low-orbit satellite payloads of such a constellation must be highly flexible, using multi-beam transmit antennas to deliver high-quality, low-cost services. To meet the ever-increasing demand for communication capacity, a mega-constellation multi-beam satellite system must overcome its main bottleneck, limited bandwidth, and usually does so through frequency reuse on the user link, i.e., letting different spot beams reuse the same subchannel multiple times. However, co-channel beams interfere with one another, and in a mega-constellation scenario with densely distributed spot beams this co-channel interference can be very severe, greatly reducing the band utilization of the multi-beam satellite system. For such a typical resource-constrained system, allocating on-board resources reasonably and efficiently so as to reduce co-channel interference is the key to improving the band utilization of the communication system.
Allocating subchannel resources and spot-beam power on low-orbit satellites is an important means of mitigating co-channel interference in a large-scale constellation system. Because service scenarios in a low-orbit multi-beam satellite keep changing and grow ever more diverse and complex, a highly flexible resource allocation algorithm must be introduced. Deep reinforcement learning is a reliable approach to such highly dynamic problems and is widely used for resource allocation in wireless communication. However, the classical deep Q-learning and DDPG algorithms cannot handle scenarios containing both discrete and continuous variables, and therefore cannot perform discrete subchannel allocation and continuous spot-beam power allocation at the same time. Moreover, with a conventional single-agent method the dimensionality of the action space grows exponentially with the number of spot beams, so the communication complexity and computation cost suffer from the same problems as traditional centralized optimization. A multi-agent algorithm instead treats the remaining agents as part of the environment to interact with, rather than controlling the whole action space with one agent, and therefore scales more easily to a mega-constellation network system with a large number of spot beams.
In summary, in the prior art, the dense deployment of spot beams in a mega-constellation satellite system causes severe co-channel inter-cell interference and low utilization of system frequency band resources.
Disclosure of Invention
The invention provides a multi-beam mega-constellation satellite frequency and power joint allocation method and device, aiming to solve the prior-art problems that dense spot-beam deployment in a mega-constellation satellite system causes severe co-channel inter-cell interference and low utilization of system frequency band resources.
In order to solve the technical problems, the invention provides the following technical scheme:

In one aspect, a multi-beam mega-constellation satellite frequency and power joint allocation method is provided, comprising the following steps:

S1: establishing a multi-beam network model of a mega-constellation satellite;

S2: formulating a satellite network resource allocation problem that maximizes frequency band utilization, and establishing a Markov decision model;

S3: initializing the deep neural network parameters, and initializing a current value function and a current network Q value;

S4: training the neural network model offline by a multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result.
Optionally, k satellites are set in a constellation system, each satellite uses n beams, and the whole system has m subchannels in total; the channel gain is denoted $h^m_{k\to n}(t)$, the gain from satellite k to spot beam n on the m-th subchannel.
Optionally, in step S2, formulating the satellite network resource allocation problem that maximizes frequency band utilization and establishing the Markov decision model includes:

S21: characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the band utilization of spot beam n on subchannel m in time slot t;

S22: defining, respectively, the action space and state space of the deep Q network of the first-layer subchannel allocation layer and of the actor-critic network of the second-layer power allocation layer, and establishing the Markov decision model;

S23: establishing, according to the Markov decision model, the resource optimization problem of the spot beams and the objective function.
Optionally, in step S21, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t in combination with the interference generated between co-frequency beams, and calculating the band utilization of spot beam n on subchannel m in time slot t, includes:

characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio $\gamma^m_n(t)$ of spot beam n at time slot t according to the following formula (1):

$$\gamma^m_n(t)=\frac{c^m_n(t)\,p_n(t)\,G_{\max}\,h^m_{k_n\to n}(t)}{\sum_{l\in N,\ l\neq n} c^m_l(t)\,p_l(t)\,G(\theta_{l,n})\,h^m_{k_l\to n}(t)+\delta^2}\qquad(1)$$

where $c^m_n(t)$ represents the channel selection, 1 meaning subchannel m is selected and 0 meaning it is not; $k_n$ denotes the serving satellite selected by spot beam n; $G(\theta)$ denotes the antenna pattern; $G_{\max}$ is the maximum antenna gain; $\theta_{l,n}$ denotes the angular separation between spot beams l and n as seen from satellite $k_l$; and $\delta^2$ denotes the receiver noise at the user of spot beam n;

expressing the band utilization $C^m_n(t)$ of spot beam n on subchannel m in time slot t according to the following formula (2):

$$C^m_n(t)=\log_2\!\left(1+\gamma^m_n(t)\right)\qquad(2)$$
optionally, in S22, defining an action space and a state space of the deep Q network of the first layer of subchannel allocation layer and an action-critical network of the second layer of power allocation layer, respectively, and establishing a markov decision model, including:
s221: defining action space and state space of a deep Q network of a first layer of sub-channel distribution layer, initializing a current value function and a current network Q value, and performing distributed execution part of the system, wherein one spot beam is selected to represent an agent, and the state vector of the agent corresponding to the spot beam n in a first layer of time slot t is
Figure BDA0003625555270000035
Q-target network function of
Figure BDA0003625555270000036
Wherein psi is the depth Q network super parameter, the action which obtains the maximum Q value is selected>
Figure BDA0003625555270000037
Represents that the selected channel at time t is pick>
Figure BDA0003625555270000038
/>
S222: defining action space and state space of operator-critical network of second layer power distribution layer, policy network
Figure BDA0003625555270000039
Get an action>
Figure BDA00036255552700000310
The corresponding spot beam transmit power is pick-up>
Figure BDA00036255552700000311
Figure BDA00036255552700000312
Using the criticic network of the agent corresponding to the spot beam n to output the value function q (s, mu (s, theta); phi) of the current power selection action;
wherein theta is an actor network hyper-parameter, P max Achievable for spot beamsAnd the maximum power phi is critical network super parameter.
Optionally, in step S23, establishing the resource optimization problem of the spot beams according to the Markov decision model and establishing the objective function includes:

S231: establishing the resource optimization problem of the spot beams, with the objective function of the following formula (3):

$$\max_{c,\,p}\ \sum_{n\in N}\sum_{m\in M} C^m_n(t)\qquad(3)$$

$$\text{s.t.}\quad C1:\ 0<p_n(t)\le P_{\max},\ \forall n\in N;\quad C2:\ c^m_n(t)\in\{0,1\};\quad C3:\ \sum_{m\in M}c^m_n(t)=1,\ \forall n\in N;\quad C4:\ \sum_{n\in N}c^m_n(t)\le T_{reuse},\ \forall m\in M$$

where $T_{reuse}$ denotes the maximum number of times a single subchannel may be reused; N denotes the spot-beam set and M the subchannel set. Constraint C1 states that the maximum transmit power of each spot beam n is $P_{\max}$; constraint C2 that at time slot t each subchannel has only two states, selected or not selected; constraint C3 that each spot beam n must select one subchannel as its transmission channel at time slot t; and constraint C4 that each subchannel m can be reused at most $T_{reuse}$ times at time slot t, i.e., at most $T_{reuse}$ spot beams may share subchannel m in time slot t;

S232: the first-layer subchannel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward, and accumulate the reward values over the loop; the subchannel allocation layer and the power allocation layer share the same reward function $r_n(t)$, composed of the band utilization of the agent's own spot beam in time slot t and a penalty term for the interference caused to the other co-frequency beams.
Optionally, in step S3, initializing the deep neural network parameters and initializing the current value function and current network Q value includes:

the deep Q network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the deep Q neural network, together with the deep Q (eval) neural network hyperparameter $\psi$ and the deep Q-target neural network hyperparameter $\psi_{target}$;

the actor network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the actor-eval and actor-target deep neural networks, together with the actor-eval deep neural network hyperparameter $\theta$ and the actor-target deep neural network hyperparameter $\theta_{target}$;

the critic network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the critic-eval and critic-target deep neural networks, together with the critic-eval deep neural network hyperparameter $\phi$ and the critic-target deep neural network hyperparameter $\phi_{target}$.
Optionally, in step S4, training the neural network model offline by the multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result, includes:

obtaining the instant reward through the Bellman equation and running the deep learning algorithm, which comprises a distributed execution part and a centralized training part; the distributed execution part selects new actions according to the Q value, generates new state values by interacting with the environment, and stores them in the top-layer memory pool $D_{subband}$ and the bottom-layer memory pool $D_{power}$; the centralized training part keeps updating the parameters according to the instant reward until the reward value converges, whereupon training and execution terminate, the optimal resource allocation scheme is obtained, and learning stops.
In one aspect, a multi-beam mega-constellation satellite frequency and power joint distribution apparatus is provided for use in an electronic device and adapted to the method of any one of the above, the apparatus comprising:
the network model establishing module is used for establishing a multi-beam network model of the giant constellation satellite;
the distribution problem establishing module is used for establishing a satellite network resource distribution problem with the maximized frequency band utilization rate and establishing a Markov decision model;
the parameter initialization module is used for initializing the deep neural network parameters and initializing a current value function and a current network Q value;
and the joint distribution module is used for training the neural network model offline through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing joint allocation of subchannels and power on the satellite multi-beam network based on the training result.
Optionally, the network model establishing module is configured to set k satellites in a constellation system, each satellite using n beams and the whole system having m subchannels in total, with the channel gain denoted $h^m_{k\to n}(t)$, the gain from satellite k to spot beam n on the m-th subchannel.
In one aspect, an electronic device is provided, comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the above multi-beam mega-constellation satellite frequency and power joint allocation method.

In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the above multi-beam mega-constellation satellite frequency and power joint allocation method.
The technical scheme of the embodiment of the invention has at least the following beneficial effects:

In the above solution, the invention provides multi-beam mega-constellation satellite frequency and power joint allocation for large-scale ultra-dense constellation networking. The method consists of a two-layer architecture, a subchannel allocation layer and a spot-beam power allocation layer: in the subchannel allocation layer, because frequency band allocation is discrete, a deep Q-learning algorithm selects the subchannel of each spot beam; in the power allocation layer, the DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers carry out centralized training and distributed execution together, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system band utilization. The resource allocation optimization of each satellite spot beam is treated as a Markov decision process, and resources are allocated online to the satellite multi-beam network based on the offline training result, so that it converges more stably and quickly.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a multi-beam satellite frequency and power joint allocation according to an embodiment of the present invention;
fig. 2 is a flow chart of a multi-beam satellite frequency and power joint assignment according to an embodiment of the present invention;
fig. 3 is a flowchart of a deep reinforcement learning algorithm for jointly allocating multi-beam satellite frequencies and powers according to an embodiment of the present invention;
fig. 4 is a block diagram of a multi-beam mega-constellation satellite frequency and power joint distribution apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a multi-beam mega-constellation satellite frequency and power joint allocation method, which can be implemented by an electronic device; the electronic device may be a terminal or a server. As shown in fig. 1, the processing flow of the method may include the following steps:

S101: establishing a multi-beam network model of a mega-constellation satellite;

S102: formulating a satellite network resource allocation problem that maximizes frequency band utilization, and establishing a Markov decision model;

S103: initializing the deep neural network parameters, and initializing a current value function and a current network Q value;

S104: training the neural network model offline by a multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result.
Optionally, k satellites are set in a constellation system, each satellite uses n beams, and the whole system has m subchannels in total; the channel gain is denoted $h^m_{k\to n}(t)$, the gain from satellite k to spot beam n on the m-th subchannel.
Optionally, in step S102, formulating the satellite network resource allocation problem that maximizes frequency band utilization and establishing the Markov decision model includes:

S121: characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the band utilization of spot beam n on subchannel m in time slot t;

S122: defining, respectively, the action space and state space of the deep Q network of the first-layer subchannel allocation layer and of the actor-critic network of the second-layer power allocation layer, and establishing the Markov decision model;

S123: establishing, according to the Markov decision model, the resource optimization problem of the spot beams and the objective function.
Optionally, in step S121, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t in combination with the interference generated between co-frequency beams, and calculating the band utilization of spot beam n on subchannel m in time slot t, includes:

characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio $\gamma^m_n(t)$ of spot beam n at time slot t according to the following formula (1):

$$\gamma^m_n(t)=\frac{c^m_n(t)\,p_n(t)\,G_{\max}\,h^m_{k_n\to n}(t)}{\sum_{l\in N,\ l\neq n} c^m_l(t)\,p_l(t)\,G(\theta_{l,n})\,h^m_{k_l\to n}(t)+\delta^2}\qquad(1)$$

where $c^m_n(t)$ represents the channel selection, 1 meaning subchannel m is selected and 0 meaning it is not; $k_n$ denotes the serving satellite selected by spot beam n; $G(\theta)$ denotes the antenna pattern; $G_{\max}$ is the maximum antenna gain; $\theta_{l,n}$ denotes the angular separation between spot beams l and n as seen from satellite $k_l$; and $\delta^2$ denotes the receiver noise at the user of spot beam n;

expressing the band utilization $C^m_n(t)$ of spot beam n on subchannel m in time slot t according to the following formula (2):

$$C^m_n(t)=\log_2\!\left(1+\gamma^m_n(t)\right)\qquad(2)$$
optionally, in S122, an action space and a state space of the deep Q network of the first layer of subchannel allocation layer and an action-critical network of the second layer of power allocation layer are respectively defined, and the establishing the markov decision model includes:
s1221: defining action space and state space of a deep Q network of a first layer of sub-channel distribution layer, initializing a current value function and a current network Q value, and performing distributed execution part of the system, wherein one spot beam is selected to represent an agent, and the state vector of the agent corresponding to the spot beam n in a first layer of time slot t is
Figure BDA0003625555270000092
Q-target network function of
Figure BDA0003625555270000093
Wherein psi is the depth Q network super parameter, the action which obtains the maximum Q value is selected>
Figure BDA0003625555270000094
Represents that the selected channel at time t is pick>
Figure BDA0003625555270000095
S1222: defining action space and state space of operator-critical network of second layer power distribution layer, policy network
Figure BDA0003625555270000096
Get the action->
Figure BDA0003625555270000097
Then the corresponding spot beam transmit power is &>
Figure BDA0003625555270000098
Figure BDA0003625555270000099
Using the critic network of the agent corresponding to the spot beam n to output the value function q (s, mu (s, theta); phi) of the current power selection action;
wherein theta is an actor network hyper-parameter, P max Phi is the critical network super-parameter for the maximum power that the spot beam can reach.
Optionally, in step S123, establishing the resource optimization problem of the spot beams according to the Markov decision model and establishing the objective function includes:

S1231: establishing the resource optimization problem of the spot beams, with the objective function of the following formula (3):

$$\max_{c,\,p}\ \sum_{n\in N}\sum_{m\in M} C^m_n(t)\qquad(3)$$

$$\text{s.t.}\quad C1:\ 0<p_n(t)\le P_{\max},\ \forall n\in N;\quad C2:\ c^m_n(t)\in\{0,1\};\quad C3:\ \sum_{m\in M}c^m_n(t)=1,\ \forall n\in N;\quad C4:\ \sum_{n\in N}c^m_n(t)\le T_{reuse},\ \forall m\in M$$

where $T_{reuse}$ denotes the maximum number of times a single subchannel may be reused; N denotes the spot-beam set and M the subchannel set. Constraint C1 states that the maximum transmit power of each spot beam n is $P_{\max}$; constraint C2 that at time slot t each subchannel has only two states, selected or not selected; constraint C3 that each spot beam n must select one subchannel as its transmission channel at time slot t; and constraint C4 that each subchannel m can be reused at most $T_{reuse}$ times at time slot t, i.e., at most $T_{reuse}$ spot beams may share subchannel m in time slot t;

S1232: the first-layer subchannel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward, and accumulate the reward values over the loop; the subchannel allocation layer and the power allocation layer share the same reward function $r_n(t)$, composed of the band utilization of the agent's own spot beam in time slot t and a penalty term for the interference caused to the other co-frequency beams.
Optionally, in step S103, initializing the deep neural network parameters and initializing the current value function and current network Q value includes:

the deep Q network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the deep Q neural network, together with the deep Q (eval) neural network hyperparameter $\psi$ and the deep Q-target neural network hyperparameter $\psi_{target}$;

the actor network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the actor-eval and actor-target deep neural networks, together with the actor-eval deep neural network hyperparameter $\theta$ and the actor-target deep neural network hyperparameter $\theta_{target}$;

the critic network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the critic-eval and critic-target deep neural networks, together with the critic-eval deep neural network hyperparameter $\phi$ and the critic-target deep neural network hyperparameter $\phi_{target}$.
Optionally, in step S104, training the neural network model offline by the multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result, includes:

obtaining the instant reward through the Bellman equation and running the deep learning algorithm, which comprises a distributed execution part and a centralized training part; the distributed execution part selects new actions according to the Q value, generates new state values by interacting with the environment, and stores them in the top-layer memory pool $D_{subband}$ and the bottom-layer memory pool $D_{power}$; the centralized training part keeps updating the parameters according to the instant reward until the reward value converges, whereupon training and execution terminate, the optimal resource allocation scheme is obtained, and learning stops.
In the embodiment of the invention, multi-beam mega-constellation satellite frequency and power joint allocation is provided for large-scale ultra-dense constellation networking. The method consists of a two-layer architecture, a subchannel allocation layer and a spot-beam power allocation layer: in the subchannel allocation layer, because frequency band allocation is discrete, a deep Q-learning algorithm selects the subchannel of each spot beam; in the power allocation layer, the DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers carry out centralized training and distributed execution together, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system band utilization. The resource allocation optimization of each satellite spot beam is treated as a Markov decision process, and resources are allocated online to the satellite multi-beam network based on the offline training result, so that it converges more stably and quickly.
The embodiment of the invention provides a multi-beam mega-constellation satellite frequency and power joint allocation method, which can be implemented by an electronic device; the electronic device may be a terminal or a server. As shown in fig. 2, the processing flow of the method may include the following steps:

S201: establishing a multi-beam network model of the mega-constellation satellite.
In a possible embodiment, the invention first constructs a mega-constellation network scenario with densely deployed spot beams. K satellites are set in the constellation system, with the satellite set denoted $\mathcal{K}=\{1,2,3,\dots,K\}$; each satellite uses N spot beams, with the spot-beam set denoted $\mathcal{N}=\{1,2,3,\dots,N\}$; the whole system has M subchannels in total, with the subchannel set denoted $\mathcal{M}=\{1,2,3,\dots,M\}$; typically $N>M$. L is the number of times each channel can be reused. The channel gain from satellite k to spot beam n on the m-th subchannel is modeled as

$$h^m_{k\to n}(t)=\beta_{k\to n}\,g^m_{k\to n}(t),\qquad t=1,2,3,\dots$$

Here $\beta_{k\to n}$ denotes the large-scale fading from satellite k to ground spot beam n, which by the Friis formula is

$$\beta_{k\to n}=\frac{G_t\,G_r\,\lambda^2}{(4\pi d)^2\,L},$$

where $G_t$ denotes the transmit gain, $G_r$ the receive gain, $L$ a system loss factor independent of the propagation environment, $d$ the antenna-to-ground-terminal distance, and $\lambda$ the carrier wavelength; $g^m_{k\to n}(t)$ represents small-scale Rayleigh fading.
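By way of illustration, this channel model can be sketched in Python as follows; it is a minimal sketch, and the carrier frequency, antenna gains, and slant range are assumed example values, not parameters taken from the patent.

```python
import numpy as np

def friis_large_scale_gain(g_t, g_r, wavelength, d, loss=1.0):
    """Large-scale fading beta_{k->n} from the Friis transmission equation."""
    return (g_t * g_r * wavelength ** 2) / ((4.0 * np.pi * d) ** 2 * loss)

def channel_gain(g_t, g_r, wavelength, d, rng, loss=1.0):
    """Channel gain h^m_{k->n}(t): Friis large-scale term times a
    small-scale Rayleigh fading power term with unit mean power."""
    beta = friis_large_scale_gain(g_t, g_r, wavelength, d, loss)
    amplitude = rng.rayleigh(scale=np.sqrt(0.5))  # E[amplitude^2] = 1
    return beta * amplitude ** 2

rng = np.random.default_rng(0)
wavelength = 3e8 / 20e9                    # assumed 20 GHz Ka-band carrier
h = channel_gain(g_t=10 ** (30 / 10),      # assumed 30 dBi transmit gain
                 g_r=10 ** (35 / 10),      # assumed 35 dBi receive gain
                 wavelength=wavelength,
                 d=1200e3,                 # assumed 1200 km slant range
                 rng=rng)
print(f"h = {h:.3e}")
```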
S202: characterizing, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the band utilization of spot beam n on subchannel m in time slot t.

In one possible implementation, the carrier-to-interference-plus-noise ratio $\gamma^m_n(t)$ of spot beam n at time slot t is characterized, in combination with the interference generated between co-frequency beams, according to the following formula (1):

$$\gamma^m_n(t)=\frac{c^m_n(t)\,p_n(t)\,G_{\max}\,h^m_{k_n\to n}(t)}{\sum_{l\in N,\ l\neq n} c^m_l(t)\,p_l(t)\,G(\theta_{l,n})\,h^m_{k_l\to n}(t)+\delta^2}\qquad(1)$$

where $c^m_n(t)$ represents the channel selection, 1 meaning subchannel m is selected and 0 meaning it is not; $k_n\in\mathcal{K}$ denotes the serving satellite selected by spot beam n; $G(\theta)$ denotes the antenna pattern; $G_{\max}$ is the maximum antenna gain; $\theta_{l,n}$ denotes the angular separation between spot beams l and n as seen from satellite $k_l$; and $\delta^2$ denotes the receiver noise at spot beam n.

The band utilization $C^m_n(t)$ of spot beam n on subchannel m in time slot t is expressed according to the following formula (2):

$$C^m_n(t)=\log_2\!\left(1+\gamma^m_n(t)\right)\qquad(2)$$
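A minimal Python sketch of formulas (1) and (2) follows; the array shapes, names, and the random numbers in the demo are illustrative assumptions only.

```python
import numpy as np

def band_utilization(n, m, c, p, h_serv, h_intf, g_pattern, g_max, noise):
    """Spectral efficiency C^m_n(t) = log2(1 + CINR) of spot beam n on
    subchannel m, per formulas (1) and (2).

    c        : (N, M) 0/1 subchannel selections
    p        : (N,)   transmit powers
    h_serv   : (N, M) gain from each beam's serving satellite to its user
    h_intf   : (N, N, M) gain from beam l's satellite to beam n's user
    g_pattern: (N, N) antenna pattern G(theta_{l,n}) toward beam n
    """
    signal = c[n, m] * p[n] * g_max * h_serv[n, m]
    interferers = [l for l in range(c.shape[0]) if l != n and c[l, m] == 1]
    interference = sum(p[l] * g_pattern[l, n] * h_intf[l, n, m]
                       for l in interferers)
    cinr = signal / (interference + noise)
    return np.log2(1.0 + cinr)

rng = np.random.default_rng(1)
N, M = 3, 2
c = np.array([[1, 0], [1, 0], [0, 1]])  # beams 0 and 1 share subchannel 0
print(band_utilization(n=0, m=0, c=c,
                       p=rng.uniform(1, 5, N),
                       h_serv=rng.uniform(1e-12, 1e-10, (N, M)),
                       h_intf=rng.uniform(1e-13, 1e-11, (N, N, M)),
                       g_pattern=rng.uniform(0.01, 0.1, (N, N)),
                       g_max=1.0, noise=1e-13))
```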
s203: and respectively defining the action space and the state space of a deep Q network of a first layer of sub-channel distribution layer and the action space and the state space of an operator-critical network of a second layer of power distribution layer, and establishing a Markov decision model.
In a possible implementation manner, an action space and a state space of a deep Q network of a first layer of sub-channel allocation layer are defined, a current cost function and a current network Q value are initialized, and a distributed execution part of the system is performed, wherein one spot beam is selected to represent an agent, and a state vector of the agent corresponding to the spot beam n in a first layer of time slot t is
Figure BDA0003625555270000131
The Q-target network function is ^ greater than>
Figure BDA0003625555270000132
Wherein psi is the depth Q network super parameter, and the action with the maximum Q value is selected
Figure BDA0003625555270000133
Represents at tThe selected channel is->
Figure BDA0003625555270000134
Defining action space and state space of operator-critical network of second layer power distribution layer, policy network
Figure BDA0003625555270000135
Get the action->
Figure BDA0003625555270000136
The corresponding spot beam transmit power is pick-up>
Figure BDA0003625555270000137
Using the criticic network of the agent corresponding to the spot beam n to output the value function q (s, mu (s, theta); phi) of the current power selection action;
where θ is an actor network hyper-parameter, P max Phi is the critical network super-parameter for the maximum power that the spot beam can reach.
In one possible implementation, the subchannel allocation of spot beam n in time slot t is the vector $c_n(t)=[c^1_n(t),\dots,c^M_n(t)]$; the subchannel allocation of all spot beams within time slot t is then $C(t)=[c_1(t),\dots,c_N(t)]$; and the power allocation of the spot beams in time slot t is $P(t)=[p_1(t),\dots,p_N(t)]$.
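For illustration, these allocation variables can be held as plain arrays, as in the following sketch; the beam and channel counts are assumed toy values.

```python
import numpy as np

N_BEAMS, M_CHANNELS = 4, 2    # assumed toy sizes (N > M, as in the model)

# c[n, m] = 1 iff spot beam n uses subchannel m in the current slot;
# constraint C3 requires exactly one 1 per row.
c = np.zeros((N_BEAMS, M_CHANNELS), dtype=int)
c[np.arange(N_BEAMS), [0, 1, 0, 1]] = 1

# p[n] is the transmit power of spot beam n, bounded by P_max (C1).
P_MAX = 10.0                  # assumed watts
p = np.full(N_BEAMS, 0.5 * P_MAX)
```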
s204: and according to the Markov decision model, establishing a resource optimization problem of spot beams and establishing an objective function.
In one possible embodiment, the resource optimization problem for spot beams is established by establishing an objective function as shown in equation (3) below
Figure BDA00036255552700001313
Wherein, T reuse Representing the maximum reuse times of a single sub-channel; n represents a spot beam set, and M represents a subchannel set; the constraint C1 is expressed as the maximum transmit power P of each spot beam n max (ii) a The constraint condition C2 indicates that there are only two states of selection and non-selection for each sub-channel at the time slot t; the constraint condition C3 indicates that each point beam n must select a sub-channel as a transmission channel at time slot t; the constraint C4 indicates that each subchannel m can be reused at most by T at time slot T reuse Second, i.e. sharing a maximum T reuse The spot beams may share a subchannel m in time slot t.
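The feasibility conditions C1 to C4 can be checked mechanically; the following sketch assumes the array representation used above, and all names are illustrative.

```python
import numpy as np

def feasible(c, p, p_max, t_reuse):
    """Check constraints C1-C4 of objective (3) for one time slot.
    c: (N, M) 0/1 selection matrix, p: (N,) powers."""
    c1 = np.all((p > 0) & (p <= p_max))          # power bound
    c2 = np.isin(c, (0, 1)).all()                # binary selections
    c3 = np.all(c.sum(axis=1) == 1)              # one subchannel per beam
    c4 = np.all(c.sum(axis=0) <= t_reuse)        # reuse limit per channel
    return bool(c1 and c2 and c3 and c4)

c = np.array([[1, 0], [0, 1], [1, 0]])
print(feasible(c, p=np.array([2.0, 3.0, 1.0]), p_max=10.0, t_reuse=2))  # True
```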
In a possible embodiment, in order to characterize the influence of co-channel interference on the multi-satellite multi-beam system after subchannel allocation, two further sets of parameters are defined:

Parameter 1: the set of all remaining spot beams that use the same subchannel as spot beam n in time slot t, sorted by the strength of the interference they cause to spot beam n.

Parameter 2: the set of spot beams using subchannel m in the vicinity of spot beam n within time slot t; this set is sorted according to the significance, in time slot t-1, of the interference that each interfering spot beam caused to its neighboring spot beams.
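A sketch of how Parameter 1 might be computed follows; the sorting key mirrors the description above, and all names and signatures are assumptions for illustration.

```python
def cochannel_interferers_sorted(n, m, c, p, g_pattern, h_intf):
    """Parameter 1: the other spot beams sharing subchannel m with beam n,
    sorted (strongest first) by the interference power they inject at
    beam n's user. c: (N, M) selections, p: (N,) powers."""
    others = [l for l in range(c.shape[0]) if l != n and c[l, m] == 1]
    power_at_n = {l: p[l] * g_pattern[l, n] * h_intf[l, n, m] for l in others}
    return sorted(others, key=power_at_n.get, reverse=True)
```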
S205: the first-layer subchannel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward, and accumulate the reward values over the loop; the subchannel allocation layer and the power allocation layer share the same reward function $r_n(t)$, composed of the band utilization of the agent's own spot beam in time slot t and a penalty term for the interference caused to the other co-frequency beams.
In one possible implementation, the invention employs a multi-agent deep reinforcement learning network for resource allocation so as to maximize the objective function. Each satellite spot beam serves as an agent, and the remaining agents are treated as part of the environment for centralized training. To describe the influence of each agent's action selection on the remaining agents, every agent adopts the same reward function, defined over the set of all spot beams that are interfered with by spot beam n.
In a possible implementation, the action space and state space of the deep learning network are required. The algorithm sets two actions, allocating a subchannel and allocating power to the subchannel. The subchannel allocation action is the index of the selected subchannel, i.e. the action selects the subchannel m for which $c^m_n(t)=1$; the power allocation action is a continuous value $a^{pow}_n(t)\in[0,1]$, and the spot-beam transmit power is then $p_n(t)=a^{pow}_n(t)\,P_{\max}$, where $P_{\max}$ is a constant, the maximum transmit power that a spot beam can reach.

The state observed by the agent of spot beam n contains the current subchannel selections and the transmit power p of spot beam n, since the network must consider subchannel allocation and power control as a whole.

While considering the band utilization of spot beam n, a penalty term must also be set to express the influence of spot beam n on the co-channel interference suffered by the remaining beams. The reward function is therefore set as the band utilization of spot beam n in time slot t minus a penalty term for the interference spot beam n causes to all the other co-frequency spot beams. The penalty principle is as follows: the larger the share of the co-channel interference generated by this spot beam in the total interference power borne by an interfered beam, the larger the penalty term, and the smaller the corresponding reward.
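Because the exact reward expression appears only as an equation image in the source, the following Python sketch implements the stated penalty principle, with the penalty weight an assumed free parameter.

```python
import numpy as np

def reward(n, m, c, p, h_serv, h_intf, g_pattern, g_max, noise, weight=1.0):
    """Reward of agent n: own band utilization minus a penalty proportional
    to beam n's share of the interference suffered by each co-channel beam.
    The weight is an assumed scaling; the patent shows the exact form only
    as an equation image."""
    N = c.shape[0]

    def cinr(j):
        sig = c[j, m] * p[j] * g_max * h_serv[j, m]
        itf = sum(p[l] * g_pattern[l, j] * h_intf[l, j, m]
                  for l in range(N) if l != j and c[l, m] == 1)
        return sig / (itf + noise)

    utilization = np.log2(1.0 + cinr(n))

    penalty = 0.0
    for j in range(N):                      # beams interfered with by n
        if j == n or c[j, m] == 0:
            continue
        total_itf = sum(p[l] * g_pattern[l, j] * h_intf[l, j, m]
                        for l in range(N) if l != j and c[l, m] == 1)
        mine = p[n] * g_pattern[n, j] * h_intf[n, j, m]
        if total_itf > 0:
            penalty += mine / total_itf     # n's share of j's interference
    return utilization - weight * penalty
```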
The deep reinforcement learning algorithm firstly needs to initialize the following parameters:
s206: initializing deep neural network parameters, and initializing a current cost function and a current network Q value.
In a possible implementation, the deep Q network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the deep Q neural network, together with the deep Q (eval) neural network hyperparameter $\psi$ and the deep Q-target neural network hyperparameter $\psi_{target}$;

the actor network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the actor-eval and actor-target deep neural networks, together with the actor-eval deep neural network hyperparameter $\theta$ and the actor-target deep neural network hyperparameter $\theta_{target}$;

the critic network model parameters comprise the numbers of input-layer, hidden-layer, and output-layer neurons of the critic-eval and critic-target deep neural networks, together with the critic-eval deep neural network hyperparameter $\phi$ and the critic-target deep neural network hyperparameter $\phi_{target}$.
In one possible implementation, each satellite spot beam acts as an agent consisting of a deep Q network at the top layer, which configures the subchannel-selection action and outputs its Q value, and an actor-critic structure at the bottom layer, which selects the policy configuring the spot-beam power and generates the evaluation function.
S207: training the neural network model offline by the multi-agent deep reinforcement learning method until the reward converges to a stable value, and jointly allocating subchannels and power to the satellite multi-beam network based on the training result.
In a feasible implementation, the instant reward is obtained through the Bellman equation and the deep learning algorithm is run; the deep learning algorithm comprises a distributed execution part and a centralized training part; the distributed execution part selects new actions according to the Q value, generates new state values by interacting with the environment, and stores them in the top-layer memory pool $D_{subband}$ and the bottom-layer memory pool $D_{power}$; the centralized training part keeps updating the parameters according to the instant reward until the reward value converges, whereupon training and execution terminate, the optimal resource allocation scheme is obtained, and learning stops.
In one possible embodiment, the neural network model is trained offline until the reward converges to a stable value, as shown in fig. 3. The specific steps divide into two parts, distributed execution and centralized training. The distributed execution proceeds as follows:

(1) From the first-layer subchannel allocation layer, obtain the state vector $s_n(t)$ of the agent of spot beam n at time slot t.

(2) Input the state information into the Q-target network and, according to $q(s,a;\psi_{target})$, obtain the action $a_n(t)$ with the maximum q value, while setting the corresponding subchannel selection $c^{a_n(t)}_n(t)=1$.

(3) Store the transition $(s_n(t),a_n(t),r_n(t),s_n(t+1))$ in the memory pool $D_{subband}$ of the subchannel training system.

(4) The second-layer beam power allocation layer obtains from the subchannel allocation layer the state vector of the agent of spot beam n at time slot t (the bottom layer does not need to know all subchannel information, only the decision made at time t).

(5) Input the state information into the second-layer system; according to the policy network $\mu(s;\theta)$, obtain the action $a^{pow}_n(t)$, with corresponding spot-beam transmit power $p_n(t)=a^{pow}_n(t)\,P_{\max}$.

(6) Use the critic-target network of the agent of spot beam n to output the value function $q(s,\mu(s;\theta);\phi_{target})$ of the current power-selection action.

(7) Input the two actions obtained at time t into the wireless satellite system to obtain the environment state $s(t+1)$ of the next round, as sketched below.
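One distributed-execution step might look like the following sketch; the epsilon-greedy exploration and the env_step interface are assumptions for illustration, not details given in the patent.

```python
import random
import torch

def execute_step(state, q_eval, actor_eval, p_max, d_subband, d_power,
                 env_step, epsilon=0.1):
    """One distributed-execution step for the agent of one spot beam:
    pick a subchannel from the Q network, a power from the actor,
    apply both to the environment, and store the transitions."""
    s = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        q_values = q_eval(s)
        subchannel = (random.randrange(q_values.numel())
                      if random.random() < epsilon       # assumed exploration
                      else int(q_values.argmax()))
        power = float(actor_eval(s)) * p_max             # p = mu(s) * P_max

    next_state, reward = env_step(subchannel, power)     # assumed env API
    d_subband.append((state, subchannel, reward, next_state))
    d_power.append((state, power, reward, next_state))
    return next_state
```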
The centralized training is as follows:

a. Define the transmission delay of the system as 1 time slot. In the top-layer centralized training part, at time slot t the agent of spot beam n obtains its latest experience $(s,a,r',s')$ from the memory pool $D_{subband}$. $D_{subband}$ is sampled to obtain a mini-batch $B_s$; the mean-square Bellman error of the target Q network is then expressed as

$$L(\psi,D)=\mathbb{E}_{(s,a,r',s')\sim D}\big[(y(r',s')-q(s,a;\psi))^2\big],$$

where $y(r',s')=r'+\gamma\max_{a'}q'(s',a';\psi_{target})$ and $\gamma$ denotes the discount coefficient.

b. The Bellman error function is minimized by gradient descent, updating the deep Q network parameter $\psi$ of the agent of spot beam n; every $T_u$ time slots the target Q network parameter is updated as $\psi_{target}\leftarrow\psi$.
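Steps a and b correspond to a standard DQN update, which can be sketched as follows; the batch handling and optimizer are assumed implementation details.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(q_eval, q_target, optimizer, d_subband, batch_size=32, gamma=0.9):
    """Minimize the mean-square Bellman error L(psi, D) on a mini-batch B_s
    sampled from the memory pool D_subband; gamma is the discount coefficient."""
    batch = random.sample(d_subband, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    s = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)

    with torch.no_grad():                               # y(r', s') target
        y = r + gamma * q_target(s2).max(dim=1).values
    q = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)  # q(s, a; psi)
    loss = F.mse_loss(q, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every T_u time slots: q_target.load_state_dict(q_eval.state_dict())
```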
c. In the bottom-layer centralized training part, at time slot t the agent of spot beam n obtains its latest experience $(s,a,r',s')$ from the memory pool $D_{power}$. $D_{power}$ is sampled to obtain a mini-batch $B_p$; the mean-square Bellman error of the critic-target network is then expressed as

$$L(\phi,D)=\mathbb{E}_{(s,a,r',s')\sim D}\big[(y_{critic}(r'_{n,m},s'_{n,m})-q(s_{n,m},a_{n,m};\phi))^2\big],$$

where $y_{critic}(r',s')=r'+\zeta\,q'(s',a';\phi_{target})$, the action $a'$ being given by the target policy, and $\zeta$ denotes the discount coefficient.

d. The Bellman error function is minimized by gradient descent, updating the critic-eval network parameter $\phi$ of the agent of spot beam n; every $T_u$ time slots the critic-target network parameter is updated as $\phi_{target}\leftarrow\phi$.

e. The actor network is updated next; since the policy is deterministic, its objective is expressed as $L(\theta,D)=\mathbb{E}_{(s,a,r',s')\sim D}\big[q(s,\mu(s;\theta);\phi)\big]$, which is maximized according to the gradient-ascent strategy, updating the actor-eval network parameter $\theta$ of the agent of spot beam n; every $T_u$ time slots the actor-target network parameter is updated as $\theta_{target}\leftarrow\theta$.
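Steps c through e amount to a DDPG-style critic and actor update, sketched below under the same assumptions as the previous sketch.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_ddpg(actor_eval, actor_target, critic_eval, critic_target,
               actor_opt, critic_opt, d_power, batch_size=32, zeta=0.9):
    """Critic update (steps c-d) and actor update (step e) on a mini-batch
    B_p sampled from the memory pool D_power; zeta is the discount coefficient."""
    batch = random.sample(d_power, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    s = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.float32).unsqueeze(1)
    r = torch.as_tensor(rewards, dtype=torch.float32).unsqueeze(1)
    s2 = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)

    # Steps c-d: minimize (y_critic - q(s, a; phi))^2, a' from the target actor.
    with torch.no_grad():
        y = r + zeta * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
    critic_loss = F.mse_loss(critic_eval(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step e: gradient ascent on q(s, mu(s; theta); phi), i.e. minimize -q.
    actor_loss = -critic_eval(torch.cat([s, actor_eval(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Every T_u time slots: critic_target and actor_target copy the eval
# parameters (phi_target <- phi, theta_target <- theta).
```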
Finally, through the above distributed execution and centralized training, subchannel and power resources are allocated online to the satellite multi-beam network based on the offline training result, so as to maximize the network band utilization.
The embodiment of the invention thus provides multi-beam satellite frequency and power joint allocation for large-scale ultra-dense constellation networking, consisting of a two-layer architecture, a subchannel allocation layer and a spot-beam power allocation layer: in the subchannel allocation layer, because frequency band allocation is discrete, a deep Q-learning algorithm selects the subchannel of each spot beam; in the power allocation layer, the DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers carry out centralized training and distributed execution together, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system band utilization. The resource allocation optimization of each satellite spot beam is treated as a Markov decision process, and resources are allocated online to the satellite multi-beam network based on the offline training result, so that it converges more stably and quickly.
Fig. 4 is a block diagram of a multi-beam mega-constellation satellite frequency and power joint distribution apparatus according to an exemplary embodiment. The apparatus is suitable for any of the above methods. Referring to fig. 4, the apparatus 300 includes:
the network model establishing module 310 is used for establishing a giant-constellation satellite multi-beam network model;
an allocation problem establishing module 320, configured to establish the satellite network resource allocation problem of maximizing the frequency-band utilization rate, and to establish a Markov decision model;
a parameter initialization module 330, configured to initialize deep neural network parameters, and initialize a current cost function and a current network Q value;
and the joint distribution module 340 is configured to train the neural network model offline by using a multi-agent deep reinforcement learning method until the reward converges to a stable value, and perform joint distribution of sub-channels and power to the satellite multi-beam network based on a training result.
Optionally, the network model establishing module 310 is further configured to set k satellites in one constellation system, with n beams per satellite and m sub-channels in the whole system, where L is the number of times each channel can be reused; the channel gain is established as h^m_{k,n}(t), representing the channel gain from satellite k to spot beam n on the m-th sub-channel at time slot t.
Optionally, the allocation problem establishing module 320 is configured to characterize, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and to calculate the frequency-band utilization of spot beam n on sub-channel m in time slot t;
to respectively define the action space and state space of the deep Q network of the first-layer sub-channel allocation layer and of the actor-critic network of the second-layer power allocation layer, and to establish a Markov decision model;
and, according to the Markov decision model, to establish the spot-beam resource optimization problem and its objective function.
Optionally, the allocation problem establishing module 320 is configured to characterize, in combination with the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t according to the following formula (1):

Γ^m_n(t) = [ α^m_n(t) · p^m_n(t) · G_max · h^m_{k_n,n}(t) ] / [ Σ_{l∈N, l≠n} α^m_l(t) · p^m_l(t) · G(θ_{l,n}) · h^m_{k_l,n}(t) + δ² ]   (1)

wherein α^m_n(t) ∈ {0, 1} represents the channel-selection case, 1 representing that sub-channel m is selected and 0 that it is not; k_n represents the serving satellite selected by spot beam n, and G(θ) represents the antenna pattern; G_max is the maximum antenna gain; θ_{l,n} represents the angular difference between spot beams l and n as seen from satellite k_l; δ² represents the noise of the receiver at spot beam n; p^m_n(t) is the corresponding spot-beam transmit power.

The frequency-band utilization C of spot beam n on sub-channel m in time slot t is expressed according to the following formula (2):

C^m_n(t) = log₂(1 + Γ^m_n(t))   (2)
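As a numerical illustration of formulas (1) and (2) as reconstructed here, a NumPy sketch follows; the array layout (beams × sub-channels) and argument names are assumptions.

```python
import numpy as np

def band_utilization(alpha, p, h, G_theta, g_max, noise, n, m):
    """CINR (formula (1)) and band utilization (formula (2)) of spot beam n
    on sub-channel m at one time slot.

    alpha[l, m] in {0, 1} -- channel-selection indicator of beam l
    p[l, m]              -- transmit power of beam l on sub-channel m
    h[l, m]              -- channel gain toward beam n's receiver
    G_theta[l]           -- antenna pattern value G(theta_{l,n})
    """
    signal = alpha[n, m] * p[n, m] * g_max * h[n, m]
    interference = sum(alpha[l, m] * p[l, m] * G_theta[l] * h[l, m]
                       for l in range(alpha.shape[0]) if l != n)
    gamma = signal / (interference + noise)   # formula (1): CINR of beam n
    return np.log2(1.0 + gamma)               # formula (2): band utilization C
```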
Optionally, the allocation problem establishing module 320 is configured to define the action space and state space of the deep Q network of the first-layer sub-channel allocation layer, to initialize the current cost function and current network Q value, and to carry out the distributed execution part of the system, where each spot beam is taken to represent one agent. The state vector of the agent corresponding to spot beam n at first-layer time slot t is s_n(t), and the Q-target network function is q(s, a; ψ_target), wherein ψ is the deep Q network hyper-parameter; the action obtaining the maximum Q value, a_n(t) = argmax_a q(s_n(t), a; ψ), is selected, indicating that the channel selected at time t is sub-channel m.

The action space and state space of the actor-critic network of the second-layer power allocation layer are defined: the policy network μ(s; θ) yields an action a_{n,m}(t) = μ(s_{n,m}(t); θ), the corresponding spot-beam transmit power is p_{n,m}(t), and the critic network of the agent corresponding to spot beam n is used to output the value function q(s, μ(s, θ); φ) of the current power-selection action;

where θ is the actor network hyper-parameter, P_max is the maximum power the spot beam can reach, and φ is the critic network hyper-parameter.
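The two-layer action selection just defined can be sketched as follows; the ε-greedy exploration and the scaling of the actor output to [0, P_max] are assumptions beyond the text.

```python
import random
import torch

def select_actions(q_net, actor, s_sub, s_pow, p_max, eps=0.1):
    """Top layer: pick the sub-channel with the largest Q value (epsilon-greedy);
    bottom layer: map the actor output mu(s; theta) to a power in [0, P_max]."""
    q_values = q_net(s_sub)
    if random.random() < eps:                 # exploration (assumed; the text fixes only the greedy rule)
        m = random.randrange(q_values.shape[-1])
    else:
        m = int(q_values.argmax())            # a = argmax_a q(s, a; psi)
    a = actor(s_pow)                          # a = mu(s; theta)
    power = float(torch.clamp(a, 0.0, 1.0)) * p_max   # assumed scaling of the action to [0, P_max]
    return m, power
```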
Optionally, the allocation problem establishing module 320 is further configured to establish the spot-beam resource optimization problem, establishing the objective function as the following formula (3):

max Σ_{n∈N} Σ_{m∈M} C^m_n(t)
s.t. C1: 0 ≤ p^m_n(t) ≤ P_max, ∀n∈N
     C2: α^m_n(t) ∈ {0, 1}, ∀n∈N, m∈M
     C3: Σ_{m∈M} α^m_n(t) = 1, ∀n∈N
     C4: Σ_{n∈N} α^m_n(t) ≤ T_reuse, ∀m∈M   (3)

wherein T_reuse represents the maximum number of reuses of a single sub-channel; N represents the spot-beam set and M the sub-channel set; constraint C1 expresses that the maximum transmit power of each spot beam n is P_max; constraint C2 expresses that each sub-channel has only two states at time slot t, selected or not selected; constraint C3 expresses that each spot beam n must select one sub-channel as its transmission channel at time slot t; constraint C4 expresses that each sub-channel m can be reused at most T_reuse times at time slot t, i.e. at most T_reuse spot beams can occupy sub-channel m in time slot t.

The first-layer sub-channel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward and cyclically accumulate reward values; the sub-channel allocation layer and the power allocation layer share the same reward function r_n(t), which consists of the frequency-band utilization of the spot beam itself in time slot t and a penalty term for the interference caused to the other co-frequency beams.
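A sketch of the shared reward, assuming the penalty enters subtractively with an illustrative weight β:

```python
def reward(c_nm, interference_caused, beta=0.1):
    """r_n(t): own band utilization C_n^m(t) minus a penalty proportional to the
    interference this beam inflicts on the other beams sharing sub-channel m.
    The subtractive form and the weight beta are assumptions."""
    return c_nm - beta * interference_caused
```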
Optionally, the parameter initialization module 330 is configured such that the deep Q network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the deep Q neural network, the deep Q neural network hyper-parameter ψ, and the deep-Q-target neural network hyper-parameter ψ_target;
the operator structure network model parameters are
Figure BDA0003625555270000216
Wherein it is present>
Figure BDA0003625555270000217
Expresses the number of neurons in the input layer of the operator-even deep neural network and the operator-target deep neural network, and/or the neuron is selected to be in the neural network>
Figure BDA0003625555270000218
Represents the number of hidden layer neurons in the operator-even deep neural network and the operator-target deep neural network, and/or the number of hidden layer neurons in the operator-even deep neural network>
Figure BDA0003625555270000219
Expressing the neuron number of the output layer of the operator-eval deep neural network and the operator-target deep neural network, theta expressing the ultra parameter of the operator-eval deep neural network, theta target Representing the ultra parameters of the operator-target deep neural network;
the critic structural network model parameters are
Figure BDA00036255552700002110
Wherein it is present>
Figure BDA00036255552700002111
Represents the number of neurons in the input layer of the critic-eval deep neural network and the actor-target deep neural network, and then the neurons are/is>
Figure BDA0003625555270000221
Representing the number of hidden layer neurons in the critical-eval deep neural network and the critical-target deep neural network, and based on the hidden layer neurons in the neural network and the hidden layer neurons in the bright-eval deep neural network>
Figure BDA0003625555270000222
Expressing the number of neurons of an output layer of the critical-eval deep neural network and the operator-target deep neural network, phi expressing the super parameter of the critical-eval deep neural network, phi target The representation represents the critic-target deep neural network hyperparameter.
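A minimal sketch instantiating the three network families with the parameter sets just listed; the concrete layer widths and activations are illustrative assumptions.

```python
import torch.nn as nn

def mlp(n_in, n_hidden, n_out, out_act=None):
    """Input layer -> hidden layer -> output layer, as parameterized above."""
    layers = [nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

n_state, n_channels = 8, 4          # assumed state dimension and sub-channel count
q_eval        = mlp(n_state, 64, n_channels)        # psi
q_target      = mlp(n_state, 64, n_channels)        # psi_target
actor_eval    = mlp(n_state, 64, 1, nn.Sigmoid())   # theta (action in [0, 1])
actor_target  = mlp(n_state, 64, 1, nn.Sigmoid())   # theta_target
critic_eval   = mlp(n_state + 1, 64, 1)             # phi (state-action value)
critic_target = mlp(n_state + 1, 64, 1)             # phi_target

# Target networks start as copies of their eval counterparts.
for src, dst in ((q_eval, q_target), (actor_eval, actor_target), (critic_eval, critic_target)):
    dst.load_state_dict(src.state_dict())
```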
Optionally, the joint allocation module 340 is configured to obtain the instantaneous reward through the Bellman equation and run the deep reinforcement learning algorithm, which comprises a distributed execution part and a centralized training part: the distributed execution part is responsible for selecting a new action according to the Q value, generating a new state value by interacting with the environment, and storing it into the top-layer memory pool D_subband and the bottom-layer memory pool D_power; the centralized training part is responsible for continuously updating the parameters according to the instantaneous reward until the reward value converges, after which training and execution stop and the optimal resource allocation scheme is obtained.
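The memory-pool and execution/training split can be sketched as a skeleton; the environment and agent interfaces (observe, act, apply) are hypothetical names, and the pool sizes are assumptions.

```python
from collections import deque

D_subband = deque(maxlen=10_000)   # top-layer memory pool
D_power   = deque(maxlen=10_000)   # bottom-layer memory pool

def distributed_step(env, agents):
    """Distributed execution: each agent acts and stores its experience."""
    rewards = []
    for agent in agents:
        s = agent.observe(env)                 # current state
        m, power = agent.act(s)                # new action from Q value / actor
        r, s2 = env.apply(agent, m, power)     # interact with the environment
        D_subband.append((s, m, r, s2))        # experience for the deep Q layer
        D_power.append((s, power, r, s2))      # experience for the DDPG layer
        rewards.append(r)
    return rewards                             # used to test reward convergence
```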
In the embodiment of the invention, a multi-beam giant-constellation satellite frequency and power joint allocation device for large-scale-constellation ultra-dense networking is provided. The device consists of a two-layer framework: a sub-channel allocation layer and a spot-beam power allocation layer. In the sub-channel allocation layer, because frequency-band allocation is discrete, a deep Q-learning algorithm is adopted to select the sub-channel of each spot beam; in the power allocation layer, a DDPG algorithm performs continuous spot-beam power allocation at the physical layer. The two layers jointly perform centralized training and distributed execution, optimizing the frequency allocation and transmit power of the spot beams and maximizing the system frequency-band utilization. The resource allocation optimization problem of each spot beam of the satellite is treated as a Markov decision process, and resource allocation is performed online for the satellite multi-beam network based on the offline training result, making it more stable and faster to converge.
Fig. 5 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where the memory 402 stores at least one instruction that is loaded and executed by the processor 401 to implement the following multi-beam giant-constellation satellite frequency and power joint allocation method steps:
S1: establishing a giant-constellation satellite multi-beam network model;
S2: according to the network model, establishing the satellite network resource allocation problem of maximizing the frequency-band utilization rate, establishing a Markov decision model, and establishing the spot-beam resource optimization problem;
S3: constructing a deep neural network model according to the spot-beam resource optimization problem; initializing the deep neural network parameters, and initializing the current cost function and current network Q value;
S4: training the deep neural network model offline through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing joint allocation of frequency and power on the satellite multi-beam network based on the training result.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the above multi-beam giant-constellation satellite frequency and power joint allocation method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A multi-beam giant-constellation satellite frequency and power joint distribution method, characterized by comprising the following steps:
s1: establishing a multi-beam network model of a giant constellation satellite;
s2: according to the network model, a satellite network resource allocation problem with the maximized frequency band utilization rate is established, a Markov decision model is established, and a resource optimization problem of spot beams is established;
s3: constructing a deep neural network model according to the resource optimization problem of the spot beams; initializing a deep neural network parameter, and initializing a current value function and a current network Q value;
s4: and training the deep neural network model in an off-line manner by using a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing combined distribution of frequency and power on the satellite multi-beam network based on a training result.
2. The method according to claim 1, wherein the parameters of the giant-constellation satellite multi-beam network model comprise: k satellites are set in one constellation system, each satellite uses n beams, and the whole system has m sub-channels; the channel gain is established as h^m_{k,n}(t), representing the channel gain from satellite k to spot beam n on the m-th sub-channel at time slot t.
3. The method of claim 2, wherein in step S2, establishing a satellite network resource allocation problem with maximized frequency band utilization, establishing a Markov decision model, and establishing a resource optimization problem for spot beams comprises:
S21: combining the interference generated between co-frequency beams, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t, and calculating the frequency-band utilization of spot beam n on sub-channel m in time slot t;
S22: respectively defining the action space and state space of the deep Q network of the first-layer sub-channel allocation layer and of the actor-critic network of the second-layer power allocation layer, and establishing a Markov decision model;
S23: according to the Markov decision model, establishing the spot-beam resource optimization problem and establishing an objective function.
4. The method according to claim 3, wherein in step S21, in combination with the interference generated between co-frequency beams, characterizing the carrier-to-interference-plus-noise ratio of spot beam n at time slot t and calculating the frequency-band utilization of spot beam n on sub-channel m in time slot t comprises:
combining the interference generated between co-frequency beams, the carrier-to-interference-plus-noise ratio of spot beam n at time slot t is characterized according to the following formula (1):

Γ^m_n(t) = [ α^m_n(t) · p^m_n(t) · G_max · h^m_{k_n,n}(t) ] / [ Σ_{l∈N, l≠n} α^m_l(t) · p^m_l(t) · G(θ_{l,n}) · h^m_{k_l,n}(t) + δ² ]   (1)

wherein α^m_n(t) ∈ {0, 1} represents the channel-selection case, 1 representing that sub-channel m is selected and 0 that it is not; k_n represents the serving satellite selected by spot beam n, and G(θ) represents the antenna pattern; G_max is the maximum antenna gain; θ_{l,n} represents the angular difference between spot beams l and n as seen from satellite k_l; δ² represents the noise of the receiver at spot-beam user n; p^m_n(t) is the corresponding spot-beam transmit power;
the frequency-band utilization C of spot beam n on sub-channel m in time slot t is expressed according to the following formula (2):

C^m_n(t) = log₂(1 + Γ^m_n(t))   (2)
5. The method of claim 3, wherein step S22, respectively defining the action space and state space of the deep Q network of the first-layer sub-channel allocation layer and of the actor-critic network of the second-layer power allocation layer to build the Markov decision model, comprises:
S221: defining the action space and state space of the deep Q network of the first-layer sub-channel allocation layer, initializing the current cost function and current network Q value, and carrying out the distributed execution part of the system, where each spot beam is taken to represent one agent; the state vector of the agent corresponding to spot beam n at first-layer time slot t is s_n(t), and the Q-target network function is q(s, a; ψ_target), wherein ψ is the deep Q network hyper-parameter; the action obtaining the maximum Q value, a_n(t) = argmax_a q(s_n(t), a; ψ), is selected, indicating that the channel selected at time t is sub-channel m;
S222: defining the action space and state space of the actor-critic network of the second-layer power allocation layer: the policy network μ(s; θ) yields an action a_{n,m}(t) = μ(s_{n,m}(t); θ), the corresponding spot-beam transmit power is p_{n,m}(t), and the critic network of the agent corresponding to spot beam n is used to output the value function q(s, μ(s, θ); φ) of the current power-selection action;
wherein θ is the actor network hyper-parameter, P_max is the maximum power the spot beam can reach, and φ is the critic network hyper-parameter.
6. The method of claim 5, wherein in step S23, establishing the spot-beam resource optimization problem according to the Markov decision model and establishing an objective function comprises:
establishing the spot-beam resource optimization problem, and establishing the objective function as the following formula (3):

max Σ_{n∈N} Σ_{m∈M} C^m_n(t)
s.t. C1: 0 ≤ p^m_n(t) ≤ P_max, ∀n∈N
     C2: α^m_n(t) ∈ {0, 1}, ∀n∈N, m∈M
     C3: Σ_{m∈M} α^m_n(t) = 1, ∀n∈N
     C4: Σ_{n∈N} α^m_n(t) ≤ T_reuse, ∀m∈M   (3)

wherein T_reuse represents the maximum number of reuses of a single sub-channel; N represents the spot-beam set and M the sub-channel set; constraint C1 expresses that the maximum transmit power of each spot beam n is P_max; constraint C2 expresses that each sub-channel has only two states at time slot t, selected or not selected; constraint C3 expresses that each spot beam n must select one sub-channel as its transmission channel at time slot t; constraint C4 expresses that each sub-channel m can be reused at most T_reuse times at time slot t, i.e. at most T_reuse spot beams can occupy sub-channel m in time slot t;
the first-layer sub-channel allocation layer and the second-layer power allocation layer select actions so as to maximize the fed-back reward and cyclically accumulate reward values; the sub-channel allocation layer and the power allocation layer share the same reward function r_n(t), which consists of the frequency-band utilization of the spot beam itself in time slot t and a penalty term for the interference caused to the other co-frequency beams.
7. The method according to claim 1, wherein in step S3, initializing the deep neural network parameters and initializing the current cost function and current network Q value comprises:
the deep Q network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the deep Q neural network, the deep Q neural network hyper-parameter ψ, and the deep-Q-target neural network hyper-parameter ψ_target;
the actor network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the actor-eval and actor-target deep neural networks, the actor-eval deep neural network hyper-parameter θ, and the actor-target deep neural network hyper-parameter θ_target;
the critic network model parameters comprise the numbers of input-layer, hidden-layer and output-layer neurons of the critic-eval and critic-target deep neural networks, the critic-eval deep neural network hyper-parameter φ, and the critic-target deep neural network hyper-parameter φ_target.
8. The method according to claim 1, wherein in step S4, the neural network model is trained offline through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and the joint allocation of sub-channels and power is performed on the satellite multi-beam network based on the training result, comprising:
obtaining the instantaneous reward through the Bellman equation and running the deep reinforcement learning algorithm, which comprises a distributed execution part and a centralized training part; the distributed execution part is responsible for selecting a new action according to the Q value, generating a new state value by interacting with the environment, and storing it into the top-layer memory pool D_subband and the bottom-layer memory pool D_power; the centralized training part is responsible for continuously updating the parameters according to the instantaneous reward until the reward value converges, after which training and execution stop and the optimal resource allocation scheme is obtained.
9. A multi-beam mega constellation satellite frequency and power joint distribution apparatus, adapted for use in the method of any of the preceding claims 1-8, the apparatus comprising:
the network model building module is used for building a multi-beam network model of a giant constellation satellite;
the distribution problem establishing module is used for establishing a satellite network resource distribution problem with the maximized frequency band utilization rate and establishing a Markov decision model;
the parameter initialization module is used for initializing the deep neural network parameters and initializing the current value function and the current network Q value;
and the joint distribution module is used for off-line training the neural network model through a multi-agent deep reinforcement learning method until the reward converges to a stable value, and performing joint distribution of sub-channels and power on the satellite multi-beam network based on a training result.
10. The apparatus of claim 9, wherein the network model building module is configured to set k satellites in a constellation system, each satellite having n beams and the entire system having m sub-channels, and wherein the channel gain is expressed as h^m_{k,n}(t), representing the channel gain from satellite k to spot beam n on the m-th sub-channel at time slot t.
CN202210468483.2A 2022-04-29 2022-04-29 Multi-beam giant-constellation satellite frequency and power joint distribution method and device Active CN114978278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210468483.2A CN114978278B (en) 2022-04-29 2022-04-29 Multi-beam giant-constellation satellite frequency and power joint distribution method and device


Publications (2)

Publication Number Publication Date
CN114978278A CN114978278A (en) 2022-08-30
CN114978278B true CN114978278B (en) 2023-04-14

Family

ID=82980189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210468483.2A Active CN114978278B (en) 2022-04-29 2022-04-29 Multi-beam giant-constellation satellite frequency and power joint distribution method and device

Country Status (1)

Country Link
CN (1) CN114978278B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115441939B (en) * 2022-09-20 2024-03-22 深圳泓越信息科技有限公司 MADDPG algorithm-based multi-beam satellite communication system resource allocation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113572517A (en) * 2021-07-30 2021-10-29 哈尔滨工业大学 Beam hopping resource allocation method, system, storage medium and equipment based on deep reinforcement learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9473250B2 (en) * 2014-09-03 2016-10-18 Mitsubishi Electric Research Laboratories, Inc. System and method for recovering carrier phase in optical communications
EP3637324A1 (en) * 2018-10-08 2020-04-15 Nokia Solutions and Networks Oy Geometric constellation shaping for optical data transport
CN111200459B (en) * 2020-01-09 2022-01-28 南京凯瑞得信息科技有限公司 Channel allocation and power control method for uplink multi-beam satellite
WO2021254607A1 (en) * 2020-06-17 2021-12-23 European Space Agency Pragmatic solutions for massive mimo for broadband telecommunication systems
CN112566253B (en) * 2020-11-10 2022-09-06 北京科技大学 Wireless resource allocation joint optimization method and device
CN113691332B (en) * 2021-09-10 2022-05-27 北京科技大学 Co-channel interference characterization method and device of low-earth-orbit satellite communication system


Also Published As

Publication number Publication date
CN114978278A (en) 2022-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant