CN113438723B - Dueling deep Q-network power control method with high reward and penalty - Google Patents

Dueling deep Q-network power control method with high reward and penalty

Info

Publication number
CN113438723B
CN113438723B (application CN202110701419.XA)
Authority
CN
China
Prior art keywords
user
signal
noise ratio
secondary user
primary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110701419.XA
Other languages
Chinese (zh)
Other versions
CN113438723A (en)
Inventor
刘骏
刘德荣
王永华
林得有
王宇慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tuo Ai Wei Information Technology Co., Ltd.
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110701419.XA priority Critical patent/CN113438723B/en
Publication of CN113438723A publication Critical patent/CN113438723A/en
Application granted granted Critical
Publication of CN113438723B publication Critical patent/CN113438723B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/06TPC algorithms
    • H04W52/14Separate analysis of uplink or downlink
    • H04W52/146Uplink power control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/241TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/Io
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/242TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account path loss
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/245TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account received signal strength
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/30TPC using constraints in the total amount of available transmission power
    • H04W52/34TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W52/346TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a dueling deep Q-network power control method with high reward and penalty, which improves the reward function used in the deep reinforcement learning process: reward values are graded according to the outcome of the secondary users' spectrum access, so that different actions receive different rewards. The most correct access actions (complete success) are given a high reward and the most incorrect ones (complete failure) a high penalty, so that the system quickly explores a successful access strategy. Combining the dueling deep Q-network with this graded high-reward/high-penalty function and applying it to dynamic spectrum power control effectively improves system stability, increases the total throughput of the secondary users, and reduces power loss, thereby saving energy.

Description

Dueling deep Q-network power control method with high reward and penalty
Technical Field
The invention relates to the field of cognitive radio control methods, in particular to a dueling deep Q-network power control method with high reward and penalty.
Background
With the rapid development and wide use of wireless communication technology, the demand for spectrum resources keeps growing, while wireless spectrum itself is a finite resource approaching depletion; this contradiction has become a major obstacle to the further development of wireless communications. Most spectrum, however, is still allocated in a conventional, fixed manner: a specific band is assigned to a specific user, and other users may use it only with authorization. Extensive research in academia and industry shows that, on the one hand, much licensed spectrum is not actually used by the authorized users — many licensed bands sit idle and their utilization is low — while, on the other hand, the unlicensed public bands suffer from fierce contention and congestion. Resolving these contradictions in the spectrum allocation process and improving spectrum utilization is therefore of great importance.
Cognitive Radio (CR) technology was proposed to alleviate the shortage of spectrum resources and the low spectrum utilization. The cognitive cycle of a cognitive radio consists of six steps: Observe, Orient, Learn, Decide, Plan and Act. By observing and learning from the external environment, a cognitive radio intelligently adjusts its own decisions and orientation, carries out the corresponding plans and actions, and thereby adapts itself to the environment. For spectrum sharing, the core idea of cognitive radio is that a Secondary User (SU) senses the surrounding radio environment and accesses the spectrum on the premise of not interfering with the licensed Primary User (PU) that holds the right to use the spectrum, thereby improving spectrum utilization.
Deep reinforcement learning algorithms, developed by combining Reinforcement Learning (RL) with deep learning, have reached levels comparable to humans in many artificial-intelligence domains such as Go (Weiqi), Dota and StarCraft II. Specifically, the deep Q-network (DQN) combines the RL process with a deep neural network that approximates the Q action-value function, which compensates for the limitations of Q-learning in generalization and function-approximation capability. The dueling deep Q-network (Dueling DQN) improves on the ordinary DQN by re-evaluating the Q value as the sum of the state value and the advantage value of the action taken in that state.
In recent research, the DQN algorithm has been applied to spectrum allocation, and simulation results show a higher convergence speed and a lower packet-loss rate. To overcome the challenges of unknown, dynamic industrial Internet-of-Things environments, an improved deep Q-learning network has also been proposed for managing industrial Internet-of-Things spectrum resources. Dueling deep reinforcement learning has even been applied to predicting the heavy-metal content of soil, with fairly good results. However, these deep reinforcement learning methods either do not combine the state value with the action value in that state, or do not grade the reward function according to how successful spectrum access is when designing it.
Disclosure of Invention
The invention provides a dueling deep Q-network power control method with high reward and penalty, which considers the values of states and actions simultaneously and sums them for re-evaluation, thereby effectively improving system stability.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A dueling deep Q-network power control method with high reward and penalty, comprising the steps of:
S1: an auxiliary base station collects the communication information of the primary users and the secondary users and transmits the obtained information to the secondary users;
S2: setting the transmit power selected by each secondary user in each time slot as the action value, and constructing the action space;
S3: constructing a graded reward function with high reward and penalty;
S4: constructing a power control strategy.
Further, the specific process of step S1 is:
Because the primary users and the secondary users are in a non-cooperative relationship, a secondary user accesses the primary users' channel in underlay mode, and neither side can learn the other's power-transmission strategy. During signal transmission the auxiliary base stations play an important role: they are responsible for collecting the communication information of the primary and secondary users and transmitting the obtained information to the secondary users. Assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the kth auxiliary base station is defined as:
s_k(t) = Σ_{i=1}^{m} P_i(t)·(l_ik(t)/l_0(t))^(−τ) + Σ_{j=1}^{n} P_j(t)·(l_jk(t)/l_0(t))^(−τ) + σ(t)
where l_ik(t) and l_jk(t) are the distances at time t from the kth auxiliary base station to the ith primary user and the jth secondary user respectively, l_0(t) represents the reference distance, τ the path-loss exponent, and σ(t) the average noise power of the system. At time t, secondary user k selects an action in state s_k(t), after which it enters the next state following s_k(t).
Further, in step S2, the transmission power selected by the secondary users in each time slot is set to be an action value, the transmission power of each secondary user is a discretized value, and each secondary user selects H different transmission values, so H is shared n A selectable action space is defined as:
A(t) = [P_1(t), P_2(t), ..., P_n(t)].
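As an illustration, the joint action space can be enumerated directly; a minimal sketch in Python follows, in which the number of power levels H, the number of secondary users n and the power values themselves are hypothetical and not taken from the patent:

```python
import itertools

H = 5  # number of discrete power levels per secondary user (assumed)
n = 3  # number of secondary users (assumed)
# Hypothetical discrete transmit-power levels, in mW
power_levels = [0.5, 1.0, 1.5, 2.0, 2.5]

# Joint action space: one power choice per secondary user -> H**n actions
action_space = list(itertools.product(power_levels, repeat=n))
assert len(action_space) == H ** n

# A(t) is one joint action: the transmit powers [P_1(t), ..., P_n(t)]
A_t = action_space[42]
print(A_t)
```

The exponential growth of H^n is one reason the transmit powers are kept to a small discretized set.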
Further, in step S3, four indexes are designed to evaluate how successful the secondary users' spectrum access is; they are defined as follows:
C_1: γ_i(t) > μ_i for every primary user i
C_2: γ_j(t) > μ_j for every secondary user j
C_3: γ_j(t) > μ_j for at least one secondary user j
C_4: ΣP_i(t) > ΣP_j(t)
where γ_i(t) and γ_j(t) denote the signal-to-noise ratio of any primary user and any secondary user respectively, μ_i and μ_j are the thresholds preset for the primary and secondary users, and ΣP_i and ΣP_j denote the sum of the transmit powers of the primary users and of the secondary users over the access channels;
In step S3, whether the signal-to-noise ratio of every primary user exceeds its preset threshold is taken as the first prerequisite for judging whether power control succeeds. If any primary user's signal-to-noise ratio fails to exceed its threshold, the spectrum access is directly judged a complete failure CF. If every primary user's signal-to-noise ratio exceeds its threshold but no secondary user's does, the case is called a secondary failure SF. If every primary user's and every secondary user's signal-to-noise ratio exceeds its threshold, and the total primary-user transmit power of the access channels exceeds the total secondary-user transmit power, the access is called a complete success CS. Under the complete-success conditions, if only some of the secondary users' signal-to-noise ratios exceed the threshold while the remaining conditions are unchanged, the access is called a secondary success SS. Formally:
outcome(t) = CF, if C_1 does not hold;
             SF, if C_1 holds but C_3 does not;
             SS, if C_1, C_3 and C_4 hold but C_2 does not;
             CS, if C_1, C_2 and C_4 all hold
According to the above grading conditions, the reward function is defined as:
r(t) = a_1, if CS;  a_2, if SS;  −a_4, if SF;  −a_3, if CF
In the above, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the spectrum-access outcome: the secondary users' complete access success earns a high reward and their complete access failure incurs a high penalty, so the system explores a successful access strategy more quickly.
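A minimal Python sketch of this graded reward follows; the threshold arguments and the constants a1..a4 are hypothetical placeholders chosen only to satisfy a_1 > 10a_2 and a_3 > 10a_4, and the final fallback branch is an assumption since the patent does not spell out that case:

```python
def graded_reward(gamma_pu, gamma_su, mu_pu, mu_su, p_pu_sum, p_su_sum,
                  a1=100.0, a2=5.0, a3=100.0, a4=5.0):
    """Graded high-reward/high-penalty function per the CF/SF/CS/SS classes."""
    all_pu_ok = all(g > mu_pu for g in gamma_pu)    # C_1: every PU above threshold
    all_su_ok = all(g > mu_su for g in gamma_su)    # C_2: every SU above threshold
    some_su_ok = any(g > mu_su for g in gamma_su)   # C_3: at least one SU above threshold
    power_ok = p_pu_sum > p_su_sum                  # C_4: PU power dominates SU power

    if not all_pu_ok:
        return -a3          # complete failure CF: high penalty
    if not some_su_ok:
        return -a4          # secondary failure SF: small penalty
    if all_su_ok and power_ok:
        return a1           # complete success CS: high reward
    if power_ok:
        return a2           # secondary success SS: small reward
    return -a4              # assumed: C_4 violated, treated like a failure
```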
Further, in step S4, the main user is defined to perform power transmission according to the following strategy, and the power control strategy is as follows:
P_i(t+1) = P_i(t) + ΔP, if γ_i(t) ≤ μ_i and γ'_i(t+1) ≥ μ_i
P_i(t+1) = P_i(t) − ΔP, if γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i
P_i(t+1) = P_i(t), otherwise
where ΔP is the power step;
Under this strategy, each primary user controls its transmit power by stepwise updates at every time t:
when the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and its predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t+1) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i, the primary user decreases its transmit power; otherwise it keeps the current transmit power unchanged. The signal-to-noise ratio predicted by primary user i for time t+1 is:
γ'_i(t+1) = h_ii(t)P_i(t+1) / (Σ_{j=1}^{n} h_ji(t)P_j(t) + N_i(t))
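A minimal sketch of this stepwise primary-user update, assuming a fixed step size delta_p and power bounds that the patent does not specify:

```python
def pu_power_update(p_i, gamma_now, gamma_pred, mu_i,
                    delta_p=0.1, p_min=0.0, p_max=5.0):
    """Stepwise primary-user power control (step size and bounds assumed)."""
    if gamma_now <= mu_i and gamma_pred >= mu_i:
        return min(p_i + delta_p, p_max)   # below threshold: raise power
    if gamma_now >= mu_i and gamma_pred >= mu_i:
        return max(p_i - delta_p, p_min)   # safely above threshold: save power
    return p_i                             # otherwise keep the current power
```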
The secondary users access the primary users' channel in underlay mode, and in order not to affect the primary users' normal communication, strict requirements are placed on the secondary users' transmit power. To avoid such interference, each secondary user must continuously learn from the data collected by the auxiliary base stations and then complete its transmission task with a suitable transmit power. The signal-to-noise ratio is an important indicator of link quality. The signal-to-noise ratio of the ith primary user is defined as:
γ_i(t) = h_ii(t)P_i(t) / (Σ_{j=1}^{n} h_ji(t)P_j(t) + N_i(t))
and the signal-to-noise ratio of the jth secondary user as:
γ_j(t) = h_jj(t)P_j(t) / (h_ij(t)P_i(t) + Σ_{k≠j} h_kj(t)P_k(t) + N_j(t))
where h_ii and h_jj denote the channel gains of the ith primary user and the jth secondary user respectively; P_i(t) and P_j(t) are the transmit powers of the ith primary user and the jth secondary user at time t; h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the ith primary user and the jth secondary user, between the jth secondary user and the ith primary user, and between the kth and the jth secondary users, respectively; and N_i(t) and N_j(t) denote the ambient noise received by the ith primary user and the jth secondary user respectively. The channel gains, transmit powers and so on change dynamically; according to Shannon's theorem, the relationship between the throughput of the jth secondary user and its signal-to-noise ratio is defined as:
T_j(t) = W·log_2(1 + γ_j(t))
where W is the channel bandwidth.
In this dynamically changing system, to guarantee the best power-allocation effect, the primary users' signal-to-noise ratios must stay above their preset thresholds while the secondary users adjust their own transmit power through continuous learning, so that the total throughput of the secondary users in the whole system is maximized.
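For illustration, a small numeric sketch of these two formulas; every gain, power and the bandwidth W below is an invented value, not a figure from the patent:

```python
import math

W = 1e6                         # channel bandwidth in Hz (assumed)
h_jj, p_j = 0.8, 1.5e-3         # SU j's own channel gain and transmit power (assumed)
h_ij, p_i = 0.2, 4.0e-3         # interference gain and power from PU i (assumed)
other_su_interference = 1e-4    # aggregate interference from the other SUs (assumed)
n_j = 1e-6                      # ambient noise at SU j (assumed)

# SNR of secondary user j, then its Shannon throughput
gamma_j = (h_jj * p_j) / (h_ij * p_i + other_su_interference + n_j)
t_j = W * math.log2(1 + gamma_j)
print(f"gamma_j = {gamma_j:.2f}, T_j = {t_j / 1e6:.2f} Mbit/s")
```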
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention improves the rewarding function in the deep reinforcement learning process, classifies the rewarding function according to the spectrum access condition of the secondary user, and gives different actions with different rewarding values. Giving high rewards to the most successful actions of the most correct access and giving high penalties to the most failed actions of the most incorrect access, so that the system can quickly explore the strategy of successful access; the competition depth Q network is combined with the grading rewarding function of the high rewarding penalty and is applied to dynamic power control of the frequency spectrum, so that the stability of the system can be effectively improved, the total throughput of secondary users can be improved, the power loss is reduced, and the effect of saving energy is achieved.
Drawings
FIG. 1 is a diagram of a system in which the method of the present invention is employed;
FIG. 2 is a structural diagram of the ordinary DQN network;
FIG. 3 is a structural diagram of the Dueling DQN network;
FIG. 4 is a graph comparing loss functions of three different deep reinforcement learning algorithms;
FIG. 5 shows the cumulative reward curves of three different deep reinforcement learning algorithms trained for 40000 iterations;
FIG. 6 shows the total secondary-user throughput of three different deep reinforcement learning algorithms trained for 40000 iterations;
FIG. 7 shows the average transmit power of the secondary users for three different deep reinforcement learning algorithms.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, in an area centered on a Primary Base Station (PBS), assume the cognitive wireless network contains m Primary Users (PU) and n Secondary Users (SU) (n > m), one primary base station and several Auxiliary Base Stations (ABS), with the primary users, secondary users and auxiliary base stations randomly distributed in the network environment. The primary base station guarantees the normal operation of the primary users' communication, while the auxiliary base stations collect the received-signal-strength information of the primary and secondary users and send the collected data back to the secondary users.
In this model, a secondary user accesses the primary users' channel in underlay mode, and strict requirements are placed on its transmit power so as not to affect the primary users' normal communication. To avoid such interference, the secondary user must continuously learn from the data collected by the auxiliary base stations and then complete its transmission task with a suitable transmit power.
The signal-to-noise ratio is an important indicator of link quality. The signal-to-noise ratio of the ith primary user is defined as:
γ_i(t) = h_ii(t)P_i(t) / (Σ_{j=1}^{n} h_ji(t)P_j(t) + N_i(t))  (1)
and the signal-to-noise ratio of the jth secondary user as:
γ_j(t) = h_jj(t)P_j(t) / (h_ij(t)P_i(t) + Σ_{k≠j} h_kj(t)P_k(t) + N_j(t))  (2)
where h_ii and h_jj denote the channel gains of the ith primary user and the jth secondary user respectively; P_i(t) and P_j(t) are the transmit powers of the ith primary user and the jth secondary user at time t; h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the ith primary user and the jth secondary user, between the jth secondary user and the ith primary user, and between the kth and the jth secondary users, respectively; and N_i(t) and N_j(t) denote the ambient noise received by the ith primary user and the jth secondary user respectively.
The channel gains, transmit powers and so on of the model change dynamically. According to Shannon's theorem, the relationship between the throughput of the jth secondary user and its signal-to-noise ratio is defined as:
T_j(t) = W·log_2(1 + γ_j(t))  (3)
where W is the channel bandwidth.
in the dynamically-changing system, the optimal power distribution effect of the system is ensured, the signal-to-noise ratio of a main user is higher than a preset threshold value, and the secondary user can adjust the self-transmitting power through continuous learning, so that the total throughput of the secondary user in the whole system is maximized.
The invention adopts a Dueling DQN with an improved reward function to perform dynamic power control of the spectrum: a secondary user adaptively adjusts its own transmit power according to the information obtained from the auxiliary base stations, thereby completing the dynamic power control of the cognitive wireless network.
The Dueling DQN algorithm has the same network structure as the conventional DQN: an environment, a replay memory unit, two neural networks with the same structure but different parameters, and an error function. Treating the spectrum power-control problem with deep reinforcement learning is essentially a Markov decision process. The ordinary DQN approximates the optimal control strategy with an action-value function Q(s,a):
Q(s,a) = E[Σ_{k=0}^{∞} γ^k·r_{t+k+1} | s_t = s, a_t = a]  (4)
the Dueling DQN re-evaluates the sum of the value of the state and the advantage value of the action in the state as the Q value, and the core content of the competition depth Q network, which is different from the common depth Q network, is expressed as follows:
Q(s,a;θ,α,β) = V(s;θ,β) + A(s,a;θ,α)  (5)
the comparison of the network structures of the DQN and the lasting DQN is shown in FIGS. 2 and 3, and it can be seen that the lasting DQN has two data streams before the output layer, one of which outputs the Q value of the state and the other of which outputs the advantage value of the operation.
1) State
The primary users and the secondary users of the system model are in a non-cooperative relationship; a secondary user accesses the primary users' channel in underlay mode, and neither side can acquire the other's power-transmission strategy. During signal transmission the auxiliary base stations play an important role: they are responsible for collecting the communication information of the primary and secondary users and transmitting the obtained information to the secondary users. Assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]  (6)
the signal strength received by the kth auxiliary base station is defined as:
s_k(t) = Σ_{i=1}^{m} P_i(t)·(l_ik(t)/l_0(t))^(−τ) + Σ_{j=1}^{n} P_j(t)·(l_jk(t)/l_0(t))^(−τ) + σ(t)  (7)
where l_ik(t) and l_jk(t) are the distances at time t from the kth auxiliary base station to the ith primary user and the jth secondary user respectively, l_0(t) represents the reference distance, τ the path-loss exponent, and σ(t) the average noise power of the system.
At time t, secondary user k selects an action in state s_k(t), after which it enters the next state following s_k(t).
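To make Eq. (7) concrete, a small sketch of one auxiliary base station's reading; all positions, powers and constants are invented for illustration:

```python
def abs_signal_strength(p_pu, d_pu, p_su, d_su, l0=1.0, tau=2.5, sigma=1e-6):
    """Received signal strength at one auxiliary base station, per Eq. (7).

    p_pu/d_pu: primary-user transmit powers and their distances to this station;
    p_su/d_su: the same for the secondary users; l0: reference distance;
    tau: path-loss exponent; sigma: average noise power. All values assumed.
    """
    s = sum(p * (d / l0) ** (-tau) for p, d in zip(p_pu, d_pu))
    s += sum(p * (d / l0) ** (-tau) for p, d in zip(p_su, d_su))
    return s + sigma

# The state vector S(t) stacks one such reading per auxiliary base station
s_k = abs_signal_strength(p_pu=[4.0e-3], d_pu=[120.0],
                          p_su=[1.5e-3, 2.0e-3], d_su=[60.0, 90.0])
```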
2) Action
The transmit power selected by each secondary user in each time slot is set as the action value. Each transmit power is a discretized value and each secondary user can select among H different power levels, so the system model has H^n selectable actions in total. The action space is defined as:
A(t) = [P_1(t), P_2(t), ..., P_n(t)]  (8)
3) Graded reward function with high reward and penalty
A key issue in enabling the secondary users to adaptively select a suitable transmit power for spectrum sharing is the design of an efficient reward function. To fit reality closely, four indexes are designed to judge how successful the secondary users' spectrum access is. They are defined as follows:
C_1: γ_i(t) > μ_i for every primary user i
C_2: γ_j(t) > μ_j for every secondary user j
C_3: γ_j(t) > μ_j for at least one secondary user j
C_4: ΣP_i(t) > ΣP_j(t)
where γ_i(t) and γ_j(t) denote the signal-to-noise ratio of any primary user and any secondary user respectively, μ_i and μ_j are the thresholds preset for the primary and secondary users, and ΣP_i and ΣP_j denote the sum of the transmit powers of the primary users and of the secondary users over the access channels.
whether the signal-to-noise ratio of any primary user is larger than a preset threshold value is defined as the most prerequisite for judging whether the power control is successful, and if the signal-to-noise ratio of any primary user is not larger than the preset threshold value, the Complete Failure (CF) of the spectrum access can be directly judged. If the signal-to-noise ratio of any primary user is greater than the preset threshold, but there is no secondary user whose signal-to-noise ratio is greater than the preset threshold, then this situation is referred to as Secondary Failure (SF). If the signal-to-noise ratio of any primary user is greater than a preset threshold, the signal-to-noise ratio of any secondary user is also greater than the preset threshold, and the primary user transmitting power of all access channels is greater than the sum of the secondary user transmitting powers, then the access mode is called Complete Success (CS). In the CS condition, if only a part of secondary users have higher signal-to-noise ratio than a preset threshold and the rest conditions are unchanged, the access mode is called secondary access Success (SS). The specific formula is expressed as follows:
outcome(t) = CF, if C_1 does not hold;
             SF, if C_1 holds but C_3 does not;
             SS, if C_1, C_3 and C_4 hold but C_2 does not;
             CS, if C_1, C_2 and C_4 all hold
According to the above grading conditions, the reward function is defined as:
r(t) = a_1, if CS;  a_2, if SS;  −a_4, if SF;  −a_3, if CF
In the above, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the spectrum-access outcome: the secondary users' complete access success earns a high reward and their complete access failure incurs a high penalty, so the system explores a successful access strategy more quickly.
4) Strategy
The primary users are defined to transmit power according to the following control strategy:
P_i(t+1) = P_i(t) + ΔP, if γ_i(t) ≤ μ_i and γ'_i(t+1) ≥ μ_i
P_i(t+1) = P_i(t) − ΔP, if γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i
P_i(t+1) = P_i(t), otherwise
where ΔP is the power step.
Under this strategy, each primary user controls its transmit power by stepwise updates at every time t. When the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and its predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t+1) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i, the primary user decreases its transmit power; otherwise it keeps the current transmit power unchanged. The signal-to-noise ratio predicted by primary user i for time t+1 is:
γ'_i(t+1) = h_ii(t)P_i(t+1) / (Σ_{j=1}^{n} h_ji(t)P_j(t) + N_i(t))
the present disclosure provides a competition depth Q network power control method based on high rewarding penalty, and performs experimental simulation on a Python platform, and the competition depth Q network power control method is hereinafter and experimentally referred to as a Dueling DQN algorithm because the competition depth Q network power control method is a Dueling DQN algorithm for improving a rewarding function. Under the same simulation environment, performance comparison is carried out on the natural DQN algorithm, the double DQN algorithm and the reducing DQN algorithm. Each algorithm will iterate 40000 times, and the performance results for each index will be displayed once every 1000 times. FIG. 4 is a graph comparing the loss functions of three different deep reinforcement learning algorithms, from which it can be seen that all three eventually converge. However, the natural DQN algorithm and the double DQN algorithm are unstable, have large loss fluctuation and slow convergence speed. The reducing DQN algorithm presented herein can converge at a relatively fast speed and the loss value remains in a very small range.
FIGS. 5 and 6 show the cumulative reward and the total secondary-user throughput of the three deep reinforcement learning algorithms trained for 40000 iterations. Comparing the three algorithms shows that, unlike the Nature DQN and Double DQN algorithms, the Dueling DQN algorithm proposed here explores successful secondary-user access actions from about the 5th round, begins to obtain positive rewards, and its cumulative reward keeps rising; that is, the algorithm learns the correct actions quickly and has a clear advantage. In addition, it achieves the largest total secondary-user throughput and thus the best performance on that index.
FIG. 7 shows the average transmit power of the secondary users for the three algorithms. Overall, the Nature DQN algorithm has the highest average transmit power, and the Double DQN algorithm's average transmit power is almost always 2.0 mW or more, while the Dueling DQN algorithm's is the lowest, mostly between 1.5 and 2.0 mW with only a few values above 2.0 mW. Taking all these indexes together, the simulation results show that during dynamic power control the proposed Dueling DQN algorithm achieves the largest total secondary-user throughput and the lowest average transmit power while still guaranteeing successful secondary-user spectrum access, so it can effectively reduce power loss and save energy.
The same or similar reference numerals correspond to the same or similar components;
the positional relationship depicted in the drawings is for illustrative purposes only and is not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (7)

1. A dueling deep Q-network power control method with high reward and penalty, comprising the steps of:
S1: an auxiliary base station collects the communication information of the primary users and the secondary users and transmits the obtained information to the secondary users;
at time t, secondary user k selects an action in state s_k(t), after which it enters the next state following s_k(t);
S2: setting the transmit power selected by each secondary user in each time slot as the action value, and constructing the action space;
wherein the transmit power of each secondary user is a discretized value and each secondary user selects among H different power levels, so that there are H^n selectable actions in total; the action space is defined as:
A(t) = [P_1(t), P_2(t), ..., P_n(t)]
wherein P represents the transmit power;
S3: constructing a graded reward function with high reward and penalty;
S4: constructing a power control strategy;
the specific process of the step S1 is as follows:
because the primary users and the secondary users are in a non-cooperative relationship, a secondary user accesses the primary users' channel in underlay mode and neither side can acquire the other's power-transmission strategy; the auxiliary base stations play an important role in the signal transmission process, being responsible for collecting the communication information of the primary and secondary users and transmitting the obtained information to the secondary users; assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the kth auxiliary base station is defined as:
s_k(t) = Σ_{i=1}^{m} P_i(t)·(l_ik(t)/l_0(t))^(−τ) + Σ_{j=1}^{n} P_j(t)·(l_jk(t)/l_0(t))^(−τ) + σ(t)
wherein m represents the number of primary users and n the number of secondary users; i denotes the ith primary user and j the jth secondary user; l_ik(t) and l_jk(t) are the distances at time t from the kth auxiliary base station to the ith primary user and the jth secondary user respectively; l_0(t) represents the reference distance, τ the path-loss exponent, and σ(t) the average noise power of the system.
2. The dueling deep Q-network power control method with high reward and penalty according to claim 1, wherein in step S3 four indexes are designed to judge how successful the secondary users' spectrum access is, the indexes being defined as follows:
C_1: ∀i, γ_i > μ_i
C_2: ∀j, γ_j > μ_j
C_3: ∃j, γ_j > μ_j
C_4: ΣP_i > ΣP_j
wherein the symbol ∀ means "for any", the symbol ∃ means "there exists", and the symbol Σ denotes summation; P_i denotes the transmit power of the ith primary user of an access channel and P_j the transmit power of the jth secondary user; γ_i denotes the signal-to-noise ratio of a primary user and γ_j the signal-to-noise ratio of a secondary user; μ_i and μ_j are the thresholds preset for the primary and secondary users respectively; and ΣP_i and ΣP_j denote the sum of the primary-user transmit powers and the sum of the secondary-user transmit powers over the access channels.
3. The dueling deep Q-network power control method with high reward and penalty according to claim 2, wherein in step S3 whether the signal-to-noise ratio of every primary user exceeds its preset threshold is taken as the first prerequisite for judging whether power control succeeds: if any primary user's signal-to-noise ratio fails to exceed its threshold, the spectrum access is directly judged a complete failure CF; if every primary user's signal-to-noise ratio exceeds its threshold but no secondary user's does, the case is called a secondary failure SF; if every primary user's and every secondary user's signal-to-noise ratio exceeds its threshold and the total primary-user transmit power of the access channels exceeds the total secondary-user transmit power, the access is called a complete success CS; under the complete-success conditions, if only some of the secondary users' signal-to-noise ratios exceed the threshold while the remaining conditions are unchanged, the access is called a secondary success SS; formally:
outcome(t) = CF, if C_1 does not hold;
             SF, if C_1 holds but C_3 does not;
             SS, if C_1, C_3 and C_4 hold but C_2 does not;
             CS, if C_1, C_2 and C_4 all hold
according to the above grading conditions, the reward function is defined as:
r(t) = a_1, if CS;  a_2, if SS;  −a_4, if SF;  −a_3, if CF
in the above, the parameters a_1, a_2, a_3 and a_4 are all constants with a_1 > 10a_2 and a_3 > 10a_4; the reward function is graded according to the spectrum-access outcome, giving a high reward to the secondary users' complete access success and a high penalty to their complete access failure, so that the system explores a successful access strategy more quickly.
4. The dueling deep Q-network power control method with high reward and penalty according to claim 3, wherein in step S4 the primary users are defined to transmit power according to the following strategy:
P_i(t+1) = P_i(t) + ΔP, if γ_i(t) ≤ μ_i and γ'_i(t+1) ≥ μ_i
P_i(t+1) = P_i(t) − ΔP, if γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i
P_i(t+1) = P_i(t), otherwise
where ΔP is the power step;
under this strategy, each primary user controls its transmit power by stepwise updates at every time t.
5. The dueling deep Q-network power control method with high reward and penalty according to claim 4, wherein when the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and its predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t+1) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i, the primary user decreases its transmit power; otherwise it keeps the current transmit power unchanged; the signal-to-noise ratio predicted by primary user i for time t+1 is:
γ'_i(t+1) = h_ii(t)P_i(t+1) / (Σ_{j=1}^{n} h_ji(t)P_j(t) + N_i(t))
where h represents channel gain and N represents ambient noise; h_ii and h_jj denote the channel gains of the ith primary user and the jth secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the ith primary user and the jth secondary user respectively.
6. The dueling deep Q-network power control method with high reward and penalty according to claim 5, wherein a secondary user accesses the primary users' channel in underlay mode and strict requirements are placed on its transmit power so as not to affect the primary users' normal communication; to avoid affecting the primary users' normal communication, the secondary user continuously learns the data information collected from the auxiliary base stations and then completes the communication transmission task with a suitable transmit power; the signal-to-noise ratio is an important index for measuring link quality, and the signal-to-noise ratio of the ith primary user is defined as:
γ_i(t) = h_ii(t)P_i(t) / (Σ_{j=1}^{n} h_ji(t)P_j(t) + N_i(t))
and the signal-to-noise ratio of the jth secondary user as:
γ_j(t) = h_jj(t)P_j(t) / (h_ij(t)P_i(t) + Σ_{k≠j} h_kj(t)P_k(t) + N_j(t))
wherein h_ii and h_jj denote the channel gains of the ith primary user and the jth secondary user respectively; P_i(t), P_j(t) and P_k(t) are the transmit powers of the ith primary user, the jth secondary user and the kth secondary user at time t; h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the ith primary user and the jth secondary user, between the jth secondary user and the ith primary user, and between the kth and the jth secondary users, respectively; and N_i(t) and N_j(t) denote the ambient noise received by the ith primary user and the jth secondary user respectively.
7. The dueling deep Q-network power control method with high reward and penalty according to claim 6, wherein the channel gains, transmit powers and the like change dynamically, and according to Shannon's theorem the relationship between the throughput of the jth secondary user and its signal-to-noise ratio is defined as:
T_j(t) = W·log_2(1 + γ_j(t))
wherein W is the channel bandwidth;
in this dynamically changing system, to guarantee the best power-allocation effect, the primary users' signal-to-noise ratios must stay above their preset thresholds while the secondary users adjust their own transmit power through continuous learning, so that the total throughput of the secondary users in the whole system is maximized.
CN202110701419.XA 2021-06-23 2021-06-23 Dueling deep Q-network power control method with high reward and penalty Active CN113438723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110701419.XA CN113438723B (en) 2021-06-23 2021-06-23 Dueling deep Q-network power control method with high reward and penalty

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110701419.XA CN113438723B (en) 2021-06-23 2021-06-23 Dueling deep Q-network power control method with high reward and penalty

Publications (2)

Publication Number Publication Date
CN113438723A CN113438723A (en) 2021-09-24
CN113438723B true CN113438723B (en) 2023-04-28

Family

ID=77753705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110701419.XA Active CN113438723B (en) 2021-06-23 2021-06-23 Dueling deep Q-network power control method with high reward and penalty

Country Status (1)

Country Link
CN (1) CN113438723B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116321390A (en) * 2023-05-23 2023-06-23 北京星河亮点技术股份有限公司 Power control method, device and equipment
CN117545094B (en) * 2024-01-09 2024-03-26 大连海事大学 Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110267338A * 2019-07-08 2019-09-20 西安电子科技大学 Joint resource allocation and power control method in D2D communication
WO2020244906A1 (en) * 2019-06-03 2020-12-10 Nokia Solutions And Networks Oy Uplink power control using deep q-learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102355738B (en) * 2011-06-29 2014-05-07 中国人民解放军理工大学 Cognitive SIMO (Single Input Multiple Output) network access method on basis of cooperative relay
CN102355729B (en) * 2011-06-29 2014-03-12 中国人民解放军理工大学 Maximum throughput resource distribution method in cooperative and cognitive single-input multiple-output (SIMO) network
CN109803344B * 2018-12-28 2019-10-11 北京邮电大学 Joint construction method for unmanned aerial vehicle (UAV) network topology and routing
CN111262638B (en) * 2020-01-17 2021-09-24 合肥工业大学 Dynamic spectrum access method based on efficient sample learning
CN111726811B (en) * 2020-05-26 2023-11-14 国网浙江省电力有限公司嘉兴供电公司 Slice resource allocation method and system for cognitive wireless network
CN112367132B (en) * 2020-10-27 2021-12-24 西北工业大学 Power distribution algorithm in cognitive radio based on reinforcement learning solution

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020244906A1 (en) * 2019-06-03 2020-12-10 Nokia Solutions And Networks Oy Uplink power control using deep q-learning
CN110267338A * 2019-07-08 2019-09-20 西安电子科技大学 Joint resource allocation and power control method in D2D communication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Taotao; Zhu Jiang. Joint channel selection and power control based on multi-user Q-learning in CRN. Application Research of Computers, 2020, (08), full text. *

Also Published As

Publication number Publication date
CN113438723A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
EP1805909B1 (en) Transmit power control techniques for wireless communication systems
CN113438723B (en) Dueling deep Q-network power control method with high reward and penalty
CN103796211B (en) Joint Power and method for channel allocation in a kind of cognition wireless network
CN110267274B (en) Spectrum sharing method for selecting sensing users according to social credibility among users
CN109788566A (en) Network resource allocation method based on depth enhancing study
Ma et al. On-demand resource management for 6g wireless networks using knowledge-assisted dynamic neural networks
Dai et al. Contextual multi-armed bandit for cache-aware decoupled multiple association in UDNs: A deep learning approach
Mei et al. Semi-decentralized network slicing for reliable V2V service provisioning: A model-free deep reinforcement learning approach
Guo et al. Throughput Maximization for RF Powered Cognitive NOMA Networks with Backscatter Communication by Deep Reinforcement Learning
Sonti et al. Enhanced fuzzy C‐means clustering based cooperative spectrum sensing combined with multi‐objective resource allocation approach for delay‐aware CRNs
Chang et al. Fuzzy/neural congestion control for integrated voice and data DS-CDMA/FRMA cellular networks
Cui et al. Hierarchical learning approach for age-of-information minimization in wireless sensor networks
Wang et al. Completion time minimization in NOMA systems: Learning for combinatorial optimization
Joshi et al. Optimized fuzzy power control over fading channels in spectrum sharing cognitive radio using ANFIS
Khan et al. Sum throughput maximization scheme for NOMA-Enabled D2D groups using deep reinforcement learning in 5G and beyond networks
Balcı et al. Fairness aware deep reinforcement learning for grant-free NOMA-IoT networks
Wu et al. Q-learning based link adaptation in 5G
Zhou et al. Neural network joint capacity-power control strategy based on NSGAII-BP for interference suppression in LEO satellite uplinks
CN113890653A (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
Tariq et al. Auto-tuning for cellular scheduling through bandit-learning and low-dimensional clustering
CN113395757A (en) Deep reinforcement learning cognitive network power control method based on improved return function
Periola et al. A robust conscious model for enhancing cognitive radio quality of service
Do et al. Dynamic bandwidth allocation scheme for wireless networks with energy harvesting using actor-critic deep reinforcement learning
CN105007582A (en) Dynamic resource allocation method for controlled wireless network system based on POMDP
Ali et al. Deep-Q Learning for Distributed Spectrum Reuse Maximization in Underlay Cognitive Radio Systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231212

Address after: 518021 A807, Jihao Building, No. 1086 Shennan East Road, Fenghuang Community, Huangbei Street, Luohu District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Tuo Ai Wei Information Technology Co.,Ltd.

Address before: 510090 Dongfeng East Road 729, Yuexiu District, Guangzhou City, Guangdong Province

Patentee before: GUANGDONG University OF TECHNOLOGY