CN112954814B - Channel quality access method in cognitive radio - Google Patents

Channel quality access method in cognitive radio

Info

Publication number
CN112954814B
CN112954814B (application CN202110107271.7A)
Authority
CN
China
Prior art keywords
network
channel
actor
secondary user
global
Prior art date
Legal status
Active
Application number
CN202110107271.7A
Other languages
Chinese (zh)
Other versions
CN112954814A (en)
Inventor
叶方 (Ye Fang)
张音捷 (Zhang Yinjie)
李一兵 (Li Yibing)
孙骞 (Sun Qian)
田园 (Tian Yuan)
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202110107271.7A
Publication of CN112954814A
Application granted
Publication of CN112954814B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W74/00: Wireless channel access
    • H04W74/08: Non-scheduled access, e.g. ALOHA
    • H04W74/0808: Non-scheduled access, e.g. ALOHA using carrier sensing, e.g. carrier sense multiple access [CSMA]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04B: TRANSMISSION
    • H04B17/00: Monitoring; Testing
    • H04B17/30: Monitoring; Testing of propagation channels
    • H04B17/309: Measuring or estimating channel quality parameters
    • H04B17/336: Signal-to-interference ratio [SIR] or carrier-to-interference ratio [CIR]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04B: TRANSMISSION
    • H04B17/00: Monitoring; Testing
    • H04B17/30: Monitoring; Testing of propagation channels
    • H04B17/382: Monitoring; Testing of propagation channels for resource allocation, admission control or handover
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Electromagnetism (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a channel quality access method in cognitive radio. Each local network comprises an actor network and a critic network: the actor network is responsible for channel selection and interacts with the environment to collect interaction information, while the critic network evaluates the quality of the actor network's channel-selection strategy. The local networks do not apply gradient updates themselves; they accumulate their gradients and transmit them to the global network. The global network does not interact with the environment: it aggregates the gradients collected by the local networks, performs the gradient updates, and transmits the updated network parameters back to the local networks. Because the invention jointly considers channel quality and idle probability, the secondary user can effectively avoid accessing inferior channels, greatly improving the success rate of accesses that meet the quality-of-service requirement.

Description

Channel quality access method in cognitive radio
(I) Technical Field
The invention belongs to the technical field of communication, relates in particular to cognitive radio communication technology, and specifically relates to a channel quality access method in cognitive radio.
(II) Background of the Invention
With the popularization of 4G/5G networks, mobile devices are proliferating, fields such as cloud computing, the Internet of Things and artificial intelligence have emerged, and new communication services appear endlessly. The wireless spectrum, as the foundation on which all of these services operate, has become increasingly scarce under the existing spectrum planning and management. The existing spectrum allocation model is exclusive and static: even when a licensed user is not using its allocated band, other users may not use it. Cognitive radio uses licensed bands through dynamic spectrum access, providing a brand-new scheme for improving spectrum utilization without causing harmful interference to the licensed/primary users. Which channel the secondary user senses and accesses directly affects the secondary user's sensing delay, transmission performance and other aspects, so research on this problem is urgent; it is one of the key factors for improving the performance of a cognitive radio system.
Existing channel access algorithms adopt sequential detection access: a sensing order is determined before sensing, and channels are sensed in that predefined order. Sequential detection access designs the channel sensing-access order using prior information about the channel environment, such as the channel idle probability, the primary user's occupancy pattern, and the channel signal-to-noise ratio. Although sequential detection access is simple to design, it requires most of this environmental prior knowledge, which is difficult to obtain in a practical environment. The performance of sequential detection algorithms is also easily affected by "poor channels" in the environment: a channel may be idle most of the time yet have a low signal-to-noise ratio, or have a high signal-to-noise ratio yet be frequently occupied by the primary user. A sequential detection algorithm based on signal-to-noise ratio tends to select a channel with high signal-to-noise ratio that is frequently occupied by the primary user, which yields a low sensing-access success rate; a sequential detection algorithm based on channel idle probability tends to select a channel that is often idle but has low signal-to-noise ratio, so the secondary user fails to meet the quality-of-service requirement and obtains low throughput.
Deep reinforcement learning has achieved outstanding success in fields such as video games, robotics and Go, and can learn by interacting with the environment even when most prior information about the environment is missing, thereby making intelligent decisions. The invention introduces the asynchronous advantage actor-critic network from deep reinforcement learning into cognitive radio, so that the secondary user can intelligently select a channel that meets its own quality-of-service requirement for sensing access even when most prior information about the channel environment is unknown.
(III) Disclosure of the Invention
The invention aims to provide a channel access method that overcomes the shortcoming that sequential detection algorithms are easily disturbed by low-quality channels in the environment, and that intelligently selects a channel meeting the secondary user's own quality-of-service requirement for sensing access even when most prior information about the channel environment is unknown.
The purpose of the invention is realized as follows:
1.1, initializing the actor network and critic network parameters in the global network, and assigning the global network parameters to the local networks;
1.2, under each local network, the secondary user selects a channel for access according to the observation matrix formed from its observation information and the current strategy; the secondary user senses and accesses the selected channel and obtains an instant reward according to the channel state;
1.3, after a number of iterations, calculating the gradients of the local actor network and the local critic network respectively, transmitting the gradients to the global network, and resetting the gradients of the local actor and critic networks;
1.4, updating the global actor network according to the actor network update function, updating the global critic network according to the critic network update function, and assigning the updated global network parameters to the local networks;
1.5, repeating steps 1.2 to 1.4 until all iterations are finished, yielding the complete neural network model.
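For clarity, a minimal structural sketch of steps 1.1 to 1.5 is given below. It is Python with a stub environment; the class names, the simplified reward and all numeric values are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

class StubChannelEnv:
    """Toy environment: each channel is busy with a fixed probability."""
    def __init__(self, p_busy, seed=0):
        self.p_busy = np.asarray(p_busy)
        self.rng = np.random.default_rng(seed)

    def sense_and_access(self, channel):
        idle = self.rng.random() > self.p_busy[channel]
        return 1.0 if idle else -1.0   # simplified reward: +1 idle, -1 occupied

env = StubChannelEnv(p_busy=[0.1, 0.3, 0.4, 0.3, 0.2, 0.5, 0.3, 0.4, 0.4, 0.9])
rng = np.random.default_rng(1)

for episode in range(5):                  # step 1.5: repeat steps 1.2 to 1.4
    rewards = []
    for t in range(20):                   # step 1.2: interact with the channels
        action = int(rng.integers(10))    # placeholder for the actor's policy
        rewards.append(env.sense_and_access(action))
    # Steps 1.3 and 1.4 would compute the local actor/critic gradients here,
    # push them to the global network, and pull the updated parameters back.
    print(f"episode {episode}: mean reward {np.mean(rewards):+.2f}")
```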
The asynchronous advantage actor-critic network consists of two major parts: a global network and local networks. The global network and the local networks share the same neural network structure: the actor network has one hidden layer with 200 neurons and a linear rectification activation function, and the critic network likewise has one hidden layer with 200 neurons and a linear rectification activation function.
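A sketch of this architecture in PyTorch might look as follows; only the 200-neuron hidden layer and the linear rectification activation come from the description above, while the input and output sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_CHANNELS = 10    # assumed number of channels (actions)
OBS_DIM = 10 * 5   # assumed flattened observation matrix: N channels x M steps

class Actor(nn.Module):
    """One 200-neuron ReLU hidden layer; outputs a channel-selection distribution."""
    def __init__(self, obs_dim=OBS_DIM, n_actions=N_CHANNELS):
        super().__init__()
        self.hidden = nn.Linear(obs_dim, 200)
        self.out = nn.Linear(200, n_actions)

    def forward(self, s):
        return torch.softmax(self.out(torch.relu(self.hidden(s))), dim=-1)

class Critic(nn.Module):
    """One 200-neuron ReLU hidden layer; outputs a scalar state-value estimate."""
    def __init__(self, obs_dim=OBS_DIM):
        super().__init__()
        self.hidden = nn.Linear(obs_dim, 200)
        self.out = nn.Linear(200, 1)

    def forward(self, s):
        return self.out(torch.relu(self.hidden(s)))
```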
The local networks of the invention are as follows: each local network interacts with the environment independently, so each local network also has its own actor network and critic network; the local actor networks each interact with the channel environment independently, the critic network evaluates the actor network's action strategy, and the network structures of all local networks are exactly the same.
The observation matrix is as follows: the secondary user can only observe the states of the channels it selects for sensing, and the secondary user's observation information in the t-th time slot is:
$$O_t=[o_{1,t},o_{2,t},\ldots,o_{N,t}]$$
After a temporary memory mechanism is introduced, the secondary user stores the observation information of the previous M steps. These M steps of observation information form the observation matrix, which at time t can be expressed as:
$$S_t=[O_{t-1},O_{t-2},O_{t-3},\ldots,O_{t-M}]$$
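A small sketch of this temporary-memory mechanism follows; the zero placeholder for unobserved channels and the sizes are assumptions, since the text only states that unsensed channel states are unavailable.

```python
from collections import deque
import numpy as np

N, M = 10, 5                    # channels, memory depth (illustrative)
memory = deque(maxlen=M)        # holds O_{t-1}, O_{t-2}, ..., O_{t-M}

for t in range(7):              # pretend we sense one channel per slot
    O_t = np.zeros(N)           # unobserved channels left at 0 (assumption)
    sensed = t % N              # placeholder channel choice
    O_t[sensed] = 1.0           # record the observed state of that channel
    memory.appendleft(O_t)      # newest observation first

if len(memory) == M:
    S_t = np.stack(list(memory))  # the M x N observation matrix S_t
```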
The interaction reward function is as follows: if the channel the secondary user selects for sensing access is idle and meets its own quality-of-service requirement, the decision is correct and positive feedback is obtained; if the channel selected and sensed by the secondary user is occupied by the primary user, the decision is wrong and a negative-feedback penalty is received. Considering that all channels meeting the secondary user's quality-of-service requirement may be busy during a certain period, the reward is set so that a selected channel that is idle but does not meet the quality-of-service requirement still yields a small positive feedback:
$$r=\begin{cases}(D_i-\eta)/\eta, & \text{sensed channel idle and } D_i\ge\eta\\ \epsilon, & \text{sensed channel idle and } D_i<\eta\\ -\delta, & \text{sensed channel occupied by the primary user}\end{cases}$$
where D_i represents the throughput obtained on the i-th channel, η is the secondary user's throughput threshold, ε denotes the small positive feedback, and δ > 0 the negative-feedback penalty. (D_i - η)/η, the difference between the throughput obtained on the i-th channel and the threshold η normalized by η, mainly guides the secondary user to select the better channels.
The global network of the invention is as follows: the global network does not interact with the environment; its main work is to collect the gradient data from each local network, update the network with that gradient data, and transmit the updated network parameters back to each local network.
The update function of the global actor network is:
$$\theta\leftarrow\theta+\nabla_{\theta'}\log\pi_{\theta'}(s,a)\,A(s,a)+\beta\,\nabla_{\theta'}H(\pi_{\theta'}(s))$$
where θ represents the parameters of the global actor network; A(s,a) is the advantage function, which measures how good the action is in the given environment state; H(π_{θ'}(s)) is the policy entropy, used to increase the secondary user's exploration; and β is the policy entropy weight, which controls the degree of exploration.
The update function of the global critic network is:
$$\mu\leftarrow\mu-\lambda\,\nabla_{\mu}A(s,a)^{2}$$
where μ represents the parameters of the global critic network; r is the instant reward obtained by the secondary user; γ is the discount factor; and λ is the learning rate of the critic network.
Compared with the prior art, the beneficial effects of the invention are:
1. The invention jointly considers the channel signal-to-noise ratio and idle probability, can effectively avoid poor channels in the environment, and effectively improves the secondary user's success rate in accessing high-quality channels;
2. The reward function of the invention is designed to encourage the secondary user to access a better channel on the premise of meeting QoS, thereby guiding the secondary user to make better decisions;
3. With most of the environmental prior information missing, the method approaches the access success rate of algorithms with fully known prior information, and exceeds the access success rate of algorithms with partially known prior information when the number of sensing operations is small.
(IV) Description of the Drawings
FIG. 1 is a flow chart of the algorithm of the invention;
FIG. 2 shows the number of times each channel is selected in each cycle;
FIG. 3 compares the access success rate of the invention against sequential sensing with different amounts of known prior information.
(V) Detailed Description of the Preferred Embodiments
The following detailed description is made with reference to the accompanying drawings and specific examples:
The final objective of the invention's algorithm is for the secondary user to intelligently select, according to the learned channel access strategy, an idle channel that meets its own quality of service for sensing access. Abstracted into reinforcement learning, this means the agent should adopt the strategy that maximizes the accumulated reward. Since a user's communication in a single cycle could continue indefinitely over time, the accumulated reward would tend to infinity and the quality of a strategy could not be effectively evaluated; the number of time slots in a single iteration is therefore defined as T. The above problem can be expressed as:
$$\max\ \sum_{t=1}^{T} r_{i,t}$$ (1)
where r_{i,t} denotes the instant reward obtained by selecting the i-th channel at time t.
The invention assumes that N channels and one secondary user exist in the environment, that the states of the N channels are all time-varying, and that the channel state depends only on the primary user's occupancy. The secondary user can sense n (n << N) channels in one time slot; in the t-th time slot, the environment information the secondary user can observe is:
$$O_t=[o_{1,t},o_{2,t},\ldots,o_{N,t}]$$ (2)
where o_{i,t} represents the secondary user's observation of the i-th channel at time t:
$$o_{i,t}=\begin{cases}x_{i,t}, & \text{channel } i \text{ is sensed in slot } t\\ \text{unobserved}, & \text{otherwise}\end{cases}$$ (3)
where x_{i,t} is the channel state of the i-th channel at time t. After a temporary memory mechanism is introduced, the secondary user stores the observation information of the previous M steps. These M steps of observation information form the observation matrix, which at time t can be expressed as:
$$S_t=[O_{t-1},O_{t-2},O_{t-3},\ldots,O_{t-M}]$$ (4)
After sensing the n channels, the secondary user selects for access the one that best meets its own QoS requirement. When n channels are selected for sensing, the number of elements in the action set is:
$$|A|=C_N^n=\frac{N!}{n!\,(N-n)!}$$ (5)
For example, if 5 channels exist in the environment and two channels are selected for sensing in a single time slot, the action set is A = {(1,2), (1,3), (1,4), ..., (4,5)}. If only one channel can be sensed in a single time slot, the action set is simply the set of channels existing in the environment:
$$A=\{1,2,3,\ldots,N\}$$ (6)
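The action-set sizes above are ordinary combinations, as the following snippet checks for the 5-channel example (illustrative code):

```python
from itertools import combinations
from math import comb

N, n = 5, 2
A = list(combinations(range(1, N + 1), n))  # {(1,2), (1,3), ..., (4,5)}
assert len(A) == comb(N, n) == 10           # |A| = C(N, n), as in equation (5)

A_single = list(range(1, N + 1))            # n = 1: the action set is the channels
```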
if the secondary user selects to sense that the accessed channel is idle and meets the self service quality requirement, the decision is correct, and positive feedback is obtained; if the channel selected and sensed by the secondary user is occupied by the primary user, the decision is wrong, and a negative feedback punishment is received. Considering that channels meeting the service quality requirement of the secondary user are all in a busy state in a certain period of time, the channel selected and sensed by the secondary user is set to be an idle channel although the channel does not meet the service quality requirement, and a small positive feedback can still be obtained. The reward function may be represented by the following equation:
Figure BDA0002918007300000043
the quality of service requirement of the secondary user is determined by the throughput, and the quality of service is determined to be qualified only if the obtained throughput of the access channel is higher than the threshold requirement. DiRepresenting the obtained throughput of the ith channel, with η being the throughput threshold of the secondary user. (D)i- η)/η is the ratio of the throughput obtained for the ith channel to the threshold η difference, mainly to guide the secondary user to select the more excellent channel.
The asynchronous advantage actor-critic network is divided into local networks and a global network. Each local network interacts with the environment independently, so each local network also has its own actor network and critic network; the local actor networks each interact with the channel environment independently, the critic network evaluates the actor network's action strategy, and the network structures of all local networks are exactly the same. The global network does not interact with the environment; its main work is to collect the gradient data from each local network, update the network with that gradient data, and transmit the updated network parameters back to each local network.
The actor network in a local network interacts with the environment and selects actions; its main task is strategy learning, and it computes the gradient directly on the strategy:
$$\nabla_\theta J(\theta)=\sum_{s}d(s)\sum_{a}\pi_\theta(s,a)\,\nabla_\theta\log\pi_\theta(s,a)\,r_{s,a}$$ (8)
where J(θ) represents the objective function of the policy network; π_θ(s,a) represents the probability of selecting action a in state s when the network parameter is θ; d(s) represents the distribution of the states collected in this interaction; and r_{s,a} represents the immediate reward obtained by selecting action a in state s.
The local critic network is mainly used to estimate the state value, evaluate how good the actor network's action strategy is, and guide the actor network's update through the advantage function. The advantage function is the advantage of a given action a over the average in state s. Multi-step sampling is employed in the asynchronous advantage actor-critic network to accelerate convergence:
$$A(s,a)=Q(s,a)-V(s)=r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{n-1}r_{t+n}+\gamma^{n}V(s')-V(s)$$ (9)
where V(s) represents the value of state s, which can be estimated by the critic network. Combining equation (9), the policy gradient calculation of equation (8) becomes:
$$\nabla_\theta J(\theta)=\sum_{s}d(s)\sum_{a}\pi_\theta(s,a)\,\nabla_\theta\log\pi_\theta(s,a)\,A(s,a)$$ (10)
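A direct transcription of the n-step advantage in equation (9), in plain Python with illustrative inputs:

```python
def n_step_advantage(rewards, v_s, v_s_next, gamma=0.9):
    """A(s,a) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
                + gamma^n * V(s') - V(s)"""
    n = len(rewards)
    n_step_return = sum(gamma**k * r for k, r in enumerate(rewards))
    return n_step_return + gamma**n * v_s_next - v_s

# three-step rollout with rewards [1, 0, 1] and critic estimates V(s), V(s'):
adv = n_step_advantage([1.0, 0.0, 1.0], v_s=0.5, v_s_next=0.8, gamma=0.9)
print(adv)  # 1.8932 = 1 + 0.9*0 + 0.81*1 + 0.729*0.8 - 0.5
```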
the global network does not interact with the environment, and the method mainly works by collecting gradient data of each network, updating the network through the gradient data and transmitting updated network parameters to each local network. The structure of the global network also remains consistent with the local network due to the mutual communication of parameters and gradients. The actor network in the global network is also responsible for updating the action strategy, and the gradient update can be expressed as:
Figure BDA0002918007300000053
where θ represents a parameter of the global actor network; a (s, a) represents a merit function representing the degree of superiority and inferiority of the operation in the environmental state; h (Pi)θ'(s)) is a policy entropy for increasing exploratory power of previous users; beta represents a policy entropy weight for controlling the degree of exploration. After the dominance function is introduced, the global network critics network improves the fitting accuracy of the value function by minimizing the square of the dominance function, and the gradient update of the global network critics network can be expressed as:
Figure BDA0002918007300000054
where μ represents a parameter of the global critic network; r represents the instant reward obtained by the secondary user; gamma is a discount factor; λ is the learning rate of the critic network.
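One way a local worker could realize equations (11) and (12) against shared global networks is sketched below in PyTorch; the combined loss and the push-and-pull helper follow common A3C implementations and are assumptions, not the patent's code.

```python
import torch

def a3c_loss(probs, action, advantage, beta=0.01):
    """Actor term: -(log pi * A + beta * entropy), eq. (11); critic term: A^2, eq. (12)."""
    dist = torch.distributions.Categorical(probs=probs)
    actor_loss = -(dist.log_prob(action) * advantage.detach() + beta * dist.entropy())
    critic_loss = advantage.pow(2)          # minimize the squared advantage
    return (actor_loss + critic_loss).sum()

def push_and_pull(local_net, global_net, global_opt, loss):
    global_opt.zero_grad()
    loss.backward()                                      # local gradients (step 1.3)
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp.grad = lp.grad                                # hand gradients to the global net
    global_opt.step()                                    # global update (step 1.4)
    local_net.load_state_dict(global_net.state_dict())   # pull parameters back
```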
The simulation parameters of the invention's simulation example are set as follows. The simulation parameters divide into system environment parameters and neural network parameters. The system environment parameters are: the environment contains N = 10 independent channels, each of which may be occupied by a primary user with occupation probability P_busy ∈ (0,1), and the channel signal-to-noise ratios lie in the range [-10, 10] dB. In the simulation experiments, the signal-to-noise ratios of the 10 channels are set to SNR = [-10, -8, -9, -5, -3, 0, 4, 5, 7, 10] dB, with corresponding occupation probabilities P_busy = [0.1, 0.3, 0.4, 0.3, 0.2, 0.5, 0.3, 0.4, 0.4, 0.9]. The neural network parameters are: the actor and critic network structures of the local networks and the global network are identical; the actor network has one hidden layer with 200 neurons and a linear rectification activation function, and its output layer directly outputs the action-selection probability distribution; the critic network also has one hidden layer with 200 neurons and a linear rectification activation function, and its output layer outputs the value estimate. The learning rate of the critic network must be greater than or equal to that of the actor network; the invention sets the critic network learning rate Lr_c = 0.001 and the actor network learning rate Lr_a = 0.0001. The invention defines the access success rate as the probability that the secondary user successfully accesses an idle channel that meets the quality of service.
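The settings above, collected into one configuration block for reproduction (the variable names are illustrative):

```python
config = {
    "n_channels": 10,
    "snr_db":  [-10, -8, -9, -5, -3, 0, 4, 5, 7, 10],  # per-channel SNR in dB
    "p_busy":  [0.1, 0.3, 0.4, 0.3, 0.2, 0.5, 0.3, 0.4, 0.4, 0.9],
    "lr_critic": 1e-3,     # must be >= the actor learning rate
    "lr_actor":  1e-4,
    "hidden_units": 200,   # one ReLU hidden layer in both actor and critic
}
```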
As shown in FIG. 2, there are 3 channels in the environment that meet the QoS requirement; the figure shows the number of times the secondary user selects each of these three channels for sensing access when sensing once per time slot. As can be seen, at the start of the iterations the three channels are selected almost equally often because of exploration. As the iterations progress, however, although the 10th channel has a high signal-to-noise ratio, its primary user occupies it frequently, so the number of times it is selected keeps decreasing; the secondary user's learning considers channel access over a longer horizon, so poor channels can be effectively avoided. The other two channels meeting the QoS requirement are gradually selected more often, since their primary users occupy them less frequently. Meanwhile, owing to the design of the reward function, the secondary user prefers to access the 9th channel when the primary-user occupation probabilities are similar, showing that the reward function design can guide the secondary user to make better decisions.
As shown in FIG. 3, with 3 channels in the environment meeting the QoS requirement, the invention's access success rate is compared against sequential sensing with different amounts of known prior information, under different numbers of sensing operations. Fully-known sensing assumes the secondary user knows the signal-to-noise ratio of every channel and the primary-user occupation probability corresponding to each channel, and senses in the order given by the product of signal-to-noise ratio and idle probability, SNR·(1 - P_busy). As the figure shows, fully-known sensing always senses a fixed channel because of the nature of sequential sensing, so with a single sensing operation its access success rate depends entirely on the first sensed channel, whereas the sensing access algorithm of the invention can intelligently select a suitable channel for access without being restricted to sequential sensing access.
In summary, the invention provides a channel quality access method in cognitive radio. Each local network has an actor network and a critic network: the actor network is responsible for channel selection and interacts with the environment to collect interaction information, and the critic network evaluates the quality of the actor network's channel-selection strategy. The local networks do not apply gradient updates themselves but accumulate their gradients and transmit them to the global network; the global network does not interact with the environment, aggregates the gradients collected by the local networks, performs the gradient updates, and transmits the updated network parameters back to the local networks. Because channel quality and idle probability are considered jointly, the secondary user can effectively avoid accessing inferior channels, greatly improving the success rate of accesses that meet the quality-of-service requirement.
The technical solution of the present invention is not limited to the embodiments described above; the invention can be extended to other modifications, variations, applications and embodiments, and all such modifications, variations, applications and embodiments are considered to be within the spirit and teaching scope of the present invention.

Claims (5)

1. A channel quality access method in cognitive radio, characterized in that the method comprises the following steps:
1.1, initializing the actor network and critic network parameters in the global network, and assigning the global network parameters to the local networks;
1.2, under each local network, the secondary user selects a channel for access according to the observation matrix formed from its observation information and the current strategy; the secondary user senses and accesses the selected channel and obtains an instant reward according to the channel state;
1.3, after a number of iterations, calculating the gradients of the local actor network and the local critic network respectively, transmitting the gradients to the global network, and resetting the gradients of the local actor and critic networks;
the actor network in the local network interacts with the environment and selects actions; its main task is strategy learning, and it computes the gradient directly on the strategy:
$$\nabla_\theta J(\theta)=\sum_{s}d(s)\sum_{a}\pi_\theta(s,a)\,\nabla_\theta\log\pi_\theta(s,a)\,r_{s,a}$$
where J(θ) represents the objective function of the policy network; π_θ(s,a) represents the probability of selecting action a in state s when the network parameter is θ; d(s) represents the distribution of the states collected in this interaction; and r_{s,a} represents the instant reward obtained by selecting action a in state s;
the local critic network is mainly used to estimate the state value, evaluate how good the actor network's action strategy is, and guide the actor network's update through the advantage function; the advantage function is the advantage of a given action a over the average in state s, and multi-step sampling is adopted in the asynchronous advantage actor-critic network to accelerate convergence:
$$A(s,a)=Q(s,a)-V(s)=r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{n-1}r_{t+n}+\gamma^{n}V(s')-V(s)$$
where V(s) represents the value of state s, which can be estimated by the critic network; the gradient calculation for the strategy then becomes:
$$\nabla_\theta J(\theta)=\sum_{s}d(s)\sum_{a}\pi_\theta(s,a)\,\nabla_\theta\log\pi_\theta(s,a)\,A(s,a)$$
1.4, updating the global actor network according to the actor network update function, updating the global critic network according to the critic network update function, and assigning the updated global network parameters to the local networks;
the update function of the global actor network is:
$$\theta\leftarrow\theta+\nabla_{\theta'}\log\pi_{\theta'}(s,a)\,A(s,a)+\beta\,\nabla_{\theta'}H(\pi_{\theta'}(s))$$
where θ represents the parameters of the global actor network, A(s,a) represents the advantage function measuring how good the action is in the environment state, H(π_{θ'}(s)) is the policy entropy used to increase the secondary user's exploration, and β is the policy entropy weight;
the update function of the global critic network is:
$$\mu\leftarrow\mu-\lambda\,\nabla_{\mu}A(s,a)^{2}$$
where μ represents the parameters of the global critic network, r represents the instant reward obtained by the secondary user, γ is the discount factor, and λ is the learning rate of the critic network;
1.5, repeating steps 1.2 to 1.4 until all iterations are finished, yielding the complete neural network model.
2. The method of claim 1, characterized in that: multiple channels are accessible in the environment, and the secondary user quickly finds and accesses a channel that meets its own quality-of-service requirement.
3. The method of claim 1, characterized in that: in step 1.1, the neural networks of the global network and the local networks have the same structure, wherein the actor network has one hidden layer with 200 neurons and a linear rectification activation function, and the critic network has one hidden layer with 200 neurons and a linear rectification activation function.
4. The method of claim 1, characterized in that: in step 1.2, each local network interacts with the environment independently and has its own actor network and critic network; the local actor networks each interact with the channel environment independently, the critic network evaluates the actor network's action strategy, and the network structures of the local networks are exactly the same.
5. The method of claim 1, characterized in that: in the observation matrix of step 1.2, the secondary user can only observe the states of the channels selected for sensing, and the secondary user's observation information in the t-th time slot is:
$$O_t=[o_{1,t},o_{2,t},\ldots,o_{N,t}]$$
after a temporary memory mechanism is introduced, the secondary user stores the observation information of the previous M steps; these M steps of observation information form the observation matrix, which at time t can be expressed as:
$$S_t=[O_{t-1},O_{t-2},O_{t-3},\ldots,O_{t-M}]$$
a reward is obtained after interacting with the environment, where the reward function is:
$$r=\begin{cases}(D_i-\eta)/\eta, & \text{sensed channel idle and } D_i\ge\eta\\ \epsilon, & \text{sensed channel idle and } D_i<\eta\\ -\delta, & \text{sensed channel occupied by the primary user}\end{cases}$$
if the channel the secondary user selects for sensing access is idle and meets its own quality-of-service requirement, the decision is correct and positive feedback is obtained; if the channel selected and sensed by the secondary user is occupied by the primary user, the decision is wrong and a negative-feedback penalty is received; considering that channels meeting the secondary user's quality-of-service requirement may all be busy in a certain period, a selected channel that is idle but does not meet the quality-of-service requirement still yields a small positive feedback, denoted ε above; D_i represents the throughput obtained on the i-th channel, η is the secondary user's throughput threshold, and (D_i - η)/η, the difference between the throughput obtained on the i-th channel and the threshold η normalized by η, mainly guides the secondary user to select the better channels.
CN202110107271.7A 2021-01-27 2021-01-27 Channel quality access method in cognitive radio Active CN112954814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110107271.7A CN112954814B (en) 2021-01-27 2021-01-27 Channel quality access method in cognitive radio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110107271.7A CN112954814B (en) 2021-01-27 2021-01-27 Channel quality access method in cognitive radio

Publications (2)

Publication Number Publication Date
CN112954814A CN112954814A (en) 2021-06-11
CN112954814B (en) 2022-05-20

Family

ID=76237380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110107271.7A Active CN112954814B (en) 2021-01-27 2021-01-27 Channel quality access method in cognitive radio

Country Status (1)

Country Link
CN (1) CN112954814B (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108471619B (en) * 2018-03-22 2021-02-02 中南大学 Channel selection method of cognitive wireless sensor network
CN109089307B (en) * 2018-07-19 2021-05-18 浙江工业大学 Energy-collecting wireless relay network throughput maximization method based on asynchronous dominant actor critic algorithm
CN109379752B (en) * 2018-09-10 2021-09-24 ***通信集团江苏有限公司 Massive MIMO optimization method, device, equipment and medium
WO2020152389A1 (en) * 2019-01-22 2020-07-30 Nokia Solutions And Networks Oy Machine learning for a communication network
CN110190918B (en) * 2019-04-25 2021-04-30 广西大学 Cognitive wireless sensor network spectrum access method based on deep Q learning
CN110492955B (en) * 2019-08-19 2021-11-23 上海应用技术大学 Spectrum prediction switching method based on transfer learning strategy
CN110691422B (en) * 2019-10-06 2021-07-13 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN111262638B (en) * 2020-01-17 2021-09-24 合肥工业大学 Dynamic spectrum access method based on efficient sample learning
CN112188503B (en) * 2020-09-30 2021-06-22 南京爱而赢科技有限公司 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Also Published As

Publication number Publication date
CN112954814A (en) 2021-06-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant