CN112512121A - Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm - Google Patents

Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm

Info

Publication number
CN112512121A
CN112512121A
Authority
CN
China
Prior art keywords
communication
radio frequency
frequency band
determining
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011432004.9A
Other languages
Chinese (zh)
Inventor
林霏
曹士龙
林硕
张帮谦
刘玉英
刘璨
刘康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202011432004.9A priority Critical patent/CN112512121A/en
Publication of CN112512121A publication Critical patent/CN112512121A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453Resources in frequency domain, e.g. a carrier in FDMA
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/382Monitoring; Testing of propagation channels for resource allocation, admission control or handover
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/56Allocation or scheduling criteria for wireless resources based on priority criteria

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Electromagnetism (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application discloses a radio frequency spectrum dynamic allocation method and device based on a reinforcement learning algorithm, which address the problem that idle, under-utilized spectrum makes spectrum resources scarce. The method comprises: constructing a radio communication system model and determining the assignable radio frequency bands and bandwidths; determining the key performance indicators and corresponding priorities of different communication scenarios; and, according to the key performance indicators and priorities of the communication scenarios, allocating corresponding frequency bands and bandwidths to the different scenarios by means of reinforcement learning and a deep neural network. In the method, the DQN algorithm is applied to the band allocation strategy of multiple communication scenarios in a 5G Internet of Things environment, flexible allocation of radio frequency bands is realized, and the utilization of the 5G spectrum and the reliability of the communication system can be improved.

Description

Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm
Technical Field
The present invention relates to the field of spectrum allocation technologies, and in particular, to a radio spectrum dynamic allocation method and apparatus based on a reinforcement learning algorithm.
Background
With the acceleration of 5G commercialization, spectrum has become an important constraint on wireless communication systems. There are two approaches to relieving this constraint in 5G. First, already-allocated frequencies can be released for use by the 5G system. However, 5G deployment does not completely abandon 4G technology but upgrades on its basis, so releasing frequencies under the traditional fixed-allocation scheme cannot provide enough bandwidth to the system. Second, communication can move to the higher millimeter-wave bands. However, the severe propagation attenuation in those high frequency bands limits the coverage of the system, so this can only serve as a supplementary measure.
In addition, a major cause of the shortage of spectrum resources is low spectrum utilization: a large amount of allocated spectrum lies idle in both time and space and is not fully used.
Therefore, a radio spectrum dynamic allocation method that optimizes allocation and makes full use of the spectrum is needed.
Disclosure of Invention
The embodiments of the present application provide a radio spectrum dynamic allocation method and device based on a reinforcement learning algorithm, aiming to solve the problem that spectrum resources are scarce because idle spectrum is not fully utilized.
The radio frequency spectrum dynamic allocation method based on the reinforcement learning algorithm provided by the embodiment of the application comprises the following steps:
constructing a radio communication system model, and determining assignable radio frequency bands and bandwidths;
determining key performance indexes and corresponding priorities corresponding to different communication scenes;
and respectively allocating corresponding frequency bands and bandwidths to different communication scenes by utilizing a reinforcement learning and deep neural network according to the key performance indexes and priorities of the communication scenes.
In one example, the method for allocating corresponding frequency bands and bandwidths to different communication scenes by using the reinforcement learning and deep neural network according to the key performance indexes and priorities of the communication scenes specifically includes: generating agents of different initial positions corresponding to different communication scenes through a deep Q network, and respectively determining targets corresponding to the communication scenes from all radio frequency bands and bandwidths; and determining the radio frequency band allocated to each communication scene through a deep Q network, and allocating the determined radio frequency band to the corresponding communication scene according to the priority of each communication scene when determining that the radio frequency band belongs to the allocable radio frequency band and the bandwidth.
In one example, determining a target corresponding to each communication scenario specifically includes: determining a first target and a second target corresponding to each communication scene; wherein the priority of the first target is higher than the priority of the second target; according to the priority of each communication scenario, allocating the radio frequency band corresponding to the target to the corresponding communication scenario, specifically including: according to the priority of each communication scene, sequentially distributing radio frequency bands to each communication scene; aiming at each communication scene, distributing a radio frequency band to the communication scene according to the priority of a plurality of targets corresponding to the communication scene; and determining that the radio frequency band corresponding to the first target of the communication scene is allocated, and allocating the radio frequency band corresponding to the second target to the communication scene.
In one example, the method for allocating corresponding frequency bands and bandwidths to different communication scenes by using the reinforcement learning and deep neural network according to the key performance indexes and priorities of the communication scenes specifically includes: determining radio frequency bands allocated to each communication scene through state input and Q value output in a deep Q network; and determining whether the radio frequency band allocated to each communication scene meets the target corresponding to each communication scene, and determining the corresponding reward value.
In one example, the communication scenarios include vehicle-to-everything V2X, mobile user, and device-to-device D2D; the V2X communication scenario has a higher priority than the mobile user communication scenario, which has a higher priority than the D2D communication scenario.
In one example, the key performance indicators of the V2X communication scenario are latency, capacity; respectively allocating corresponding frequency bands and bandwidths to different communication scenes, specifically comprising: and determining the radio frequency band determined by the reinforcement learning as the highest frequency band in the assignable radio frequency band and the bandwidth, and assigning the highest frequency band to the V2X communication scene.
In one example, the key performance indicators of the mobile user communication scenario are reliability, capacity of channel data transmission; respectively allocating corresponding frequency bands and bandwidths to different communication scenes, specifically comprising: and determining the radio frequency band determined by the reinforcement learning as the maximum bandwidth in the allocable radio frequency band and the bandwidth, and allocating the maximum bandwidth to the communication scene of the mobile user.
In one example, the key performance indicators of the D2D communication scenario are capacity and reliability of channel data transmission; allocating corresponding frequency bands and bandwidths to the different communication scenarios specifically includes: determining that the radio frequency band determined by reinforcement learning meets the characteristic of short terminal separation in the D2D communication scenario, and allocating the radio frequency band to the D2D communication scenario.
In one example, the method further comprises: determining the communication quality of a channel connected with a terminal under different communication scenes; and under the condition that the communication quality is lower than a preset value, a radio frequency band and a bandwidth are newly allocated to the corresponding communication scene through the reinforcement learning and deep neural network.
The radio frequency spectrum dynamic allocation device based on the reinforcement learning algorithm provided by the embodiment of the application comprises:
the building module is used for building a radio communication system model and determining assignable radio frequency bands and bandwidths;
the determining module is used for determining key performance indexes and corresponding priorities corresponding to different communication scenes;
and the distribution module is used for respectively distributing corresponding frequency bands and bandwidths to different communication scenes according to the key performance indexes and the priorities of the communication scenes by utilizing the reinforcement learning and deep neural network.
The embodiment of the application provides a radio frequency spectrum dynamic allocation method and device based on a reinforcement learning algorithm, which at least have the following beneficial effects: in the 5G Internet of things environment, the DQN algorithm is applied to frequency band allocation strategies in a V2X communication scene, a mobile user data transmission communication scene and a D2D relay transmission communication scene, flexible allocation of radio frequency bands is achieved, and the utilization rate of a 5G frequency spectrum and the reliability of a communication system can be improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a cognitive cycle provided in an embodiment of the present application;
fig. 2 is a flowchart of a radio spectrum dynamic allocation method based on a reinforcement learning algorithm according to an embodiment of the present application;
FIG. 3 is a schematic view of V2X provided in the embodiments of the present application;
fig. 4 is a schematic diagram of D2D communication classification provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a radio communication system model provided in an embodiment of the present application;
fig. 6(1) is a graph of learning time when the learning rate is 0.01 and the replacement target is 200 according to the embodiment of the present application;
fig. 6(2) is a graph of learning time when the learning rate is 0.01 and the replacement target is 500 according to the embodiment of the present application;
fig. 6(3) is a graph of learning time when the learning rate is 0.01 and the replacement target is 800 according to the embodiment of the present application;
fig. 6(4) is a graph of learning time when the learning rate is 0.03 and the replacement target is 200 according to the embodiment of the present application;
fig. 6(5) is a graph of learning time when the learning rate is 0.03 and the replacement target is 500 according to the embodiment of the present application;
fig. 6(6) is a graph of learning time when the learning rate is 0.03 and the replacement target is 800 according to the embodiment of the present application;
fig. 6(7) is a graph of learning time when the learning rate is 0.06 and the replacement target is 200 according to the embodiment of the present application;
fig. 6(8) is a graph of learning time when the learning rate is 0.06 and the replacement target is 500 according to the embodiment of the present application;
fig. 6(9) is a graph of learning time when the learning rate is 0.06 and the replacement target is 800 according to the embodiment of the present application;
fig. 7 is a schematic structural diagram of a radio spectrum dynamic allocation apparatus based on a reinforcement learning algorithm according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Cognitive radio systems typically have two basic kinds of users: licensed users (LUs) and cognitive users (CUs); sharing spectrum resources between LUs and CUs is the core idea of cognitive radio. On the premise of causing no interference to the LUs, who hold the legal right to use the spectrum, CUs sense the surrounding radio environment and opportunistically access the spectrum so as to improve spectrum utilization. The technology realizes access to multiple frequency bands through dynamic spectrum allocation and makes full use of idle spectrum.
Cognitive radio can interact with its environment in real time to determine suitable communication parameters and adapt to the dynamic wireless environment. This task requires adaptive operation in the open spectrum and, as shown in fig. 1, is referred to as the cognitive cycle. The cognitive cycle comprises three main steps: spectrum sensing, spectrum analysis and spectrum decision.
Once an operating frequency band is selected, communication may be conducted over that band. However, since the radio environment changes constantly over time and space, the cognitive radio must track those changes. The spectrum mobility management function provides seamless transmission if the currently used frequency band becomes unavailable. Any environmental change during transmission, such as the appearance of a primary user, the movement of a user, or a change in traffic, can trigger such an adjustment.
Spectrum sensing is a prerequisite for spectrum sharing and the basis of cognitive radio applications. Spectrum sensing continuously detects, in the multidimensional space of time, frequency and space, the frequency bands allocated to primary users, so as to obtain the usage state of the spectrum. Mainstream spectrum sensing techniques include matched-filter detection, energy detection, and so on.
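As an editorial illustration only (not part of the claimed method), a minimal energy-detection sketch in Python, assuming complex baseband samples and a hypothetical decision threshold, might look as follows:

```python
import numpy as np

def energy_detect(samples: np.ndarray, threshold: float) -> bool:
    """Declare the sensed band occupied if the average received energy exceeds a threshold.

    `samples` are complex baseband samples of the sensed band; `threshold` is a
    hypothetical value that would in practice be set from the noise floor and the
    desired false-alarm probability.
    """
    energy = np.mean(np.abs(samples) ** 2)   # average signal energy
    return energy > threshold                # True -> a primary user appears present

# Example: a noise-only (idle) band versus a band carrying a primary-user tone
rng = np.random.default_rng(0)
noise = (rng.standard_normal(1000) + 1j * rng.standard_normal(1000)) / np.sqrt(2)
signal = noise + 0.8 * np.exp(1j * 2 * np.pi * 0.1 * np.arange(1000))
print(energy_detect(noise, threshold=1.3), energy_detect(signal, threshold=1.3))
```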
Spectrum sharing techniques can be classified by access technology into Overlay spectrum sharing and Underlay spectrum sharing. Overlay sharing lets cognitive users access only the portions of spectrum not being used by licensed users, which minimizes interference to the primary-user system. Underlay sharing exploits the increased bandwidth offered by spread-spectrum techniques developed for cellular networks. Overlay performs better than Underlay when inter-user interference is large or full system knowledge is available. The goal of dynamic spectrum access is to find spectrum opportunities left free by primary users so that cognitive users can access them opportunistically.
In the present application, the deep Q network (DQN) algorithm from reinforcement learning (RL) is applied to the study of multi-user dynamic spectrum access. For the complex communication environment, numerous parameters and diverse user states of a cognitive radio network, and exploiting the strength of the DQN reinforcement-learning algorithm in handling large numbers of complex parameters, a 5G cognitive radio communication scenario is proposed; the dynamic spectrum allocation strategy in cognitive radio is studied with an implementation written in Python, which optimizes dynamic spectrum access performance, makes full use of the spectrum, and alleviates the shortage of spectrum resources.
Fig. 2 is a flowchart of a radio spectrum dynamic allocation method based on a reinforcement learning algorithm according to an embodiment of the present application, which specifically includes the following steps:
s201: and constructing a radio communication system model, and determining assignable radio frequency bands and bandwidths.
The radio communication system comprises free radio frequency bands, which can be reallocated to the respective terminals. The terminals include devices such as mobile phones, vehicle-mounted communication equipment and wireless broadcasting equipment in the different communication scenarios.
S202: and determining key performance indexes and corresponding priorities corresponding to different communication scenes.
In the embodiment of the present application, the communication scenarios may include vehicle-to-everything (V2X), mobile user, device-to-device (D2D), and the like. The priority of the V2X communication scenario is higher than that of the mobile user communication scenario, which in turn is higher than that of the D2D communication scenario.
V2X means that the vehicle is connected to everything; it is the direct-connection communication technology of intelligent connected vehicles. In short, a vehicle equipped with such a system can, in autonomous driving mode, automatically select the driving route with the best road conditions by analyzing real-time traffic information, thereby greatly relieving traffic congestion. As shown in fig. 3, V2X includes vehicle-to-vehicle (V2V), vehicle-to-pedestrian (V2P), vehicle-to-infrastructure (V2I), and so on.
At present, the Internet of Vehicles mainly uses 4G networks for information exchange, and the vehicle's image processing and algorithm computation are delayed, which delays the vehicle's control decisions and reduces vehicle safety. The advent of 5G networks solves this problem well. Because a 5G network offers low latency, wide bandwidth and high data rates, a driving vehicle can process information faster, which helps it accelerate road recognition, safety recognition and obstacle detection, improves driving safety, and protects the driver and passengers.
Radio spectrum is a key resource for intelligent connected vehicles, and applying 5G to V2X is a challenge for spectrum utilization. At present, countries around the world have established policies allocating the 5.9 GHz band to V2X to meet its low-latency and high-rate requirements, thereby greatly improving safety.
D2D is similar to machine-to-machine (M2M) communication in the Internet of Things; D2D aims to let user devices within a certain distance communicate directly so as to reduce the load on the serving base station. The frequency bands used are generally ISM (Industrial, Scientific and Medical) bands. At present, the 2.4 GHz band is designated as an ISM band in most countries and can be used without a license or fee, provided that a transmit power limit (generally below 1 W) is observed and no interference is caused to other bands.
As shown in fig. 4, D2D communication is divided into centralized control and distributed control. In centralized control, the base station controls the D2D connection and obtains all link information from the measurements reported by the terminals, but this increases the signaling load. In distributed control, the establishment and maintenance of the D2D link are performed autonomously by the D2D devices; link information between D2D devices is easier to obtain than under centralized control, but the complexity of the D2D devices may increase.
In 5G, application scenarios for D2D communication may include: local service, such as local business data transmission, location-based advertising, marketing, maps and friend-making services; relay transmission, in which a relay user helps a user with a poor edge signal communicate with the base station, improving base-station coverage; emergency communication, when a base station is damaged in a disaster; and smart home, in which the mobile terminal becomes the control center of the home Internet of Things.
As 5G communication technology develops, data traffic keeps increasing and user coverage becomes wider. In the present application, the D2D mobile communication technology is used for relay transmission terminals, which ensures communication among communication facilities and improves the utilization of spectrum resources.
Further, the performance indicator the V2X communication scenario cares about most is millisecond-level latency, followed by capacity and the like, so its key performance indicators are latency and capacity. In the mobile user communication scenario, the reliability of channel data transmission matters most, followed by capacity, so its key performance indicators are reliability and capacity of channel data transmission. The D2D communication scenario focuses on capacity while also considering indicators such as reliability, so its key performance indicators are capacity and reliability of channel data transmission.
S203: and respectively allocating corresponding frequency bands and bandwidths to different communication scenes by utilizing a reinforcement learning and deep neural network according to the key performance indexes and priorities of the communication scenes.
Through the deep Q network, agents with different initial positions corresponding to the different communication scenarios can be generated, and the target corresponding to each communication scenario is determined from all radio frequency bands and bandwidths of the wireless communication system. A target represents the preferred radio frequency band best suited to the corresponding communication scenario.
Then, when radio bands are allocated, the radio band allocated to each communication scenario can be determined by the deep Q network. If the determined radio band belongs to the allocable radio bands and bandwidths, it is idle and can be allocated, and it is allocated to the corresponding communication scenario according to the priority of each scenario.
The reinforcement learning algorithm is a branch of machine learning that can adjust its behavior autonomously through interaction with an unknown, complex environment so as to obtain the maximum reward, which makes it suitable for a cognitive radio network whose future conditions are unpredictable. RL studies self-learning, adaptive agents that take actions in an environment with the goal of maximizing the return of those actions. Agents learn to improve performance simply by observing state changes in their operating environment and the reward feedback received after taking an action.
The Q-learning algorithm uses a table to store, for each state, the Q value of every action in that state; by interacting with the environment, it progressively optimizes the transmission parameters according to the rewards earned. However, if the state space and the number of actions are large, Q-learning learns slowly and its robustness against disturbance degrades.
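For reference, the tabular Q-learning update described above can be sketched as follows; this is a minimal illustration, not the patented procedure, and the state/action sizes and hyperparameters are placeholders:

```python
import numpy as np

n_states, n_actions = 4, 2            # e.g. channel-occupancy states x access actions (placeholders)
alpha, gamma = 0.1, 0.9               # learning rate and reward discount factor (placeholders)
Q = np.zeros((n_states, n_actions))   # the Q table: one value per (state, action) pair

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """One Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```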
The present application addresses this drawback by building a DQN. DQN is an algorithm that combines a neural network with Q-learning: the neural network learns the Q values, which are updated according to the Bellman equation; in other words, the large number of parameters of the complex environment are mapped by the neural network to a small number of Q values, on which Q-learning optimization is then performed. DQN retains the autonomy of Q-learning in learning from the environment, while the neural network overcomes the difficulty of convergence when the action-state space is large; at its core, approximating the Q-value function with an artificial neural network accelerates the convergence of Q-learning.
In machine learning, neural networks are commonly used to process large numbers of parameters, so the state and the action can be used as inputs to the neural network, which outputs the Q value of each action after analysis; the Q values therefore need not be recorded in a table but are generated directly by the network. Following the Q-learning principle, the action with the maximum Q value is then selected as the next action to take.
Specifically, the DQN algorithm can determine the radio band allocated to each communication scenario through its state input and Q-value output. Given an input state s, the network outputs the Q values Q(s, a1), Q(s, a2), and so on for all corresponding actions; following the Q-learning principle, the action with the maximum Q value among all outputs is selected as the next action to perform. The state indicates whether the agent corresponding to a communication scenario has reached its target, and an action represents moving the agent in a particular direction.
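A minimal sketch of such a state-in/Q-out network, assuming PyTorch and illustrative layer sizes that are not specified in the application, could be:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an encoded state vector to one Q value per action, as described above."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                    # outputs Q(s, a1), Q(s, a2), ...

q_net = QNetwork(state_dim=4, n_actions=4)
state = torch.zeros(1, 4)                         # hypothetical encoded agent state
action = int(q_net(state).argmax(dim=1))          # take the action with the largest Q value
```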
Then it is determined whether the radio frequency band allocated to each communication scenario by the DQN algorithm meets the target corresponding to that scenario. If the allocated band meets the scenario's target, the allocation is correct and the corresponding reward value is positive. If the band allocated to the scenario does not meet its target even though the target was reachable (i.e. the corresponding band was idle), the allocation process is biased and the corresponding reward value is negative; if the target was not reachable (the band was occupied) and the agent instead reached another available band, the reward is positive.
For example, the agent generated at position 1 has target 1, and the agent generated at position 2 has target 2. If the agent at position 1 reaches target 1 while target 1 is reachable (i.e. channel 1 is free), the reward is positive; if it reaches target 2 instead, the reward is negative. If the agent at position 1 reaches target 2 while target 1 is not reachable (i.e. channel 1 is occupied), the reward is positive; if it reaches target 1, the reward is negative. If both target 1 and target 2 are occupied, the agent gives up and the next cycle starts. The same applies to the agent generated at position 2.
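The reward rule of this example can be expressed roughly as follows; it is a sketch under the stated assumptions, and the numerical reward values are placeholders rather than values given in the application:

```python
def reward(own_target: str, reached: str, own_target_free: bool) -> float:
    """Reward for an agent whose preferred channel is `own_target`.

    Positive when the agent reaches its own free target, or switches to another
    channel because its own target is occupied; negative otherwise. The values
    +1.0/-1.0 are illustrative placeholders.
    """
    if own_target_free:
        return 1.0 if reached == own_target else -1.0
    return 1.0 if reached != own_target else -1.0
```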
In one embodiment, each communication scenario may correspond to a plurality of targets, including a first target and a second target. The first and second targets have different priorities, the first being higher than the second, so the first target may be called the optimal target and the second the suboptimal target. The terms "first" and "second" do not limit the number of targets; the number of targets corresponding to a communication scenario can be set as needed, and the present application does not limit this.
Then, when radio bands are allocated to the communication scenarios, the bands are first allocated to the scenarios in turn, in order of scenario priority. Second, when a radio band is allocated to a given scenario, the band corresponding to its optimal target is allocated preferentially, following the priority order of that scenario's targets. If the band corresponding to the scenario's optimal target has already been allocated to another scenario, the bands corresponding to the other targets, such as the suboptimal target, are allocated to the scenario in order of priority.
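As an illustration of this priority-ordered assignment (a simplified sketch: scenario names, target lists and band labels are placeholders, and the embodiment performs the selection with the DQN rather than a fixed loop):

```python
def allocate(scenarios, targets, idle_bands):
    """Assign each scenario the highest-priority of its target bands that is still idle.

    `scenarios` is ordered from highest to lowest priority; `targets[s]` lists that
    scenario's bands from optimal to suboptimal; `idle_bands` is the set of bands
    that are currently allocable.
    """
    allocation, free = {}, set(idle_bands)
    for scenario in scenarios:               # higher-priority scenarios choose first
        for band in targets[scenario]:       # optimal target first, then suboptimal
            if band in free:
                allocation[scenario] = band
                free.remove(band)
                break
    return allocation

# Hypothetical example with the three scenarios of the embodiment and placeholder band names
print(allocate(["V2X", "mobile user", "D2D"],
               {"V2X": ["F1", "F3"], "mobile user": ["F2", "F3"], "D2D": ["F3", "F4"]},
               idle_bands={"F1", "F2", "F3", "F4"}))
```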
In one embodiment, the system may dynamically adjust the radio bands allocated to the communication scenarios. The communication quality of the channels connecting the terminals is monitored in each scenario, and if the quality falls below a preset value, indicating poor communication quality, a radio band and bandwidth are reallocated to the corresponding scenario through the DQN algorithm.
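A minimal sketch of this monitoring loop, in which the quality metric, threshold value and reallocation call are placeholders for the DQN-based procedure described above:

```python
QUALITY_THRESHOLD = 0.7   # hypothetical preset value

def monitor_and_reallocate(scenarios, measure_quality, dqn_reallocate):
    """Trigger a new DQN allocation for any scenario whose channel quality drops."""
    for scenario in scenarios:
        if measure_quality(scenario) < QUALITY_THRESHOLD:
            dqn_reallocate(scenario)   # a new band and bandwidth are chosen by the DQN
```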
In the 5G Internet of things environment, the DQN algorithm is applied to frequency band allocation strategies in a V2X communication scene, a mobile user data transmission communication scene and a D2D relay transmission communication scene, and the utilization rate of a 5G frequency spectrum and the reliability of a communication system can be improved.
Further, for the above three communication scenarios, the radio frequency band may be allocated according to the priority of the communication scenario and the priority of the target of the communication scenario.
The V2X communication scenario has the highest priority, mainly because of its latency requirement, so reinforcement learning determines the highest band among the allocable radio bands and bandwidths and allocates it to the V2X scenario. The mobile user communication scenario has the second priority, mainly because of communication reliability, so reinforcement learning determines the largest bandwidth among the allocable bands and bandwidths and allocates it to the mobile user scenario. The D2D communication scenario has the lowest priority, mainly because of capacity, and the separation between its terminals is short; reinforcement learning therefore determines a radio band suited to the short terminal separation of the D2D scenario and allocates it to the D2D communication scenario.
For ease of understanding, the present application provides a specific radio communication system application scenario for illustration.
As shown in fig. 5, in the V2X scenario the street lamps and billboards are regarded as relay ends in the D2D relay transmission application, and they can decide whether to relay according to the channel conditions; users in cars and at the roadside are regarded as mobile users. Among the three communication scenarios, the V2X scenario of the autonomously driven car has the highest priority, and the D2D relay transmission scenario has the lowest priority.
For the three communication scenarios, the frequency band is considered first, the bandwidth of that band second, and finally the most suitable channel is selected for access.
Table 1 shows the frequency bands and corresponding bandwidths proposed in the embodiments of the present application.
TABLE 1
(The table is reproduced only as an image in the original publication; the band symbols referred to below are those listed in it.)
For V2X, millisecond-level latency is the guarantee of vehicle safety, and such low latency can only be obtained in a high-quality channel environment, so to ensure vehicle safety the optimal band is the one designated for V2X in Table 1. The larger the bandwidth, the more data can be transmitted, so if that band becomes congested in a given environment and the channel quality deteriorates, the V2X suboptimal channel listed in Table 1 can be selected instead.
For mobile users, the reliability of data transmission comes first: on the premise of ensuring security, ultra-high-definition video, distortion-free voice calls and high-definition picture browsing all depend on channel reliability, and only with higher reliability can the user experience be further improved. The optimal band is therefore the N41 or N78 band of 5G, i.e. F2 in Table 1. If the mobile user moves into an area with many users, the suboptimal channel listed in Table 1 can be selected.
For D2D relay transmission, the large number of relay terminals requires a larger bandwidth, but the channel fading is small because the spacing between terminals is short; millimeter waves penetrate well over short distances yet fade very quickly over long distances, so they cannot be allocated to terminals communicating over long distances. On the whole, the corresponding band in Table 1 is therefore the optimal choice for the D2D relay terminals. If that band is occupied in both of the above cases, the D2D suboptimal channel listed in Table 1 can be selected.
To evaluate the radio spectrum dynamic allocation method based on the reinforcement learning algorithm described above, the allocation of the radio spectrum was tested in a simulation experiment.
The present simulation only discusses the case of two channels (i.e., frequency bands).
In this simulation experiment there are two channels, Ch1 and Ch2, and two environments, position 1 and position 2, with no prior channel knowledge in either environment. Because each channel may be occupied and the two channels are independent, two randomly generated numbers represent the idle states of the two channels, and the probability that a channel is occupied is assumed to be 30%. Different environments are preset to correspond to different optimal channels: the optimal channel of environment 1 is Ch1 and its suboptimal channel is Ch2; the optimal channel of environment 2 is Ch2 and its suboptimal channel is Ch1.
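The simulated environment can be sketched as follows (a minimal reconstruction under the stated assumptions, not the authors' code):

```python
import numpy as np

P_OCCUPIED = 0.3                       # assumed probability that a channel is occupied
OPTIMAL = {1: "Ch1", 2: "Ch2"}         # preset optimal channel for each environment
SUBOPTIMAL = {1: "Ch2", 2: "Ch1"}      # preset suboptimal channel for each environment

def channel_states(rng: np.random.Generator) -> dict:
    """Draw independent idle/occupied states for the two channels (True means idle)."""
    return {ch: rng.random() >= P_OCCUPIED for ch in ("Ch1", "Ch2")}

rng = np.random.default_rng(0)
print(channel_states(rng), "optimal channel of environment 1:", OPTIMAL[1])
```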
In the simulation, a neural network receives the external parameter information and the action is finally selected in the reinforcement-learning manner; the optimal action is chosen with probability 0.9, i.e. other actions are chosen at random with probability 0.1, in order to avoid getting trapped in a local optimum. The simulation is implemented with two neural networks: a target network predicts the target Q value and its parameters are not updated immediately, while an evaluation network predicts the evaluated Q value and always holds the latest network parameters.
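The two-network arrangement and the 0.9/0.1 action choice can be sketched as follows, reusing the hypothetical QNetwork from the earlier sketch; this is illustrative only:

```python
import copy
import random
import torch

eval_net = QNetwork(state_dim=4, n_actions=2)   # evaluation network: latest parameters
target_net = copy.deepcopy(eval_net)            # target network: synchronized only periodically

def choose_action(state: torch.Tensor, n_actions: int = 2, eps: float = 0.1) -> int:
    """Select the greedy action with probability 0.9 and a random action with probability 0.1."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(eval_net(state).argmax(dim=1))

def maybe_sync(step: int, replace_target: int = 200) -> None:
    """Copy the evaluation network's parameters into the target network every `replace_target` steps."""
    if step % replace_target == 0:
        target_net.load_state_dict(eval_net.state_dict())
```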
The simulation experiment analyzes the learning-time curves obtained when a user searches for the current best channel in the unknown environment, with a reward discount factor of 0.9, a probability of 0.9 of selecting the best action, a batch size of 32, learning rates of 0.01, 0.03 and 0.06, and replacement targets of 200, 500 and 800 (that is, the target-network parameters are updated every 200, 500 or 800 steps).
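The parameter sweep described here amounts to the following grid; the values are taken from the text, while the training call itself is only a placeholder:

```python
from itertools import product

base = {"gamma": 0.9, "eps_greedy": 0.9, "batch_size": 32}   # values given in the text
learning_rates = [0.01, 0.03, 0.06]
replace_targets = [200, 500, 800]   # target-network update interval in steps

for lr, rt in product(learning_rates, replace_targets):
    run_config = {**base, "learning_rate": lr, "replace_target": rt}
    # train_dqn(run_config) would produce one of the nine learning-time curves of fig. 6
    print(run_config)
```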
As shown in figs. 6(1) to 6(9), the abscissa in each figure is the number of training steps and the ordinate is the cost. Figs. 6(1) to 6(9) correspond, in the order listed in the description of the drawings above, to the nine combinations of learning rate (0.01, 0.03, 0.06) and replacement target (200, 500, 800).
Viewed horizontally, when the learning rate is 0.01, convergence is reached gradually after about 200 iterations. Viewed vertically, changing the replacement target has little influence on the cost convergence. After about 500 iterations, matching the user to the spectrum takes about 0.2 s.
When the learning rate is 0.03, comparing fig. 6(2) with fig. 6(5) shows that the cost converges more slowly than at 0.01. Moreover, the larger the replacement target, the slower the cost converges. After about 1000 iterations, matching the user to the spectrum takes about 0.2 s.
Similarly, figs. 6(7) to 6(9) show that the larger the replacement target, the slower the cost converges, and the higher the learning rate, the worse the cost convergence. To better show how the learning curve changes at a learning rate of 0.06, the simulation results of the first 2500 rounds are shown here. At this learning rate, 2500 iterations of learning take about 1 s to match the user to the spectrum. Analysis suggests this is an overfitting phenomenon caused by an excessively high learning rate.
As described above, comparing the learning rates 0.01 and 0.03: although 0.01 converges quickly, the learning rate is low, which has the disadvantage of an excessively high learning cost. Comparing 0.03 and 0.06: a learning rate of 0.06 is too high and prone to overfitting, resulting in slow convergence. Overall, the optimal learning rate is 0.03.
Compared with other algorithms, DQN is suitable for situations in which the state parameters are very numerous and prior information is very scarce, which other supervised learning algorithms cannot handle. With such a multitude of parameters, DQN has great advantages in handling the various complex communication models in 5G.
Based on the same inventive concept, an embodiment of the present application further provides a radio spectrum dynamic allocation apparatus based on a reinforcement learning algorithm, corresponding to the method described above, as shown in fig. 7.
Fig. 7 is a schematic structural diagram of a radio spectrum dynamic allocation apparatus based on a reinforcement learning algorithm according to an embodiment of the present application, which specifically includes:
a building module 701, which builds a radio communication system model and determines assignable radio frequency bands and bandwidths;
a determining module 702, configured to determine key performance indicators and corresponding priorities corresponding to different communication scenarios;
the allocating module 703 allocates corresponding frequency bands and bandwidths to different communication scenes respectively according to the key performance indexes and priorities of the communication scenes by using the reinforcement learning and deep neural network.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A radio frequency spectrum dynamic allocation method based on a reinforcement learning algorithm is characterized by comprising the following steps:
constructing a radio communication system model, and determining assignable radio frequency bands and bandwidths;
determining key performance indexes and corresponding priorities corresponding to different communication scenes;
and respectively allocating corresponding frequency bands and bandwidths to different communication scenes by utilizing a reinforcement learning and deep neural network according to the key performance indexes and priorities of the communication scenes.
2. The method according to claim 1, wherein the step of respectively allocating corresponding frequency bands and bandwidths to different communication scenes according to key performance indexes and priorities of the communication scenes by using a reinforcement learning and deep neural network comprises the following specific steps:
generating agents of different initial positions corresponding to different communication scenes through a deep Q network, and respectively determining targets corresponding to the communication scenes from all radio frequency bands and bandwidths;
and determining the radio frequency band allocated to each communication scene through a deep Q network, and allocating the determined radio frequency band to the corresponding communication scene according to the priority of each communication scene when determining that the radio frequency band belongs to the allocable radio frequency band and the bandwidth.
3. The method according to claim 2, wherein determining the target corresponding to each communication scenario specifically comprises:
determining a first target and a second target corresponding to each communication scene; wherein the priority of the first target is higher than the priority of the second target;
according to the priority of each communication scenario, allocating the radio frequency band corresponding to the target to the corresponding communication scenario, specifically including:
according to the priority of each communication scene, sequentially distributing radio frequency bands to each communication scene;
aiming at each communication scene, distributing a radio frequency band to the communication scene according to the priority of a plurality of targets corresponding to the communication scene;
and determining that the radio frequency band corresponding to the first target of the communication scene is allocated, and allocating the radio frequency band corresponding to the second target to the communication scene.
4. The method according to claim 2, wherein the step of respectively allocating corresponding frequency bands and bandwidths to different communication scenes according to key performance indexes and priorities of the communication scenes by using the reinforcement learning and deep neural network comprises the following specific steps:
determining radio frequency bands allocated to each communication scene through state input and Q value output in a deep Q network;
and determining whether the radio frequency band allocated to each communication scene meets the target corresponding to each communication scene, and determining the corresponding reward value.
5. The method of claim 1, wherein the communication scenarios include vehicle-to-everything V2X, mobile user, and device-to-device D2D; the V2X communication scenario has a higher priority than the mobile user communication scenario, which has a higher priority than the D2D communication scenario.
6. The method of claim 5, wherein the key performance indicators of the V2X communication scenario are latency, capacity;
respectively allocating corresponding frequency bands and bandwidths to different communication scenes, specifically comprising:
and determining the radio frequency band determined by the reinforcement learning as the highest frequency band in the assignable radio frequency band and the bandwidth, and assigning the highest frequency band to the V2X communication scene.
7. The method of claim 5, wherein the key performance indicators of the mobile user communication scenario are reliability, capacity of channel data transmission;
respectively allocating corresponding frequency bands and bandwidths to different communication scenes, specifically comprising:
and determining the radio frequency band determined by the reinforcement learning as the maximum bandwidth in the allocable radio frequency band and the bandwidth, and allocating the maximum bandwidth to the communication scene of the mobile user.
8. The method of claim 5, wherein the key performance indicators of the D2D communication scenario are capacity, reliability of channel data transmission;
respectively allocating corresponding frequency bands and bandwidths to different communication scenes, specifically comprising:
and determining that the radio frequency band determined by reinforcement learning meets the characteristic of short terminal separation in the D2D communication scenario, and allocating the radio frequency band to the D2D communication scenario.
9. The method of claim 1, further comprising:
determining the communication quality of a channel connected with a terminal under different communication scenes;
and under the condition that the communication quality is lower than a preset value, a radio frequency band and a bandwidth are newly allocated to the corresponding communication scene through the reinforcement learning and deep neural network.
10. A radio spectrum dynamic allocation apparatus based on reinforcement learning algorithm, comprising:
the building module is used for building a radio communication system model and determining assignable radio frequency bands and bandwidths;
the determining module is used for determining key performance indexes and corresponding priorities corresponding to different communication scenes;
and the distribution module is used for respectively distributing corresponding frequency bands and bandwidths to different communication scenes according to the key performance indexes and the priorities of the communication scenes by utilizing the reinforcement learning and deep neural network.
CN202011432004.9A 2020-12-10 2020-12-10 Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm Pending CN112512121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011432004.9A CN112512121A (en) 2020-12-10 2020-12-10 Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011432004.9A CN112512121A (en) 2020-12-10 2020-12-10 Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm

Publications (1)

Publication Number Publication Date
CN112512121A true CN112512121A (en) 2021-03-16

Family

ID=74970243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011432004.9A Pending CN112512121A (en) 2020-12-10 2020-12-10 Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN112512121A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113721574A (en) * 2021-09-07 2021-11-30 中国联合网络通信集团有限公司 Compliance control method, MEC, field unit, compliance control system and device
CN114826449A (en) * 2022-05-05 2022-07-29 厦门大学 Map-assisted Internet of vehicles anti-interference communication method based on reinforcement learning
CN116321467A (en) * 2023-03-22 2023-06-23 中国人民解放军93216部队 Frequency assignment method based on matching degree of frequency requirements
US11711862B1 (en) 2021-07-15 2023-07-25 T-Mobile Usa, Inc. Dual connectivity and carrier aggregation band selection

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140123195A1 (en) * 2012-10-30 2014-05-01 Kt Corporation Control video content play speed
US20170192819A1 (en) * 2015-12-31 2017-07-06 Le Holdings (Beijing) Co., Ltd. Method and electronic device for resource allocation
CN108347457A (en) * 2017-01-25 2018-07-31 电信科学技术研究院 A kind of communication means and communication equipment
CN108809456A (en) * 2018-07-04 2018-11-13 天津大学 A kind of centralized cognitive radio spectrum allocation method based on improvement intensified learning
US20190129753A1 (en) * 2017-10-31 2019-05-02 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method for Resource Allocation and Terminal Device
CN110213796A (en) * 2019-05-28 2019-09-06 大连理工大学 A kind of intelligent resource allocation methods in car networking
CN111328133A (en) * 2018-12-17 2020-06-23 上海大学 V2X resource allocation method based on deep neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140123195A1 (en) * 2012-10-30 2014-05-01 Kt Corporation Control video content play speed
US20170192819A1 (en) * 2015-12-31 2017-07-06 Le Holdings (Beijing) Co., Ltd. Method and electronic device for resource allocation
CN108347457A (en) * 2017-01-25 2018-07-31 电信科学技术研究院 A kind of communication means and communication equipment
US20190129753A1 (en) * 2017-10-31 2019-05-02 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method for Resource Allocation and Terminal Device
CN108809456A (en) * 2018-07-04 2018-11-13 天津大学 A kind of centralized cognitive radio spectrum allocation method based on improvement intensified learning
CN111328133A (en) * 2018-12-17 2020-06-23 上海大学 V2X resource allocation method based on deep neural network
CN110213796A (en) * 2019-05-28 2019-09-06 大连理工大学 A kind of intelligent resource allocation methods in car networking

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11711862B1 (en) 2021-07-15 2023-07-25 T-Mobile Usa, Inc. Dual connectivity and carrier aggregation band selection
CN113721574A (en) * 2021-09-07 2021-11-30 中国联合网络通信集团有限公司 Compliance control method, MEC, field unit, compliance control system and device
CN114826449A (en) * 2022-05-05 2022-07-29 厦门大学 Map-assisted Internet of vehicles anti-interference communication method based on reinforcement learning
CN114826449B (en) * 2022-05-05 2023-04-18 厦门大学 Map-assisted Internet of vehicles anti-interference communication method based on reinforcement learning
CN116321467A (en) * 2023-03-22 2023-06-23 中国人民解放军93216部队 Frequency assignment method based on matching degree of frequency requirements
CN116321467B (en) * 2023-03-22 2024-02-02 中国人民解放军93216部队 Frequency assignment method based on matching degree of frequency requirements

Similar Documents

Publication Publication Date Title
Wang et al. Intelligent cognitive radio in 5G: AI-based hierarchical cognitive cellular networks
CN112512121A (en) Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm
Singh et al. Cognitive radio for vehicular ad hoc networks (CR-VANETs): approaches and challenges
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
Park et al. Network resource optimization with reinforcement learning for low power wide area networks
Liu et al. Reinforcement learning based dynamic spectrum access in cognitive internet of vehicles
Gupta et al. Application aware networks' resource selection decision making technique using group mobility in vehicular cognitive radio networks
Wang et al. A multi-objective model-based vertical handoff algorithm for heterogeneous wireless networks
CN108024231B (en) vehicle-mounted network data transmission energy consumption optimization method and system
CN114630299B (en) Information age perceivable resource allocation method based on deep reinforcement learning
Bayrakdar et al. Fuzzy logic based spectrum handoff decision for prioritized secondary users in cognitive radio networks
Suganthi et al. An efficient scheduling algorithm using queuing system to minimize starvation of non-real-time secondary users in cognitive radio network
CN103618674B (en) A united packet scheduling and channel allocation routing method based on an adaptive service model
Liu et al. A Q-learning based adaptive congestion control for V2V communication in VANET
Chen et al. Efficient spectrum utilization via cross-layer optimization in distributed cognitive radio networks
Skondras et al. A survey on medium access control schemes for 5G vehicular cloud computing systems
Riaz et al. A novel white space optimization scheme using memory enabled genetic algorithm in cognitive vehicular communication
CN116132952A (en) Deep reinforcement learning-based internet of vehicles priority-distinguishing spectrum allocation method
Horng et al. Using intelligent vehicle infrastructure integration for reducing congestion in smart city
Li et al. Channel allocation scheme based on greedy algorithm in cognitive vehicular networks
CN114916087A (en) Dynamic spectrum access method based on India buffet process in VANET system
Veeramakali et al. Intelligent dynamic spectrum allocation with bandwidth flexibility in cognitive radio network
CN108055667A (en) A kind of cognition vehicle network joint route selection and resource allocation methods
Sreenivasulu et al. Deep learning based efficient channel allocation algorithm for next generation cellular networks
Zargarzadeh et al. A consensus-based cooperative Spectrum sensing technique for CR-VANET

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210316