CN117202377A - Network conflict-free resource allocation method based on double deep Q network and conflict degree algorithm - Google Patents


Info

Publication number
CN117202377A
Authority
CN
China
Prior art keywords: network, resource allocation, conflict, action, dual
Legal status: Pending
Application number
CN202311173545.8A
Other languages
Chinese (zh)
Inventor
黄杰
杨凡
张仕龙
杨成
赖显智
喻涛
丁睿杰
杨川
Current Assignee
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202311173545.8A
Publication of CN117202377A


Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a network conflict-free resource allocation method based on a double deep Q network (DDQN) and a conflict degree algorithm. The method constructs a single-ring multi-graph model from the user interference relationships of the network environment, and formulates an optimization problem that maximizes the whole-network resource reuse rate ρ under the constraint of no resource conflict. The optimization problem is then modeled as a Markov decision process over the single-ring multi-graph model. A resource allocation model is constructed, comprising an environment module and a DDQN network module: the environment module simulates the user interference relationships in the network environment, and the DDQN network module comprises a main network, a target network, a replay memory D, an ε-greedy algorithm unit and a loss function calculation unit. Finally, the action corresponding to the maximum action-value function is selected according to the DDQN algorithm and the conflict degree algorithm to obtain the optimal conflict-free resource allocation strategy. The invention achieves conflict-free allocation and maximizes the resource reuse rate.

Description

Network conflict-free resource allocation method based on double deep Q network and conflict degree algorithm
Technical Field
The invention relates to the field of network communication resource allocation, and in particular to a network conflict-free resource allocation method based on a double deep Q network and a conflict degree algorithm.
Background
With the rapid growth in the number of Internet of Things devices, 6G-IoT has a greater number of devices and a wider coverage range, resulting in complex overlapping interference. How to efficiently manage resources in 6G-IoT is therefore a challenge, and the core difficulty is improving resource utilization while avoiding interference. Online resource allocation algorithms designed with Deep Reinforcement Learning (DRL) methods can improve the resource utilization rate. However, most existing studies only provide resource scheduling for individual network elements, and few consider global resource management using global network information, which leaves resources underutilized.
To increase the resource reuse rate, researchers have begun to use Digital Twin (DT) technology to assist resource scheduling from the perspective of the overall network. A DT-assisted network can remarkably improve the resource reuse rate. However, in 6G-IoT, large-scale overlapping of networks is very common and the resulting overlapping interference is unavoidable, lowering the QoS and throughput of the network. Few studies fully consider resource management under large-scale overlap, and how to effectively avoid large-scale overlapping interference in DT-6G-IoT remains an open problem.
Disclosure of Invention
In view of the above, the present invention aims to provide a network conflict-free resource allocation method based on a double deep Q network and a conflict degree algorithm, which can improve the network resource reuse rate and the network throughput.
The invention aims at realizing the following technical scheme:
the invention provides a network conflict-free resource allocation method based on a dual deep Q network and a conflict degree algorithm, which comprises the following steps:
step 1: acquiring the user interference relationships of the network environment to be allocated resources, constructing a single-ring multi-graph model, and constructing an optimization problem with maximization of the whole-network resource reuse rate ρ as the objective under the constraint of no resource conflict;
step 2: modeling the optimization problem as a Markov decision process over the single-ring multi-graph model, the Markov decision process comprising six elements: a state space, an action space, a state transition matrix, an immediate reward function, a policy and an action-value function;
step 3: constructing a resource allocation model comprising an environment module and a DDQN network module, where the environment module is used to simulate the user interference relationships in the network environment, and the DDQN network module comprises a main network, a target network, a replay memory D, an ε-greedy algorithm unit and a loss function calculation unit;
step 4: selecting the action corresponding to the maximum action-value function according to the DDQN algorithm and the conflict degree algorithm to obtain the optimal conflict-free resource allocation strategy.
Further, in the step 1, the construction of the single-ring multi-graph model includes the following steps:
s11: modeling the interference areas in the network environment according to hypergraph theory, and constructing a hypergraph model expressed as $H=\{V,E\}$, where $V=\{v_1,v_2,\ldots,v_n\}$ is the vertex set, representing the set of users, and $E=\{e_1,e_2,\ldots,e_m\}$ is the hyperedge set, representing the set of ICs; the relationship between vertices and hyperedges is represented by an incidence matrix $I_h$, whose columns correspond to the vertices and whose rows correspond to the hyperedges, where

$$I_h=\{a_{(i,j)}\},\quad i=1,2,\ldots,n,\ j=1,2,\ldots,m$$
s12: converting the hypergraph model into a dual hypergraph model according to dual theory, where in the dual hypergraph $E=\{e_1,e_2,\ldots,e_m\}$ represents the vertex set, i.e. the IC set, and $V=\{v_1,v_2,\ldots,v_n\}$ represents the edge set, i.e. the user set; a ring on a vertex represents a non-overlapping user and an edge connecting two vertices represents an overlapping user; the dual hypergraph is associated with an incidence matrix $I_{dh}$;
s13: converting the dual hypergraph model into a multi-graph model through matrix transformation, the multi-graph model being associated with an adjacency matrix $A_{multi}$; in the multi-graph model each edge is associated with exactly two vertices and multiple edges may exist between two vertices, and if a hyperedge is associated with more than two vertices it is split into multiple edges each associated with two vertices;
S14: converting the multi-graph model to a single-loop multi-graph model, comprising:
to overlap users from A mulit Separated from the medium A mulit Expressed as:
A mulit =A mu +A su
wherein A is mu Representing the number of overlapping users between ICs by providing a mulit Setting the diagonal element to 1 for implementation, and the other values are kept unchanged for implementation; a is that su Representing the number of non-overlapping users in each IC, by A su =A mulit -A mu Obtaining;
according to A mu And obtaining a single-ring multi-graph model, wherein the single-ring multi-graph model corresponds to an incidence matrix M, the columns of the incidence matrix M correspond to users in the single-ring multi-graph, and the rows correspond to the supersides.
Further, S13 includes:
(1) calculating the number of rings on each vertex of the dual hypergraph, comprising:

summing the columns of $I_{dh}$ and setting every element other than 1 to 0 to obtain a sequence R*, the number of elements in R* being equal to the number of edges in the dual hypergraph;

using the one-dimensional column vector R* to record whether an edge is a ring, where R*(n)=1 indicates that edge $v_n$ is a ring and R*(i) denotes the i-th element of R*;

performing a logical AND of $I_{dh}$ with R* and summing the resulting elements to obtain the number of rings on each vertex of the multi-graph, expressed as $\operatorname{diag}(A_{multi})=I_{dh}\cdot R^{*}$, where "·" represents the logical AND operation;

(2) calculating the edges between different vertices of the dual hypergraph, comprising:

performing a logical AND on the elements of any two rows of $I_{dh}$ and summing the result to obtain the number of edges between the two vertices of the multi-graph corresponding to those two rows;

(3) taking the number of rings of each vertex of the multi-graph as the diagonal elements of $A_{multi}$, and the number of edges between different vertices as the off-diagonal elements of $A_{multi}$.
Further, the resource reuse rate ρ in step 1 is defined in terms of the total number of users φ and the total number of resources $k_c$.
Further, the conflict-free resource constraint in step 1 is formulated as:

$$D_{Conflict}(v_i,k)=0,\quad\forall v_i\in V,\ \forall k\in K$$

where K represents the resource set; $D_{Conflict}(v_i,k)$ denotes the degree of conflict of user $v_i$ using resource k; $M_{Conflict}$, the conflict incidence matrix (i.e. the incidence matrix M), represents the set of conflicts between user $v_i$ and the other users; and $e(v_i)$ is an edge set in the single-ring multi-graph model such that user $v_i$ and the edges in $e(v_i)$ are all associated with the same vertex of the single-ring multi-graph model.
further, step 2 includes the steps of:
s21: determining the action space A, where for any time t, $a_t\in A$; with assigned target user $v_i$ and allocated resource k, the action $a_t$ is defined as:

$$a_t=\{v_i,k\},\quad k\in K$$

where K represents the resource set;
s22: determining the state space S, where state $s_t$ corresponds to the communication resource allocation situation of the users in the single-ring multi-graph model at time t, and the initial state $s_0$ represents the state without any resource allocation; the Markov state is defined using an expanded incidence matrix, where at time $t_i$ the expanded incidence matrix $M_a(t_i)$ is composed of:

$M(t_i)$, the incidence matrix M corresponding to the single-ring multi-graph model at time $t_i$; and

$E_k(t_i)$, the communication resources allocated to each user in the single-ring multi-graph model at time $t_i$, in which an element of 0 represents an unallocated resource;

s23: determining the immediate reward function $r_t$, an immediate reward function based on the degree of conflict, where $D_{Conflict}(t)$ denotes the degree of conflict at time t and $r(s_t,a_t)$ denotes the environment reward for selecting action $a_t$ in state $s_t$;
s24: determining the state transition matrix P, where P is composed of the transition probabilities between states, with the transition probability defined as $p(s_{t+1}\mid s_t,a_t,s_{t-1},a_{t-1},\ldots)=p(s_{t+1}\mid s_t,a_t)=p(s_t,a_t,s_{t+1})$, representing, for an arbitrary time t, the probability that state $s_t$ transitions to the next state $s_{t+1}$ under action $a_t$; for a policy π,

$$P^{\pi}_{ss'}=\sum_{a\in A}\pi(a\mid s)\,p(s,a,s')$$

where $P^{\pi}_{ss'}$ represents the probability that policy π transitions state $s_t$ to the next state $s_{t+1}$, $\pi(a\mid s)=p(a_t=a\mid s_t=s)$ represents the probability of applying action a to state s at time t under the policy, and $a\sim\pi(\cdot\mid s)$, with "~" denoting sampling from the probability distribution;

S25: determining the action-value function $q_\pi(s,a)$ under policy π, where $q_\pi(s,a)$ represents the expected discounted return obtained after applying action a to state s under policy π, formulated as:

$$q_\pi(s,a)=\mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k+1}\,\middle|\,s_t=s,a_t=a\right]$$

where $\mathbb{E}[\cdot]$ represents the expectation operator and γ represents the discount factor.
Further, step 4 includes the steps of:
s41: initializing the DDQN network and its parameters, including initializing the main network weight w and the target network weight θ, and selecting the initial action $a_0$ and initial state $s_0$;

S42: based on the probability threshold ε, selecting a random action $a_t$ in the action space A with probability ε, and otherwise selecting the action that maximizes the action-value function;

s43: calculating the immediate reward $r_t$ according to the selected action $a_t$ and state $s_t$, and obtaining the expanded incidence matrix $M_a(t+1)$ at the next time t+1;

S44: using the expanded incidence matrix $M_a(t+1)$ at time t+1 to obtain the state at time t+1, i.e. $s_{t+1}=M_a(t+1)$, and storing the current experience $e_t=\langle s_t,a_t,r_t,s_{t+1}\rangle$ in the replay memory D;

s45: randomly selecting an experience from the replay memory D to train the weight parameter w of the main network, and updating the parameter θ=w of the target network according to the set update step;
s46: updating the greedy factor ε, i.e.

$$\varepsilon_{t+1}=\max(\varepsilon_t\cdot\varepsilon_{decay},\ \varepsilon_{min})$$

where $\varepsilon_t$ represents the greedy factor at time t, $\varepsilon_{t+1}$ the greedy factor at time t+1, $\varepsilon_{decay}$ the set decay factor, and $\varepsilon_{min}$ the set minimum greedy factor;
s47: repeating steps S42–S46 for each iteration until the iterations are complete;

s48: outputting the final main network weight w, the target network weight θ and the optimal strategy π(s,a).
Further, for the weight parameters of the main network, the agent updates by gradient descent, expressed as:

$$w\leftarrow w-\alpha\,[q(s,a;w)-y_t]\,\nabla q(s,a;w)$$

where

$$L_t(\omega)=[q(s,a;\omega)-y_t]^{2}$$

α represents the gradient update step size, $L_t(\omega)$ represents the loss function at time t, and $y_t$ represents the output of the target network at time t.
Further, the optimal strategy π(s,a) is determined, where $\pi(s,a)=\arg\max_{\pi\in\Pi}q_\pi(s,a)$, i.e. determining the optimal action-value function $q^{*}(s,a)$ is equivalent to finding the optimal strategy, with Π representing the set of all strategies; to determine the optimal action-value function $q^{*}(s,a)$, a deep network method is adopted to approximate the action-value function, i.e. the strategy corresponding to the maximum q(s,a;w) is found to obtain the optimal strategy.
Further, the determining of the user interference relationship comprises the steps of:
calculating, for any two users $v_i$ and $v_j$, the maximum distance $R_{(i,j)}$ at which interference can occur between them;

acquiring the actual distance $l_{(i,j)}$ between users $v_i$ and $v_j$;

determining the interference condition $Inter_{(i,j)}$ between users $v_i$ and $v_j$ according to $R_{(i,j)}$ and $l_{(i,j)}$, where $Inter_{(i,j)}=1$ indicates that $v_i$ interferes with $v_j$ and $Inter_{(i,j)}=0$ indicates no interference.
The beneficial effects of the invention are as follows:
1. Under the DT framework, the concept of an Interference Cell (IC) is proposed, with the interference relationships between devices analyzed from their communication distances and position information, and a hypergraph interference model is established for the large-scale overlapping interference scenario in DT-6G-IoT based on hypergraph theory, intuitively reflecting the interference relationships between users;
2. To reduce the difficulty of resource allocation in DT-6G-IoT, the hypergraph model is simplified into a dual hypergraph using hypergraph dual theory; the dual hypergraph is converted into a multi-graph by analyzing the user interference relationships within the hyperedges; and, to obtain a resource allocation model for overlapping interference cells (ICs), the multi-graph is simplified into a single-ring multi-graph and the conflict-free resource allocation problem is converted into a coloring problem on the single-ring multi-graph;
3. To realize conflict-free resource allocation in overlapping areas and improve resource utilization, the invention provides a conflict-free resource allocation algorithm based on a Double Deep Q Network (DDQN) and a conflict degree (CD) algorithm, namely the DDQN-CD algorithm; the algorithm learns the resource allocation process in DT-6G-IoT, dynamically adjusts the resource allocation method according to experience replay data, realizes conflict-free resource allocation, and maximizes the resource reuse rate and network throughput.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:
FIG. 1 is a DT-6G-IoT architecture in the present invention;
FIG. 2 is a communication interference model in the present invention;
FIG. 3 is an exemplary hypergraph model of the present invention;
FIGS. 4 (a) -4 (c) are, respectively, exemplary dual hypergraph models, multi-graph models, and single-loop multi-graph models of the present invention;
FIG. 5 is a DDQN-CD architecture of DT-6G-IoT in the present invention;
FIG. 6 is a graph of cumulative benefit versus iteration number for the DDQN-CD algorithm of the present invention;
FIG. 7 is a comparison of simulation results of the DDQN-CD algorithm and other algorithms on allocated resources in the present invention;
FIG. 8 is a graph comparing the maximum network throughput of the DDQN-CD algorithm with other algorithms in the present invention;
FIG. 9 is a graph comparing the minimum network throughput of the DDQN-CD algorithm with other algorithms in the present invention;
FIG. 10 is a graph comparing the average network throughput of the DDQN-CD algorithm with other algorithms in the present invention;
FIG. 11 is a graph comparing the maximum resource reuse rate of the DDQN-CD algorithm with other algorithms in the present invention.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.
The invention provides a network conflict-free resource allocation method based on a dual deep Q network and a conflict degree algorithm, which comprises the following steps:
step 1: acquiring the user interference relationships of the network environment to be allocated resources, constructing a single-ring multi-graph model, and constructing an optimization problem with maximization of the whole-network resource reuse rate ρ as the objective under the constraint of no resource conflict;
step 2: modeling the optimization problem as a Markov decision process over the single-ring multi-graph model, the Markov decision process comprising six elements: a state space, an action space, a state transition matrix, an immediate reward function, a policy and an action-value function;
step 3: constructing a resource allocation model comprising an environment module and a DDQN network module, where the environment module is used to simulate the user interference relationships in the network environment, and the DDQN network module comprises a main network, a target network, a replay memory D, an ε-greedy algorithm unit and a loss function calculation unit;
step 4: selecting the action corresponding to the maximum action-value function according to the DDQN algorithm and the conflict degree algorithm to obtain the optimal conflict-free resource allocation strategy.
In order to achieve efficient collision-free resource management in a large-scale ultra-dense network, the present invention proposes a 6G-IoT framework of DT-assisted ultra-dense Access Points (APs) and User Equipments (UEs), as shown in FIG. 1. The framework consists mainly of three layers: the 6G-IoT layer, an indoor Building Baseband Unit (BBU) pool, and the DT layer. At the 6G-IoT layer there are n UEs and multiple APs, and the communication radii of the terminals and APs overlap extensively. The terminals form an ad hoc network using D2D communication, with the APs providing auxiliary access in the network. Considering the possible differences in communication radius between different types of terminals, the scenario includes terminals with two different communication radii. Each terminal perceives its surrounding environment information and sends it through a 6G-IoT layer AP to the BBU pool. The BBU pool, consisting of BBUs and computation servers, collects all environment information in the network, constructs the DT layer, and virtually maps the node relationships of the whole network. The DT layer is a global model that interacts with the terminals and APs in real time, maintaining synchronization with the physical network. Since the global model contains information on the whole network, the DT can be used to assist resource management of the terminals to avoid interference. Compared with the APs, there are far more UEs in DT-6G-IoT, and overlapping interference occurs mainly between UEs. Thus, the present invention mainly considers large-scale overlapping interference between UEs under the D2D scheme.
In some embodiments, the determination of the user interference relationship comprises the steps of:
calculating, for any two users $v_i$ and $v_j$, the maximum distance $R_{(i,j)}$ at which interference can occur between them;

acquiring the actual distance $l_{(i,j)}$ between users $v_i$ and $v_j$;

determining the interference condition $Inter_{(i,j)}$ between users $v_i$ and $v_j$ according to $R_{(i,j)}$ and $l_{(i,j)}$, where $Inter_{(i,j)}=1$ indicates that $v_i$ interferes with $v_j$ and $Inter_{(i,j)}=0$ indicates no interference.
A typical interference scenario in DT-6G-IoT is illustrated in FIG. 2. Since $UE_1$ and $UE_2$ are within each other's communication radius, interference occurs when they transmit data at the same time. The power received at $UE_1$ from $UE_2$ can be calculated as:

$$P_{r(2,1)}=\frac{P_{t(2,1)}\,G_t\,G_r\,h_t^{2}\,h_r^{2}}{l_{(2,1)}^{\lambda}}$$

where $P_{t(2,1)}$ is the transmit power of $UE_2$, $G_r$ is the receive gain of $UE_1$, $G_t$ is the transmit gain of $UE_2$, $h_r$ is the antenna height of $UE_1$, $h_t$ is the antenna height of $UE_2$, λ is the attenuation factor, and $l_{(2,1)}$ is the distance between $UE_1$ and $UE_2$.
When $P_{r(2,1)}$ exceeds the interference threshold $S_{th}$ of $UE_1$, the communication between $UE_1$ and $UE_3$ will be interrupted. The interference radius $R_{(2,1)}$ of $UE_2$ represents the maximum distance between $UE_2$ and $UE_1$ at which interference can occur. Setting $P_{r(2,1)}=S_{th}$ gives

$$R_{(2,1)}=\left(\frac{P_{t(2,1)}\,G_t\,G_r\,h_t^{2}\,h_r^{2}}{S_{th}}\right)^{1/\lambda}$$
Comparison R (i,j) And l (i,j) ,UE i And UE (user equipment) j The interference relationship between can be expressed as:
wherein, inter (i,j) =1 denotes UE i Interfering UE j ;Inter (i,j) =0 indicates no interference.
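For illustration only (this sketch is not part of the claimed method), the pairwise interference test above can be coded directly from node positions. The received-power model is the one reconstructed above, and all function names and parameter values are assumptions:

```python
import numpy as np

def interference_matrix(pos, P_t, G_t, G_r, h_t, h_r, S_th, lam):
    """Sketch of the Inter matrix: Inter[i, j] = 1 if UE_i interferes with
    UE_j, i.e. their distance l_(i,j) is within the interference radius R.
    The power model P_r = P_t*G_t*G_r*h_t^2*h_r^2 / l^lam is an assumption."""
    n = len(pos)
    # Maximum interference distance R, obtained by setting P_r = S_th.
    R = (P_t * G_t * G_r * h_t**2 * h_r**2 / S_th) ** (1.0 / lam)
    inter = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j:
                l_ij = np.linalg.norm(pos[i] - pos[j])  # actual distance
                inter[i, j] = 1 if l_ij <= R else 0
    return inter

# Illustrative usage: 20 UEs dropped uniformly in a 100 m x 100 m area.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 100.0, size=(20, 2))
Inter = interference_matrix(pos, P_t=0.1, G_t=1.0, G_r=1.0,
                            h_t=1.5, h_r=1.5, S_th=1e-9, lam=4.0)
```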
In 6G-IoT, each user is interfered with by multiple users, and the interference relationships between users are complex. The hypergraph is a graphical tool widely applied to network topology modeling. Unlike an ordinary graph, in which one edge can only connect two vertices and which is therefore unsuitable for analyzing the interference relationships between users in 6G-IoT, a hyperedge may contain multiple vertices and can clearly describe the interference relationship among multiple users in the network. Thus, modeling the interference areas with hypergraph theory according to the interference relationships between users is an effective way to analyze large-scale overlapping interference in DT-6G-IoT.
The invention builds an interference model of DT-6G-IoT by utilizing hypergraph theory, and converts hypergraph interference into a single-loop multi-graph coloring problem.
In some embodiments, constructing the single-loop multi-graph model includes the steps of:
s11: modeling the interference areas in the network environment according to hypergraph theory, and constructing a hypergraph model expressed as $H=\{V,E\}$, where $V=\{v_1,v_2,\ldots,v_n\}$ is the vertex set, representing the set of users, and $E=\{e_1,e_2,\ldots,e_m\}$ is the hyperedge set, representing the set of ICs of the users. If the other users within the interference range of a given user receive interference, that area is defined as an IC (interference cell) centered on the user. In the hypergraph model, the relationship between vertices and hyperedges is represented by an incidence matrix $I_h$, whose columns correspond to the vertices and whose rows correspond to the hyperedges of H, where

$$I_h=\{a_{(i,j)}\},\quad i=1,2,\ldots,n,\ j=1,2,\ldots,m$$

It should be noted that there are two cases in DT-6G-IoT in which no interference exists, so they are not considered: 1) a vertex is not associated with any hyperedge, meaning it is an isolated vertex, i.e. the user has stopped working in DT-6G-IoT; 2) a hyperedge is associated with only one vertex, i.e. it contains only one vertex, meaning the user is not interfered with by other users; in this case an arbitrary resource can be allocated to that user.
Based on the relationship between vertices and hyperedges, hyperedges can be established as follows: create hyperedges centered on each device, each hyperedge comprising the devices it covers. Devices within a hyperedge are within communication range of each other and can interfere with each other. Not all hyperedges are retained in the hypergraph: if hyperedge $e_i$ contains hyperedge $e_j$, then only hyperedge $e_i$ is retained in the hypergraph model.
FIG. 3 is an exemplary hypergraph model, from which the incidence matrix $I_h$ can be obtained. Typically, each IC is considered a hyperedge, but not all hyperedges are retained. For example, in FIG. 3 user $v_1$ is interfered with by users $v_4$, $v_7$, $v_8$, $v_{10}$ and $v_{11}$; thus $IC_1$ is considered hyperedge $e_1$. Similarly, hyperedges $e_2,e_3,\ldots,e_{11}$ can be obtained; however, $e_4,e_5,\ldots,e_{11}$ are contained in $e_1$, $e_2$ and $e_3$, so there are no hyperedges $e_4,e_5,\ldots,e_{11}$ and only $e_1$, $e_2$ and $e_3$ are retained.
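Continuing the illustration (assumed helper names, building on the Inter matrix sketched above), the device-centered hyperedge construction and the containment pruning exemplified by $e_1$, $e_2$ and $e_3$ might look as follows:

```python
import numpy as np

def build_incidence(inter):
    """Build I_h for the hypergraph model: one candidate hyperedge (IC) per
    user, containing the user and everyone it interferes with; hyperedges
    contained in another hyperedge are dropped, as in the FIG. 3 example.
    Rows of I_h correspond to the kept hyperedges, columns to users."""
    n = inter.shape[0]
    candidates = []
    for i in range(n):
        members = {i} | {j for j in range(n) if inter[i, j] or inter[j, i]}
        if len(members) > 1:  # a single-vertex hyperedge means no interference
            candidates.append(frozenset(members))
    kept = [e for e in candidates if not any(e < f for f in candidates)]
    kept = list(dict.fromkeys(kept))  # deduplicate while keeping order
    I_h = np.zeros((len(kept), n), dtype=int)
    for row, edge in enumerate(kept):
        for v in edge:
            I_h[row, v] = 1
    return I_h
```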
As can be seen from FIG. 3, the users in an IC can be classified into overlapping and non-overlapping users. Non-overlapping users belong to only one IC, i.e. their vertices are associated with one hyperedge. Overlapping users are located in multiple different ICs, i.e. their vertices are associated with multiple hyperedges. Compared with non-overlapping users, the interference relationships of overlapping users are extremely complex, making resource allocation for them more challenging; the present application therefore mainly studies this case. For non-overlapping users, the remaining resources can be allocated with a conventional resource allocation algorithm after the overlapping users have been allocated.
Although the hypergraph model can represent the interference between users, the relationships between hyperedges and vertices are complex, which is inconvenient for interference analysis and resource allocation. For example, as the number of users grows and the interference situation becomes more complicated, non-overlapping users on a single-hyperedge vertex are difficult to distinguish clearly from other users.
To solve this problem and simplify the hypergraph model of DT-6G-IoT with large-scale overlapping coverage, the present invention converts the hypergraph model into a dual hypergraph model according to dual theory in S12.
In the dual hypergraph, a ring on a vertex represents a non-overlapping user, and an edge connecting two vertices represents an overlapping user. In this way, non-overlapping and overlapping users can be accurately distinguished. Notably, in the dual hypergraph the vertices represent the interference cells (ICs), while the interference relationships between users remain unchanged.
In the dual hypergraph, $E=\{e_1,e_2,\ldots,e_m\}$ represents the vertex set, i.e. the IC set, and $V=\{v_1,v_2,\ldots,v_n\}$ represents the edge set, i.e. the user set. The dual hypergraph is associated with an incidence matrix $I_{dh}$.
Likewise, taking the hypergraph model shown in FIG. 3 as an example, the dual incidence matrix can be obtained from equation (5). In equation (7), an entry of 1 indicates an overlapping user (i.e. an edge associated with at least two vertices in the dual hypergraph model). From equation (7), the corresponding dual hypergraph model can be obtained, as shown in FIG. 4(a).
In the dual hypergraph model there are still users associated with multiple hyperedges; for example, user $v_8$ is interfered with by multiple Interfering Users (IUs), as in FIG. 4(a). It is difficult to allocate resources to user $v_8$ while avoiding interference from multiple IUs. To simplify the analysis of users with multiple IUs, the dual hypergraph is converted into a multi-graph, in which each edge is associated with exactly two vertices and multiple edges may exist between two vertices. If a hyperedge is associated with multiple vertices, it can be split into multiple edges in the multi-graph, each associated with two vertices. For example, in FIG. 4(a) the hyperedge $v_8$ connects vertices $e_1$, $e_2$ and $e_3$; it can be divided into three different edges connecting $e_1$, $e_2$ and $e_3$ pairwise. The multi-graph model uses the same symbols as the dual hypergraph.
Thus, in S13 the dual hypergraph model is converted into a multi-graph model through matrix transformation: the incidence matrix $I_{dh}$ of the dual hypergraph is transformed into the adjacency matrix $A_{multi}$ of the multi-graph model. The method is as follows:
(1) calculating the number of rings on each vertex of the dual hypergraph, comprising:

summing the columns of $I_{dh}$ and setting every element other than 1 to 0 to obtain a sequence R*, the number of elements in R* being equal to the number of edges in the dual hypergraph;

using the one-dimensional column vector R* to record whether an edge is a ring, where R*(n)=1 indicates that edge $v_n$ is a ring and R*(i) denotes the i-th element of R*;

performing a logical AND of $I_{dh}$ with R* and summing the resulting elements to obtain the number of rings on each vertex of the multi-graph;

(2) calculating the edges between different vertices of the dual hypergraph, comprising:

performing a logical AND on the elements of any two rows of $I_{dh}$ and summing the result to obtain the number of edges between the two vertices of the multi-graph corresponding to those two rows;

(3) taking the number of rings of each vertex of the multi-graph as the diagonal elements of $A_{multi}$, and the number of edges between different vertices as the off-diagonal elements of $A_{multi}$.
The above process is formulated as:

$$\operatorname{diag}(A_{multi})=I_{dh}\cdot R^{*},\quad(8)$$

where "·" represents the logical AND operation.
According to equations (7) and (8), the adjacency matrix $A_{multi}$ of the multi-graph model can be obtained, in which each element represents the number of edges between two vertices: the diagonal elements represent the number of users in each IC, and the off-diagonal elements represent the number of overlapping users.
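A minimal sketch of steps (1)–(3) and equation (8), assuming $I_{dh}$ is a 0/1 NumPy array whose rows correspond to vertices (ICs) and whose columns correspond to edges (users):

```python
import numpy as np

def multi_graph_adjacency(I_dh):
    """I_dh -> A_multi. Column sums of 1 mark ring edges (vector R*);
    diag(A_multi) = I_dh . R* gives the ring count per vertex, and the
    AND-and-sum of two rows gives the edge count between those vertices."""
    m = I_dh.shape[0]
    R_star = (I_dh.sum(axis=0) == 1).astype(int)  # R*(j)=1 iff edge j is a ring
    A = np.zeros((m, m), dtype=int)
    for i in range(m):
        A[i, i] = int((I_dh[i] & R_star).sum())   # rings on vertex i
        for j in range(i + 1, m):
            A[i, j] = A[j, i] = int((I_dh[i] & I_dh[j]).sum())
    return A
```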
According to $A_{multi}$, the multi-graph model of FIG. 4(b), transformed from FIG. 4(a), can be obtained. In the multi-graph model, the rings represent the non-overlapping users of the corresponding ICs. Since a non-overlapping user only needs to consider the interference of one IU, while an overlapping user must consider the interference of multiple IUs, the resource allocation policies for the two kinds of users are different.
For a non-overlapping user, such as $v_5$ in FIG. 4(b), a conventional resource allocation strategy that does not take overlapping interference into account suffices, since such a user only needs to avoid using the same resources as one IC. Attention can therefore be focused on the resource allocation of the overlapping users, which are separated out of $A_{multi}$. $A_{multi}$ can be expressed as:

$$A_{multi}=A_{mu}+A_{su},\quad(10)$$

where $A_{mu}$ represents the number of overlapping users between ICs and is obtained by setting the diagonal elements of $A_{multi}$ to 1 while keeping the other values unchanged, and $A_{su}$ represents the number of non-overlapping users in each IC, obtained by $A_{su}=A_{multi}-A_{mu}$.
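The decomposition of equation (10) is mechanical; a sketch, assuming $A_{multi}$ is a NumPy integer matrix:

```python
import numpy as np

def split_overlap(A_multi):
    """A_multi = A_mu + A_su: A_mu keeps the off-diagonal overlap counts
    with its diagonal set to 1 (the single ring per vertex), and
    A_su = A_multi - A_mu holds the remaining non-overlapping users."""
    A_mu = A_multi.copy()
    np.fill_diagonal(A_mu, 1)
    return A_mu, A_multi - A_mu
```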
According to formula (9), it is possible to obtain:
thus, at S14: according to A mu Obtaining a single-ring multi-graph model, namely converting the multi-graph model into a single-ring multi-graph model, as shown in fig. 4 (c). Next, in the single-ring multi-graph model, the resource allocation problem of the collision-free overlapping users is converted into an edge coloring problem. In the edge coloring problem, different colors represent different resources, and colors can be reused for multiple users.
The single-ring multi-graph model corresponds to an incidence matrix M whose columns correspond to the users in the single-ring multi-graph and whose rows correspond to the hyperedges. The incidence matrix M can be used to calculate the interference strength between users, and this calculation contributes to the convergence of the later stages of the DRL.
In the incidence matrix M of FIG. 4(c), the rows represent the vertices and the columns represent the edges. Users $v_5$, $v_9$, $v_{10}$ and $v_{11}$ belong to the non-overlapping group in FIG. 4(b) and are therefore not considered in FIG. 4(c).
To evaluate the interference strength of DT-6G-IoT, two types of collisions constituting DT-6G-IoT interference are defined and a method of interference strength calculation based on a single loop multi-graph model is presented. Two types of conflicts include: 1) Different users connected to the same vertex are allocated the same resources; 2) Different resources are repeatedly allocated to the same user.
The invention uses the conflict incidence matrix $M_{Conflict}$ (i.e. the incidence matrix M) to represent the set of conflicts between user $v_i$ and other users. The degree of conflict of user $v_i$ using resource k is defined as $D_{Conflict}(v_i,k)$, where, in FIG. 4(c), $e(v_i)$ is an edge set such that $v_i$ and all edges in $e(v_i)$ are associated with the same vertex.
In DT-6G-IoT, in order to meet the conflict-free resource allocation requirement, the resources of all users must satisfy $D_{Conflict}(v_i,k)=0$. Thus, the conflict-free resource allocation requirement (i.e. the constraint) is:

$$D_{Conflict}(v_i,k)=0,\quad\forall v_i\in V,\ \forall k\in K,\quad(16)$$

where K represents the set of network communication resources.
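Because the closed form of $D_{Conflict}$ appears only as an image in the original text, the sketch below checks the constraint from its prose description: a user's allocation conflicts when an edge sharing one of its vertices uses the same resource. Encoding the allocation as one resource per user rules out the second collision type (multiple resources on one user) by construction; all names here are assumptions:

```python
import numpy as np

def conflict_degree(M, alloc, i):
    """Degree of conflict of user (edge) v_i under `alloc`, where
    alloc[j] is the resource index of user j or None if unallocated.
    M: incidence matrix, rows <-> vertices, columns <-> users."""
    if alloc[i] is None:
        return 0
    verts = np.nonzero(M[:, i])[0]  # vertices that edge v_i touches
    e_vi = {j for v in verts for j in np.nonzero(M[v])[0] if j != i}
    return sum(1 for j in e_vi if alloc[j] == alloc[i])

def is_conflict_free(M, alloc):
    """Constraint (16): D_Conflict(v_i, k) = 0 for every user v_i."""
    return all(conflict_degree(M, alloc, i) == 0 for i in range(M.shape[1]))
```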
A certain relationship exists between the total number of users φ and the number of communication resources $k_c$, which can be used to evaluate the resource reuse situation, and the whole-network resource reuse rate ρ is defined on this basis in equation (17). Under requirement (16), ρ should be made as large as possible; a sufficient condition for the conflict-free resource allocation strategy is to maximize the sum of the resource allocation influence values over all users, as formulated in equation (18). The goal of the resource allocation strategy is thus to maximize the overall resource allocation of DT-6G-IoT under the constraint of no resource conflict, with constraint $C_1$ indicating that the resource allocation result for the entire network is collision-free.
The resource management problem of overlapping regions in DT-6G-IoT can be seen as a single-ring multi-graph coloring problem in which multiple edges may exist between a pair of vertices. On this basis, the invention provides a conflict-free resource allocation strategy based on a double deep Q network and a conflict degree algorithm (the DDQN-CD algorithm) to improve the resource reuse rate.
The DT-6G-IoT network is assumed to be stable, that is, the total number of users tends to be constant. The evaluation of communication resource allocation then depends on three key factors: the current network state M, the total number of users φ, and the currently allocated communication resources K. Finding a collision-free resource allocation strategy is an optimal decision problem. To this end, the invention proposes the DDQN-CD algorithm to sense the DT-6G-IoT communication resource allocation process and designs a communication resource allocation strategy that maximizes the DT-6G-IoT collision-free resource reuse rate.
Based on the conflict-free resource allocation policy in DT-6G-IoT, a finite-length Markov chain model can be constructed. In the model, state $s_t$ corresponds to the communication resource allocation situation at time t, and the initial state $s_0$ indicates the state with no resource allocation, i.e. no communication resource has been allocated to any user in FIG. 4(c). The Markov state time series of length M is $s(t_0),s(t_1),\ldots,s(t_i),\ldots,s(t_{M-1})$, $s(t_i)\in S$, where S is a finite set $\{s_1,\ldots,s_N\}$, i.e. $|S|=N$. Therefore, the state transition probability $P(s(t_{i+1})\mid s(t_i))$ can be defined as:

$$P(s(t_{i+1})\mid s(t_i))=\Pr(s(t_{i+1})\mid s(t_i),s(t_{i-1}),\ldots,s(t_0)).\quad(19)$$
According to equation (13), the incidence matrix M does not contain the resource allocation status, so M alone cannot be defined as the Markov state of DT-6G-IoT for designing resource allocation policies. The present invention therefore proposes an expanded incidence matrix to define the Markov state. At time $t_i$, the expanded incidence matrix $M_a(t_i)$ is composed of $M(t_i)$, the incidence matrix M corresponding to the single-ring multi-graph model at time $t_i$, and $E_k(t_i)$, the communication resources allocated to each user in the single-ring multi-graph model at time $t_i$, in which an element of 0 represents an unallocated resource. The expanded incidence matrix thus contains both the allocated communication resource situation and the relationships between different users.
The expanded incidence matrix $M_a(t_i)$ corresponding to FIG. 4(c) can be expressed accordingly.
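The exact layout of $M_a(t_i)$ is likewise shown only as an image; a plausible sketch, assuming $E_k$ is kept as one extra row of per-user resource indices (0 for unallocated) stacked under M:

```python
import numpy as np

def expanded_matrix(M, E_k):
    """Markov state M_a(t_i): the incidence matrix M augmented with the
    per-user allocation row E_k (row-stacking is an assumption here)."""
    return np.vstack([M, E_k])

# Initial state s_0: no communication resource allocated to any user.
# Shape (3, 7) mirrors the FIG. 4(c) example: 3 ICs x 7 overlapping users.
M = np.zeros((3, 7), dtype=int)  # placeholder for the FIG. 4(c) incidence matrix
s0 = expanded_matrix(M, np.zeros((1, M.shape[1]), dtype=int))
```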
the DT-6G-IoT conflict-free resource allocation problem is a combinatorial optimization problem, i.e., finding optimal statesAccording to formula (18), status->Representing resource allocation for all users in DT-6G-IoT, with the premise of ensuring that the resources do not conflict (i.e., Σ i D Conflict (v i K) =0) maximizes the resource multiplexing rate ρ. Therefore, the present invention still further proposes constructing the state transition matrix P to obtain the optimal state +.>The state transition matrix may be defined as:
wherein the element P in the state transition matrix P ss' Representing the probability of a state transition from state s to s', satisfies p ss' =p(s'|s),∑ j p(s j |s)=1,
Building a Markov Decision Process (MDP) framework is the main task in implementing reinforcement learning. The MDP consists of six basic elements: the action-value function q(s,a) ($s\in S$, $a\in A$), the action space A, the state space S, the state transition matrix P, the policy π, and the immediate reward function r(s,a).
In some embodiments, step 2 comprises the steps of:
s21: determining the action space A, where for any time t, $a_t\in A$; with assigned target user $v_i$ and allocated resource k, the action $a_t$ is defined as:

$$a_t=\{v_i,k\},\quad k\in K,\quad(23)$$

where K represents the resource set;

s22: determining the state space S, where state $s_t$ corresponds to the communication resource allocation situation of the users in the single-ring multi-graph model at time t, the initial state $s_0$ represents the state without any resource allocation, and the Markov state is defined using the expanded incidence matrix;
s23: determining the immediate reward function $r_t$, an immediate reward function based on the degree of conflict. The immediate reward r(t) is the environment reward obtained when action a(t) is selected in state s(t); the immediate reward implicitly defines the objective of the optimal decision problem, i.e. the reward function defines the forward direction of MDP learning. At time t, the reward can be expressed as $r_t=r(s_t,a_t)$.

According to equation (18), under the constraints the agent selects the action that maximizes the immediate return. To obtain the maximum immediate return, the agent selects the action that maximizes the action-value function, i.e. ρ is maximal when $D_{Conflict}=0$. The randomness of exploration in reinforcement learning can lead the agent to perform illegal resource allocation operations. To address this problem, when considering user resource allocation in DT-6G-IoT, action rewards are classified into positive rewards (positive feedback) and negative rewards (negative feedback): if the performed operation causes interference between users, the immediate reward $r_t$ is negative feedback, and when DT-6G-IoT allocates communication resources satisfying $D_{Conflict}(e,k)=0$, the user obtains a positive feedback reward according to the resource reuse rate ρ. Thus, the immediate reward function $r_t$ based on the degree of conflict is expressed in terms of $D_{Conflict}(t)$, the degree of conflict at time t, and $r(s_t,a_t)$, the environment reward for selecting action $a_t$ in state $s_t$.
s24: determining the state transition matrix P. By applying action a to state s, the MDP system completes a state transition from the current state s to the next state s'. The transition probability can be expressed as p(s,a,s'), where action a and states s, s' satisfy $0\le p(s,a,s')\le 1$ and $\sum_{s'\in S}p(s,a,s')=1$.

According to the Markov property, the selection $a_t$ of the current action is independent of previous actions and history. The transition probability is defined as $p(s_{t+1}\mid s_t,a_t,s_{t-1},a_{t-1},\ldots)=p(s_{t+1}\mid s_t,a_t)=p(s_t,a_t,s_{t+1})$, representing, for arbitrary time t, the probability that state $s_t$ transitions to the next state $s_{t+1}$ under action $a_t$; for a policy π,

$$P^{\pi}_{ss'}=\sum_{a\in A}\pi(a\mid s)\,p(s,a,s')$$

In the MDP, $P^{\pi}_{ss'}$ is equivalent to the transition probability of the corresponding Markov chain and represents the probability that policy π transitions state $s_t$ to the next state $s_{t+1}$.

Given a Markov decision process (S, A, P, r, γ), for each state $s\in S$ and action $a\in A$, the policy π is the rule the agent uses to decide which action to perform next. For each state $s\in S$, if the policy satisfies $\pi(s,a)>0$ and $\sum_{a\in A}\pi(s,a)=1$, it is defined as a random policy. $\pi(a\mid s)=p(a_t=a\mid s_t=s)$, and $a\sim\pi(\cdot\mid s)$, where "~" denotes sampling from the probability distribution.

S25: determining the action-value function $q_\pi(s,a)$ under policy π, where $q_\pi(s,a)$ represents the expected discounted return obtained after applying action a to state s under policy π, formulated as:

$$q_\pi(s,a)=\mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k+1}\,\middle|\,s_t=s,a_t=a\right]$$

where $\mathbb{E}[\cdot]$ represents the expectation operator and γ represents the discount factor.
Since the optimal policy π leads the agent to the optimal action-value function q(s,a), the optimal policy can be solved implicitly by obtaining the optimal value function. Thus the optimal Bellman equation can be obtained:

$$q^{*}(s,a)=\mathbb{E}\!\left[r_{t+1}+\gamma\max_{a'}q^{*}(s_{t+1},a')\,\middle|\,s_t=s,a_t=a\right]$$

where q(s,a) represents the optimal expected long-term benefit of each state-action pair, and the optimal policy is $\pi(s,a)=\arg\max_{\pi\in\Pi}q_\pi(s,a)$. Thus, obtaining the optimal action-value function $q^{*}(s,a)$ is equivalent to finding the optimal policy π.
For DT-6G-IoT it is difficult to solve the optimal action-value function q(s,a) directly by exhaustive search, and the state transition probability p(s'|s,a) is generally unknown. The invention therefore introduces off-policy TD control, i.e. Q-learning, to obtain the optimal action-value function Q(s,a). The basic idea is to estimate the action-value function from the reward and the maximum Q value; the update rule is an upgrade of TD learning that replaces the value of the next state with the action-value function under a built-in MAX operator, with the action values stored in an |S|×|A| Q table:

$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\left[r_{t+1}+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\right]$$

where α is the learning rate. The ε-greedy algorithm is the most basic exploration algorithm: the probability of choosing the currently best action is 1−ε, and the probability of randomly choosing another action is ε. In general, the algorithm tends to select the action with the maximum Q value under the current environment conditions; to obtain more return, however, the agent needs to try different actions to explore potentially better actions.
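A minimal sketch of the ε-greedy rule described above; the function names and decay form are assumptions that match step S46 later:

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Explore a random action with probability eps, otherwise exploit
    the action with the largest Q value."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decay_epsilon(eps, eps_decay=0.995, eps_min=0.01):
    """Multiplicative decay with a floor (rates are placeholder values)."""
    return max(eps * eps_decay, eps_min)
```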
When the size of the MDP action space is too large, the agent cannot efficiently and quickly traverse all state-action pairs, let alone find the optimal action policy. Therefore, to handle a large-scale or even infinite action-state space, a deep network method is adopted to approximate the action-value function q(s,a), defined as:

$$q(s,a;\omega)\approx q^{*}(s,a).\quad(30)$$
To eliminate the uncertainty caused by the growth, even to infinity, of the numbers of states and actions in the external environment, a deep Q-network (DQN) with parameter ω is used to represent the original action-value function and perform function fitting. When solving larger state-action problems with DRL, only the weights of the neural network then need to be stored. From the experience $e_t=\langle s_t,a_t,r_t,s_{t+1}\rangle$, a prediction error (loss function) L(ω) is proposed to update the ω of the action-value function q(s,a;ω); L(ω) may be defined as

$$L(\omega)=[q(s,a;\omega)-y]^{2},\quad(32)$$

where y is the target output of the DQN, and the prediction error in equation (32) is minimized. Upon obtaining a new experience $e_t$, the agent updates ω using gradient descent. The process of updating the weight w of the Q-network is:

$$w\leftarrow w-\alpha\,[q(s,a;w)-y]\,\nabla q(s,a;w).\quad(34)$$
In DQN, the value function update contains a maximization operation: DT-6G-IoT selects the action a with the largest resource reuse rate ρ for user resource allocation, i.e. the action that maximizes q(s,a;ω). This generally makes the estimated value function larger than the true value function, and the error grows with the number of actions. To solve this overestimation problem of the value function, following the definition of DDQN-CD, different value functions are used for action selection and action evaluation:

the action is selected by the main network q(s,a;ω) and its value is evaluated by the target network q(s,a;θ). Because DDQN uses different value functions for selection and evaluation, the selected action's Q value is not necessarily the maximum in the target network, which avoids overestimating and selecting sub-optimal actions.
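A sketch of the double-Q target just described, with q_main and q_target assumed to be callables mapping a state to a vector of Q values:

```python
import numpy as np

def ddqn_target(r, s_next, q_main, q_target, gamma=0.9, done=False):
    """Double-DQN target: the main network q(s,a;w) selects the action,
    the target network q(s,a;theta) evaluates it, curbing the
    overestimation caused by a single max operator."""
    if done:
        return r
    a_star = int(np.argmax(q_main(s_next)))             # selection by main network
    return r + gamma * float(q_target(s_next)[a_star])  # evaluation by target
```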
As shown in FIG. 5, the invention proposes a DDQN-CD architecture oriented to DT-6G-IoT, including the flow of overlapping-area user resource allocation and of agent replay-memory training. The MDP architecture of DT-6G-IoT consists of six functional modules: 1) the single-ring multi-graph model of DT-6G-IoT (the environment); 2) the main network; 3) the target network; 4) the replay memory D; 5) the ε-greedy algorithm unit; 6) the loss function L unit.
The input of the DDQN-CD-based collision-free resource allocation has six parameters: 1) the capacity $D_{max}$ of the replay memory; 2) the decay rate $\varepsilon_{decay}$ of the greedy factor ε; 3) the minimum value $\varepsilon_{min}$ of ε; 4) the update frequency of the target network; 5) the expanded incidence matrix $M_a$ corresponding to the single-ring multi-graph model in DT-6G-IoT; 6) the initial value $\varepsilon_0$ of the greedy factor. The output of the DDQN-CD algorithm includes: 1) the weights of the Q function (i.e. the main network); 2) the weights of the target Q function (i.e. the target network); 3) the optimal allocation of user resources in the overlapping area.
Thus, as can be seen from the above description, the collision-free resource allocation based on the DDQN-CD algorithm proposed by the present invention may include the following steps:
s41: initializing the DDQN network and its parameters, including initializing the main network weight w and the target network weight θ, and selecting the initial action $a_0$ and initial state $s_0$;

S42: based on the probability threshold ε, selecting a random action $a_t$ in the action space A with probability ε, and otherwise selecting the action that maximizes the action-value function;

s43: calculating the immediate reward $r_t$ according to the selected action $a_t$ and state $s_t$, and obtaining the expanded incidence matrix $M_a(t+1)$ at the next time t+1;

S44: using the expanded incidence matrix $M_a(t+1)$ at time t+1 to obtain the state at time t+1, i.e. $s_{t+1}=M_a(t+1)$, and storing the current experience $e_t=\langle s_t,a_t,r_t,s_{t+1}\rangle$ in the replay memory D;

s45: randomly selecting an experience from the replay memory D to train the weight parameter w of the main network, and updating the parameter θ=w of the target network according to the set update step;
s46: updating the greedy factor ε, i.e.

$$\varepsilon_{t+1}=\max(\varepsilon_t\cdot\varepsilon_{decay},\ \varepsilon_{min})$$

where $\varepsilon_t$ represents the greedy factor at time t, $\varepsilon_{t+1}$ the greedy factor at time t+1, $\varepsilon_{decay}$ the decay rate of the set greedy factor ε, and $\varepsilon_{min}$ the set minimum greedy factor;
s47: repeating steps S42–S46 for each iteration until the iterations are complete;

s48: outputting the final main network weight w, the target network weight θ and the optimal strategy π(s,a).
The DDQN-CD obtains experiences by random sampling from the replay memory D. The weight parameter w of the Q function (i.e. the main network) is then updated by gradient training, and the parameter θ of the target Q function (i.e. the target network) is updated once per the set update frequency.
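Pulling steps S41–S48 together, a training-loop skeleton is sketched below. It reuses the epsilon_greedy, decay_epsilon and ddqn_target helpers from the earlier sketches; env is assumed to expose reset()/step() returning (next state, reward, done), and grad_step/sync_target stand in for the gradient update of w and the copy θ←w. All hyperparameter values are placeholders:

```python
from collections import deque
import random
import numpy as np

def train_ddqn_cd(env, q_main, q_target, grad_step, sync_target,
                  episodes=1000, D_max=10000, eps0=1.0,
                  eps_decay=0.995, eps_min=0.01, sync_every=100, gamma=0.9):
    """Skeleton of the DDQN-CD loop (S41-S48) under the stated assumptions."""
    rng = np.random.default_rng(0)
    D = deque(maxlen=D_max)                          # replay memory D  (S41)
    eps, step = eps0, 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(q_main(s), eps, rng)  # S42
            s_next, r, done = env.step(a)            # S43: conflict-degree reward
            D.append((s, a, r, s_next, done))        # S44: store experience e_t
            s0, a0, r0, s1, d0 = random.choice(D)    # S45: sample an experience
            y = ddqn_target(r0, s1, q_main, q_target, gamma, d0)
            grad_step(s0, a0, y)                     # gradient descent on w
            step += 1
            if step % sync_every == 0:
                sync_target()                        # theta <- w
            eps = decay_epsilon(eps, eps_decay, eps_min)  # S46
            s = s_next                               # S47: next iteration
    return q_main, q_target
```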
The computational complexity and performance of the DDQN-CD algorithm are described below.
Since the computational complexity of a DQN comes from the matrix operations of its fully connected layers, the complexity of a DQN with J fully connected layers can be described as $O\left(\sum_{j=1}^{J}K_{j-1}K_j\right)$, where $K_j$ denotes the number of neurons in the j-th layer. The DDQN-CD framework is a combination of Q-learning and two structurally identical Deep Neural Networks (DNNs). Since the Q-network only propagates forward during the training step, the overall computational complexity of the DDQN-CD algorithm is $O\left(T+\sum_{j=1}^{J}K_{j-1}K_j\right)$, where O(T) is the complexity of the ε-greedy strategy.
The present invention was implemented and extensively simulated on a general-purpose server equipped with an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz (20 CPUs), 64GB of memory and an NVIDIA GeForce RTX 3080Ti GPU to evaluate the performance of the proposed algorithm. A summary of all key simulation parameters of the environment model and the DRL model is presented in Table 1.
TABLE 1 parameter settings
Performance evaluation indexes of the DDQN-CD algorithm include cumulative benefits, network throughput, and resource reuse rate.
1) Cumulative benefit: the convergence performance of the DDQN-CD algorithm is represented by the cumulative benefit U; the greater the cumulative benefit, the better the estimated performance of the DDQN-CD model.
2) Network throughput: a classical evaluation metric [34] that calculates the network throughput of individual users, denoted ω.
3) Conflict-free resource reuse rate: the resource reuse rate of the conflict-free resource allocation is evaluated using the conflict-free resource reuse rate ρ defined in equation (17).
The simulation results are described below.
The convergence of the DDQN-CD algorithm is verified through simulation experiments by measuring the trend of the cumulative benefit U.
FIG. 6 is a simulation graph of cumulative benefit versus iteration number for the DDQN-CD algorithm of the present invention, where the curve represents the averaged data. The cumulative benefit U of the DDQN-CD algorithm varies with the number of training iterations. Before 1000 iterations, the cumulative benefit U increases and decreases significantly, i.e. the DDQN-CD algorithm is exploring the collision-free resource allocation strategy, and the average benefit curve is rising. After 1000 iterations, the cumulative benefit U mostly fluctuates between 500 and 600, i.e. the DDQN-CD algorithm has converged and the average benefit curve is stable.
FIG. 7 shows the simulation comparison of the DDQN-CD algorithm (FIG. 7(a)) with the Random Matching (RM) algorithm (FIG. 7(b)), the Greedy Matching (GM) algorithm (FIG. 7(c)) and the Maximum Node Degree (MND) algorithm (FIG. 7(d)) on allocated resources. The communication resource used in the simulation experiment is the Frequency Resource (FR). In the DDQN-CD algorithm, FRs 19 and 20 are reused twice and FRs 6, 10, 16 and 17 are reused once, for a total of 12 different FRs. In the RM algorithm, FRs 11 and 13 are reused once, FRs 1 and 14 are not allocated, and a total of 18 different FRs are used. In the GM algorithm, FR 1 is reused twice while FRs 5, 10 and 11 are not assigned to any user, for a total of 17 different FRs. In the MND algorithm, FR 2 is reused twice, FR 4 is reused once, FRs 9, 10, 14, 19 and 20 are not allocated to any user, and a total of 15 different FRs are used. According to equation (37), the network throughput of the DDQN-CD, RM, GM and MND algorithms reaches 80.6 Mb/s, 53.7 Mb/s, 58.2 Mb/s and 67.1 Mb/s, respectively.
FIG. 8 shows the maximum network throughput of DT-6G-IoT under the different algorithms. The maximum network throughput is the best networking performance of DT-6G-IoT without interference and is typically used to design the upper bound of the network. In FIG. 8, the simulation results for different numbers of users are shown, from left to right, for the DDQN-CD, RM, GM and MND algorithms. As shown in FIG. 8, with 20 DT-6G-IoT users the DDQN-CD algorithm achieves a maximum network throughput above 80 Mb/s, while the network throughput of the other algorithms is below 80 Mb/s. With 45 DT-6G-IoT users, the maximum network throughput achieved by the DDQN-CD algorithm reaches 200 Mb/s or more, the MND algorithm achieves between 180 Mb/s and 200 Mb/s, and the other algorithms lie between 140 Mb/s and 160 Mb/s. Overall, the DDQN-CD algorithm has a great advantage in the network throughput and communication capability of DT-6G-IoT.
Fig. 9 gives the minimum network throughput of DT-6G-IoT under the different algorithms. The minimum network throughput reflects the minimum basic communication performance when DT-6G-IoT is severely interfered with, i.e., the lower bound of network performance. In Fig. 9, the simulation results for each number of users are shown from left to right in the order DDQN-CD, RM, GM, and MND. As shown in Fig. 9, with 20 DT-6G-IoT users the DDQN-CD algorithm achieves a minimum network throughput between 60 Mb/s and 80 Mb/s, while the other algorithms lie between 40 Mb/s and 60 Mb/s. With 45 DT-6G-IoT users, the minimum network throughput achieved by the DDQN-CD algorithm is between 180 Mb/s and 200 Mb/s; the minimum network throughput of the MND, GM, and RM algorithms is between 160 Mb/s and 180 Mb/s, above 140 Mb/s, and between 120 Mb/s and 140 Mb/s, respectively. In summary, as network density increases, the DDQN-CD algorithm has a significant advantage over the other algorithms in improving the minimum throughput of the network.
Fig. 10 shows the average network throughput of DT-6G-IoT under the different algorithms. The average network throughput is typically used to evaluate network performance in actual interference scenarios and effectively reflects the communication capability of DT-6G-IoT in practical applications. In Fig. 10, the simulation results for each number of users are shown from left to right in the order DDQN-CD, RM, GM, and MND. As shown in Fig. 10, with 20 DT-6G-IoT users the DDQN-CD, GM, and MND algorithms all achieve an average network throughput above 60 Mb/s, while the average network throughput of the RM algorithm is between 40 Mb/s and 60 Mb/s. The DDQN-CD algorithm yields the largest increase in the average throughput of DT-6G-IoT, followed by the MND algorithm; overall, the average network throughput of the RM and GM algorithms is similar. With 45 DT-6G-IoT users, the average network throughput of the DDQN-CD and MND algorithms is between 180 Mb/s and 200 Mb/s, with the DDQN-CD algorithm the higher of the two. Compared with the minimum network throughput shown in Fig. 9, the DDQN-CD algorithm is more stable than the other algorithms, with small network throughput fluctuations.
Fig. 11 shows the maximum resource reuse rate of DT-6G-IoT under the different algorithms. In Fig. 11, the four simulation results from top to bottom represent the DDQN-CD, MND, RM, and GM algorithms, respectively. As shown in Fig. 11, the maximum resource reuse rate of the DDQN-CD algorithm is the highest; when the number of users exceeds 25, it fluctuates between 0.5 and 0.6. Compared with the DDQN-CD and MND algorithms, the maximum resource reuse rates of the RM and GM algorithms are lower, and when the number of users exceeds 35, they begin to decrease rapidly.
The invention studies the resource management problem of large-scale ultra-dense DT-6G-IoT with large-scale overlapping interference. First, a DT-assisted 6G-IoT resource allocation model is proposed, and the interference relationship between users is represented based on hypergraphs. Then, to avoid interference between users in overlapping areas, the hypergraph is converted into a single-ring multi-graph, and the conflict-free resource allocation problem is converted into a multi-graph coloring problem. Finally, the DDQN-CD algorithm is designed, realizing conflict-free allocation and resource reuse rate maximization. Experimental results show that the algorithm achieves a higher network resource reuse rate and throughput and can provide better performance in DT-6G-IoT.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the present invention, all of which are intended to be covered by the claims of the present invention.

Claims (10)

1. A network conflict-free resource allocation method based on a dual deep Q network and a conflict degree algorithm is characterized by comprising the following steps:
Step 1: acquiring the user interference relation of the network environment to be subjected to resource allocation, constructing a single-ring multi-graph model, and constructing an optimization problem that maximizes the whole-network resource reuse rate ρ under the constraint of no resource conflict;
Step 2: modeling the optimization problem as a Markov decision process for the single-ring multi-graph model, wherein the Markov decision process comprises six elements: a state space, an action space, a state transition matrix, an instant reward function, a policy, and an action value function;
Step 3: constructing a resource allocation model, wherein the resource allocation model comprises an environment module and a DDQN network module, the environment module being used for simulating the user interference relationship in the network environment; the DDQN network module comprises a main network, a target network, a replay memory D, an ε-greedy algorithm unit, and a loss function calculation unit;
Step 4: selecting the action corresponding to the maximum action value function according to the DDQN algorithm and the conflict degree algorithm to obtain the optimal conflict-free resource allocation strategy (an illustrative outline follows this claim).
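Read end to end, claim 1 is a four-stage pipeline. The outline below strings the stages together, assuming the helpers sketched under the later claims (build_A_multi, split_single_ring, train_ddqn); it is an illustrative composition under those assumptions, not the patent's code.

    import numpy as np

    # Illustrative composition of claim 1's four steps.
    def allocate_resources(I_dh: np.ndarray, env, n_states: int, n_actions: int):
        # Step 1: single-ring multi-graph model from the interference relation
        A_multi = build_A_multi(I_dh)             # claim 3 sketch
        A_mu, A_su = split_single_ring(A_multi)   # claim 2 sketch
        # (A_mu / A_su would parameterize the environment's interference model)
        # Steps 2-3: the env object wraps the MDP (state space, actions,
        # conflict-degree reward) over the single-ring multi-graph
        # Step 4: DDQN training yields the conflict-free allocation policy
        q_net = train_ddqn(env, n_states, n_actions)  # claim 7 sketch
        return q_net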
2. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 1, wherein in step 1, the construction of the single-ring multi-graph model comprises the following steps:
S11: modeling the interference areas in the network environment according to hypergraph theory, and constructing a hypergraph model expressed as H = {V, E}, wherein V = {v_1, v_2, ..., v_n} represents the vertex set, i.e., the user set, and E = {e_1, e_2, ..., e_m} represents the hyperedge set; the relationship between the vertices and the hyperedges is represented by an incidence matrix I_h, the columns of I_h corresponding to vertices and the rows corresponding to hyperedges, wherein
I_h = {a_(i,j)}, i = 1, 2, ..., n, j = 1, 2, ..., m;
S12: converting the hypergraph model into a dual hypergraph model according to duality theory; in the dual hypergraph, E = {e_1, e_2, ..., e_m} represents the vertex set, i.e., the set of ICs, and V = {v_1, v_2, ..., v_n} represents the edge set, i.e., the user set; a ring on a vertex represents a non-overlapping user, an edge represents an overlapping user, and the dual hypergraph is associated with an incidence matrix I_dh;
S13: converting the dual hypergraph model into a multi-graph model through matrix transformation, the multi-graph model being associated with an adjacency matrix A_multi; in the multi-graph model, each edge is associated with only two vertices and multiple edges may exist between two vertices; if a hyperedge is associated with multiple vertices, the hyperedge is split into multiple edges each associated with two vertices;
S14: converting the multi-graph model into a single-ring multi-graph model, comprising:
separating the overlapping users from A_multi, wherein A_multi is expressed as:
A_multi = A_mu + A_su,
wherein A_mu represents the number of overlapping users between ICs and is obtained by setting the diagonal elements of A_multi to 1 while keeping the other values unchanged; A_su represents the number of non-overlapping users in each IC and is obtained by A_su = A_multi - A_mu;
obtaining the single-ring multi-graph model according to A_mu, wherein the single-ring multi-graph model corresponds to an incidence matrix M, the columns of which correspond to the users in the single-ring multi-graph and the rows to the hyperedges (see the sketch below).
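The decomposition in S14 is mechanical enough to state compactly. The following is a minimal sketch of the A_multi = A_mu + A_su split, assuming A_multi is an integer NumPy adjacency matrix whose diagonal counts rings and whose off-diagonal entries count edges; the function and variable names are illustrative, not from the patent.

    import numpy as np

    # Sketch of S14: force every diagonal entry of A_multi to 1 so each
    # vertex keeps exactly one ring (the single-ring form), and collect the
    # remaining non-overlapping users in A_su.
    def split_single_ring(A_multi: np.ndarray):
        A_mu = A_multi.copy()
        np.fill_diagonal(A_mu, 1)     # one ring per vertex, off-diagonals unchanged
        A_su = A_multi - A_mu         # leftover non-overlapping users per IC
        return A_mu, A_su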
3. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 2, wherein S13 comprises:
(1) calculating the number of rings on each vertex of the dual hypergraph, comprising:
summing I_dh and setting the elements other than 1 to 0 to obtain a sequence R, wherein the number of elements in R is equal to the number of edges in the dual hypergraph;
recording whether an edge is a ring by using the one-dimensional column vector R, wherein R(n) = 1 indicates that edge v_n is a ring, and R(i) denotes the i-th element of R;
performing a logical AND operation between I_dh and R and summing the resulting elements to obtain the number of rings of each vertex in the multi-graph, expressed as diag(A_multi) = I_dh · R, wherein "·" denotes the logical AND operation;
(2) calculating the edges between different vertices of the dual hypergraph, comprising:
performing a logical AND operation on the elements of any two rows of I_dh and summing the result to obtain the number of edges between the two vertices of the multi-graph corresponding to those two rows;
(3) taking the number of rings of each vertex in the multi-graph as the diagonal elements of A_multi, and taking the number of edges between different vertices as the off-diagonal elements of A_multi (see the sketch below).
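As an illustration of claim 3, the following sketch builds A_multi from a binary incidence matrix I_dh of the dual hypergraph, assuming rows index vertices (ICs) and columns index edges (users); beyond I_dh, R, and A_multi, all names are illustrative assumptions.

    import numpy as np

    # Sketch of claim 3: rings are edges incident to exactly one vertex;
    # diag(A_multi) is the row-wise sum of I_dh AND R, and each off-diagonal
    # entry is the summed logical AND of the two corresponding rows of I_dh.
    def build_A_multi(I_dh: np.ndarray) -> np.ndarray:
        n = I_dh.shape[0]
        # (1) column-sum I_dh, zero every element other than 1 -> ring vector R
        R = (I_dh.sum(axis=0) == 1).astype(int)
        A = np.zeros((n, n), dtype=int)
        # (2) rings per vertex: logical AND of each row with R, then sum
        np.fill_diagonal(A, (I_dh & R).sum(axis=1))
        # (3) edges between two vertices: logical AND of their rows, then sum
        for i in range(n):
            for j in range(i + 1, n):
                A[i, j] = A[j, i] = int((I_dh[i] & I_dh[j]).sum())
        return A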
4. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 2, wherein the resource reuse rate ρ in step 1 is defined as:
wherein Φ represents the total number of users and k_c represents the total number of resources.
5. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 2, wherein the conflict-free resource constraint in step 1 is formulated as:
wherein,
K represents the resource set;
D_Conflict(v_i, k) denotes the conflict degree of user v_i using resource k;
M_Conflict is the conflict incidence matrix, i.e., the incidence matrix M, representing the set of conflicts between user v_i and the other users;
e(v_i) is an edge set in the single-ring multi-graph model; user v_i and the edges in e(v_i) are all associated with the same vertex in the single-ring multi-graph model (see the sketch below).
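The constraint's displayed formula is not reproduced above, so the following is only one plausible reading of the conflict degree, hedged accordingly: D_Conflict(v_i, k) is taken as the number of users sharing a vertex with v_i in the single-ring multi-graph that already hold resource k, with a conflict-free allocation requiring this count to be 0. M is assumed to be a binary vertex-by-user incidence matrix and assign a per-user resource vector (0 = unallocated); these names are illustrative.

    import numpy as np

    # Hedged sketch of the conflict degree: count neighbours of v_i (users
    # incident to a common vertex in M) that are already assigned resource k.
    def conflict_degree(M: np.ndarray, assign: np.ndarray, v_i: int, k: int) -> int:
        shared = (M[:, [v_i]] & M).any(axis=0)   # users sharing a vertex with v_i
        shared[v_i] = False                      # exclude v_i itself
        return int(np.sum(shared & (assign == k)))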
6. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 5, wherein step 2 comprises the following steps:
S21: determining the action space A, wherein, for any time t, a_t ∈ A; in the presence of a target user v_i to be assigned and a resource k to be allocated, the action a_t is defined as:
a_t = {v_i, k}, k ∈ K,
wherein K represents the total resource set;
S22: determining the state space S, wherein the state s_t corresponds to the communication resource allocation situation of the users in the single-ring multi-graph model at time t, and the initial state s_0 represents the state without any resource allocation; the Markov state is defined using an extended incidence matrix, the extended incidence matrix M_a(t_i) at time t_i being expressed as:
wherein,
M(t_i) represents the incidence matrix M corresponding to the single-ring multi-graph model at time t_i;
E_k(t_i) represents the communication resources allocated to the users in the single-ring multi-graph model at time t_i, an element of 0 representing an unallocated resource;
S23: determining the instant reward function r_t, the instant reward function being based on the conflict degree:
wherein D_Conflict(t) represents the conflict degree at time t, and r(s_t, a_t) denotes the environmental reward for selecting action a_t in state s_t;
S24: determining the state transition matrix P, wherein the state transition matrix P is composed of the transition probabilities between states, the transition probability being defined as p(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = p(s_{t+1} | s_t, a_t) = p(s_t, a_t, s_{t+1}), which represents the probability that, for an arbitrary time t, the state s_t transitions to the next state s_{t+1} after applying action a_t; for a policy π, the transition probability under π represents the probability of transitioning from state s_t to the next state s_{t+1} when following π, where π(a|s) is p(a_t = a | s_t = s), which represents the probability of applying action a in state s at time t, and a ~ π(·|s), "~" denoting sampling from a probability distribution;
S25: determining the action value function q_π(s, a) under policy π, wherein q_π(s, a) represents the expected discounted return obtained after applying action a in state s under policy π, formulated as:
q_π(s, a) = E[ Σ_{k=0}^∞ γ^k · r_{t+k} | s_t = s, a_t = a ],
where E[·] represents the expectation operator and γ represents the discount factor (see the sketch below).
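To make the MDP pieces of claim 6 concrete, the following sketch stacks the incidence matrix M with the per-user assignment vector as a stand-in for the extended incidence matrix M_a(t), and uses a conflict-degree-based reward of +1 for a conflict-free assignment and a penalty proportional to the conflict degree otherwise. The patent's exact formulas for M_a(t_i) and r_t are elided in the text above, so both choices here are assumptions; the conflict_degree helper is the one sketched under claim 5.

    import numpy as np

    # Assumed stand-in for the extended incidence matrix M_a(t): the incidence
    # matrix M with the resource-assignment row E_k(t) appended.
    def extended_state(M: np.ndarray, assign: np.ndarray) -> np.ndarray:
        return np.vstack([M, assign[None, :]])

    # Assumed conflict-degree-based instant reward for action a_t = {v_i, k}.
    def instant_reward(M: np.ndarray, assign: np.ndarray, v_i: int, k: int) -> float:
        d = conflict_degree(M, assign, v_i, k)   # helper sketched under claim 5
        return 1.0 if d == 0 else -float(d)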
7. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 2, wherein step 4 comprises the following steps:
S41: initializing the DDQN network and its parameters, including initializing the main network weight w and the target network weight θ, and selecting the initial action a_0 and the initial state s_0;
S42: based on the probability threshold ε, selecting a random action a_t in the action space A; otherwise, selecting the action that maximizes the action value function;
S43: calculating the instant reward r_t according to the selected action a_t and state s_t, and obtaining the extended incidence matrix M_a(t+1) at the next time t+1;
S44: obtaining the state at time t+1 from the extended incidence matrix M_a(t+1), i.e., s_{t+1} = M_a(t+1), and storing the current experience e_t = <s_t, a_t, r_t, s_{t+1}> in the replay memory D;
S45: randomly selecting an experience from the replay memory D, the experience being used for training the weight parameter w of the main network, and updating the parameter θ = w in the target network according to the set update step length;
S46: updating the greedy factor ε, i.e.,
ε_{t+1} = max(ε_t · ε_decay, ε_min),
wherein ε_t represents the greedy factor at time t; ε_{t+1} represents the greedy factor at time t+1; ε_decay represents the set decay factor; ε_min represents the set minimum greedy factor;
S47: repeating steps S42 to S46 in each iteration until the iterations are complete;
S48: outputting the final main network weight w, the target network weight θ, and the optimal policy π*(s, a) (see the sketch below).
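The control flow of S41 to S48 maps directly onto a small PyTorch loop. The sketch below follows the claim's structure (ε-greedy selection, replay memory D, training on one randomly drawn experience, periodic θ = w synchronization, and ε decay); the network sizes, hyperparameter values, and the env object with reset()/step() are assumptions for illustration, and the double-Q target shown is the standard DDQN choice for the target-network output y_t.

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Condensed sketch of claim 7 (S41-S48) under stated assumptions.
    def train_ddqn(env, n_states, n_actions, episodes=1000, eps=1.0,
                   eps_decay=0.995, eps_min=0.05, gamma=0.9, lr=1e-3,
                   sync_every=100, mem_size=10000):
        q_net = nn.Sequential(nn.Linear(n_states, 128), nn.ReLU(),
                              nn.Linear(128, n_actions))     # main network, weights w
        target = nn.Sequential(nn.Linear(n_states, 128), nn.ReLU(),
                               nn.Linear(128, n_actions))    # target network, weights theta
        target.load_state_dict(q_net.state_dict())           # S41: theta = w
        opt = torch.optim.SGD(q_net.parameters(), lr=lr)
        D = deque(maxlen=mem_size)                           # replay memory D
        step = 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # S42: epsilon-greedy action selection
                if random.random() < eps:
                    a = random.randrange(n_actions)
                else:
                    a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
                # S43/S44: apply the action, observe r_t and s_{t+1}, store e_t in D
                s2, r, done = env.step(a)
                D.append((s, a, r, s2))
                # S45: train the main network on one randomly drawn experience
                sb, ab, rb, s2b = random.choice(D)
                with torch.no_grad():
                    nxt = torch.as_tensor(s2b, dtype=torch.float32)
                    a_star = q_net(nxt).argmax()             # main net picks the action
                    y = rb + gamma * target(nxt)[a_star]     # target net scores it (y_t)
                q = q_net(torch.as_tensor(sb, dtype=torch.float32))[ab]
                loss = (q - y) ** 2                          # L_t(w) = [q(s,a;w) - y_t]^2
                opt.zero_grad(); loss.backward(); opt.step()
                if step % sync_every == 0:
                    target.load_state_dict(q_net.state_dict())  # theta = w
                step += 1
                s = s2
            # S46: epsilon decay
            eps = max(eps * eps_decay, eps_min)
        return q_net                                         # S48: final weights / policy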
8. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 7, wherein, for the weight parameters in the main network, the agent performs the update using the gradient descent method, expressed as:
ω ← ω − α · ∇_ω L_t(ω),
wherein,
L_t(ω) = [q(s, a; ω) − y_t]^2;
α represents the gradient update step size;
L_t(ω) represents the loss function at time t; y_t represents the output of the target network at time t (see the sketch below).
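Written out with autograd on a single tensor, the claim 8 update is one gradient-descent step on the squared error; α, the toy q(s, a; ω), and y_t below are placeholders, since the claim's displayed update formula is elided in the text and plain gradient descent is assumed.

    import torch

    alpha = 0.01
    w = torch.randn(4, requires_grad=True)   # stand-in for the main-network weights
    y_t = torch.tensor(1.0)                  # output of the target network at time t

    q_value = w.sum()                        # stand-in for q(s, a; w)
    L = (q_value - y_t) ** 2                 # L_t(w) = [q(s,a;w) - y_t]^2
    L.backward()                             # compute grad_w L_t(w)
    with torch.no_grad():
        w -= alpha * w.grad                  # w <- w - alpha * grad_w L_t(w)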
9. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 7, wherein the optimal policy π*(s, a) is determined as π*(s, a) = argmax_{π∈Π} q_π(s, a), i.e., determining the optimal action value function q*(s, a) is equivalent to finding the optimal policy, Π representing the set of all policies; to determine the optimal action value function q*(s, a), a deep network is used to approximate the action value function, i.e., the policy corresponding to the maximum q(s, a; w) is found to obtain the optimal policy.
10. The network conflict-free resource allocation method based on the dual deep Q network and the conflict degree algorithm according to claim 1, wherein the determination of the user interference relationship comprises the following steps:
calculating the maximum distance R_(i,j) within which any two users v_i and v_j can interfere with each other;
acquiring the actual distance l_(i,j) between users v_i and v_j;
determining the interference condition Inter_(i,j) between users v_i and v_j according to R_(i,j) and l_(i,j), wherein Inter_(i,j) = 1 if l_(i,j) ≤ R_(i,j) and Inter_(i,j) = 0 otherwise;
Inter_(i,j) = 1 represents that v_i interferes with v_j, and Inter_(i,j) = 0 represents no interference (see the sketch below).
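Claim 10 reduces to a distance comparison; the natural reading, assumed here, is that two users interfere exactly when the actual distance does not exceed the maximum interference distance. The user positions and the computation of R_(i,j) itself are outside this sketch.

    import math

    # Sketch of claim 10: Inter_(i,j) = 1 iff l_(i,j) <= R_(i,j).
    def interferes(pos_i, pos_j, R_ij: float) -> int:
        l_ij = math.dist(pos_i, pos_j)       # actual distance l_(i,j)
        return 1 if l_ij <= R_ij else 0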
CN202311173545.8A 2023-09-12 2023-09-12 Network conflict-free resource allocation method based on double deep Q network and conflict degree algorithm Pending CN117202377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311173545.8A CN117202377A (en) 2023-09-12 2023-09-12 Network conflict-free resource allocation method based on double deep Q network and conflict degree algorithm

Publications (1)

Publication Number Publication Date
CN117202377A true CN117202377A (en) 2023-12-08

Family

ID=88993837


Country Status (1)

Country Link
CN (1) CN117202377A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112954651A (en) * 2021-03-12 2021-06-11 南京航空航天大学 Low-delay high-reliability V2V resource allocation method based on deep reinforcement learning
CN116261224A (en) * 2022-11-29 2023-06-13 重庆川南环保科技有限公司 Wireless network resource allocation method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUICHEN ZHANG et al.: "Joint Coordinated Beamforming and Power Splitting Ratio Optimization in MU-MISO SWIPT-Enabled HetNets: A Multi-Agent DDQN-Based Approach", IEEE Journal on Selected Areas in Communications, vol. 40, no. 2, 15 October 2021 (2021-10-15) *
YANG FAN, YANG CHENG, HUANG JIE, ZHANG SHILONG, YU TAO, ZUO XUN, YANG CHUAN: "Resource allocation strategy based on deep reinforcement learning in 6G dense networks", Journal on Communications, vol. 44, no. 08, 23 August 2023 (2023-08-23) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117715218A (en) * 2023-12-15 2024-03-15 重庆理工大学 Hypergraph-based D2D auxiliary ultra-dense Internet of things resource management method and system
CN117715218B (en) * 2023-12-15 2024-06-07 重庆理工大学 Hypergraph-based D2D auxiliary ultra-dense Internet of things resource management method and system
CN118139014A (en) * 2024-04-08 2024-06-04 重庆理工大学 MIoV joint interference resource allocation method and system based on robust learning
CN118139014B (en) * 2024-04-08 2024-07-26 重庆理工大学 MIoV joint interference resource allocation method and system based on robust learning

Similar Documents

Publication Publication Date Title
Wei et al. Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor–critic deep reinforcement learning
CN111405568B (en) Computing unloading and resource allocation method and device based on Q learning
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN111405569A (en) Calculation unloading and resource allocation method and device based on deep reinforcement learning
CN111586720A (en) Task unloading and resource allocation combined optimization method in multi-cell scene
CN114698128B (en) Anti-interference channel selection method and system for cognitive satellite-ground network
KR102206775B1 (en) Method for allocating resource using machine learning in a wireless network and recording medium for performing the method
CN117202377A (en) Network conflict-free resource allocation method based on double deep Q network and conflict degree algorithm
CN113098714A (en) Low-delay network slicing method based on deep reinforcement learning
CN111262638B (en) Dynamic spectrum access method based on efficient sample learning
CN111813539A (en) Edge computing resource allocation method based on priority and cooperation
CN112383369A (en) Cognitive radio multi-channel spectrum sensing method based on CNN-LSTM network model
CN110996365B (en) Heterogeneous network vertical switching algorithm and system based on multi-objective optimization model
CN117440442B (en) Internet of things resource conflict-free distribution method and system based on graph reinforcement learning
Benmammar et al. A pareto optimal multi-objective optimisation for parallel dynamic programming algorithm applied in cognitive radio ad hoc networks
CN117395687A (en) Star-ground fusion network node resource deployment method, system and electronic equipment
CN113992520B (en) Virtual network resource deployment method and system
CN116193546A (en) Dynamic generalized user NOMA grouping CCHN-MEC network unloading decision optimization method
Li et al. Dynamic multi-channel access in wireless system with deep reinforcement learning
Chowdhury et al. Throughput maximization of cognitive radio network by conflict-free link allocation using neural network
CN117715218B (en) Hypergraph-based D2D auxiliary ultra-dense Internet of things resource management method and system
CN112383965B (en) Cognitive radio power distribution method based on DRQN and multi-sensor model
CN112118632B (en) Adaptive power distribution system, method and medium for micro-cell base station
Lee et al. Learning multi-objective network optimizations
Ko et al. Two-Phase Split Computing Framework in Edge-Cloud Continuum

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination