CN111246438A - Method for selecting relay node in M2M communication based on reinforcement learning

Method for selecting relay node in M2M communication based on reinforcement learning

Info

Publication number
CN111246438A
Authority
CN
China
Prior art keywords
relay node
reinforcement learning
communication
energy
selecting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010040442.4A
Other languages
Chinese (zh)
Other versions
CN111246438B (en)
Inventor
潘甦
吴子秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010040442.4A priority Critical patent/CN111246438B/en
Publication of CN111246438A publication Critical patent/CN111246438A/en
Application granted granted Critical
Publication of CN111246438B publication Critical patent/CN111246438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/70 Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • H04W 40/00 Communication routing or communication path finding
    • H04W 40/02 Communication route or path selection, e.g. power-based or shortest path routing
    • H04W 40/04 Communication route or path selection based on wireless node resources
    • H04W 40/10 Communication route or path selection based on available power or energy
    • H04W 40/12 Communication route or path selection based on transmission quality or channel quality
    • H04W 40/22 Communication route or path selection using selective relaying for reaching a BTS [Base Transceiver Station] or an access point
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a method for selecting a relay node in M2M communication based on reinforcement learning, which comprises the following steps: S1, building a reinforcement learning model framework and determining the action set, state set and reward value of reinforcement learning; S2, refining the framework of the reinforcement learning model; S3, iterating the reinforcement learning model according to the specific problem; and S4, adjusting the reinforcement learning model according to the training results, and applying the adjusted model to select the relay node. The invention incorporates the energy consumption of each communication and the energy balance of the batteries powering the relay nodes into the reinforcement learning objective, so that the system strikes a balance between reducing energy consumption and achieving energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.

Description

Method for selecting relay node in M2M communication based on reinforcement learning
Technical Field
The invention relates to a relay selection method, and in particular to a reinforcement-learning-based method for selecting a relay node in M2M communication where the relays are battery powered, belonging to the technical field of wireless communication.
Background
In recent years, with the continuous development of computer technology, machine learning has gradually become a core technology of artificial intelligence and has attracted extensive attention from researchers in the industry.
Specifically, machine learning can be roughly classified into supervised learning, unsupervised learning and reinforcement learning, according to whether the sample data in its database carry labels. Reinforcement learning is a learning method that takes environmental feedback as input and is guided by statistics and dynamic programming techniques. Its inspiration comes from behaviorism in psychology: under the stimuli of rewards or penalties given by the environment, an organism gradually forms expectations of the stimuli and develops habitual behaviors that yield the greatest benefit. Unlike the other two machine learning approaches described above, reinforcement learning focuses on the interaction of an agent with an uncertain environment, emphasizing how to act based on the environment so as to maximize the expected benefit. Currently, reinforcement learning has been widely applied in many related fields; in the communication field, it is typically applied to scenarios such as radio resource allocation and relay selection.
In the field of wireless communication, relay technology has been one of the research hotspots in recent years. It arranges a number of relay nodes in a wireless communication network, and these relay nodes can assist in transmitting messages to the destination node, thereby effectively overcoming wireless channel fading, improving the communication quality of the wireless system and expanding the coverage of the communication system.
Further, the processing modes of a relay node for the signal mainly comprise amplify-and-forward and decode-and-forward. Relay node selection generally follows one of four schemes, namely the maximum signal-to-noise ratio scheme, the nearest neighbor scheme, the best worst channel scheme and the minimum harmonic mean scheme; the basic reference for these algorithms is the channel conditions among the source node, the relay nodes and the destination node.
When the existing relay node selection methods are executed, technicians have found that the relay nodes with better channel conditions in the wireless communication network are selected for information transmission far more frequently than others, so the batteries powering those relay nodes are drained faster than the batteries of the other relay nodes; this causes energy imbalance in the system and shortens the service life of the whole system.
In summary, how to provide a new relay node selection method on the basis of the prior art, so as to overcome the above defects as far as possible, is a common research objective of those skilled in the art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a reinforcement-learning-based method for selecting a relay node in M2M communication, aimed at battery-powered relay selection, as follows.
A method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system and comprises the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the framework of the reinforcement learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
Preferably, the S1 includes the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning;
s12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning;
and S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system.
Preferably, the S11 includes the steps of,
recording the number of communication devices in the M2M communication system as N and the number of relay nodes as M, the agent action is denoted as a ∈ A = {1, 2, …, M}, namely the number of the selected relay node;
the S12 includes the steps of,
when the remaining energy of the battery of the relay node with the number i is P_i, the number of the communication destination device is n, and the reinforcement learning state S is
S = [P_1, P_2, …, P_M, n];
The S13 includes the steps of,
according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system, the energy consumed by each relay node in the single communication selection process is calculated in the decode-and-forward mode; the energy consumption and the balance condition of the relay node batteries are then combined into the reward of reinforcement learning, the multi-objective problem is converted into a single-objective problem, and the desired balance point between the two objectives is adjusted through the parameters.
Preferably, in S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as D_max, which represents the difference between the maximum and minimum battery energies among the relay nodes; the reward R of reinforcement learning is
R = r_0 - ξ_e E - δ ξ_d D_max,
wherein ξ_e and ξ_d respectively represent the weights occupied by the energy consumption and the energy balance, δ maps the energy consumption and the energy balance to a similar range, and r_0 is a positive number with a value range of
max(ξ_e E + δ ξ_d D_max) < r_0 < 2*max(ξ_e E + δ ξ_d D_max).
Preferably, the S2 includes the following steps:
s21, performing bucketing according to the energy consumed by each relay node;
s22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance condition of the relay nodes;
and S23, updating the action selection strategy.
Preferably, the S21 includes the steps of,
performing bucketing according to the energy consumed by each relay node in the single communication selection process obtained in S1, so that the energy consumption is reduced from a continuous range to a finite number of levels;
the S22 includes the steps of,
let the minimum battery energy among the relay nodes be P_min, and the updated remaining battery energy of the relay node with the number i be denoted as p_i; the reinforcement learning state S is modified to
S = [p_1, p_2, …, p_M, n],
wherein p_i = P_i - P_min;
The S23 includes the steps of,
the value function Q(s, a) of the state-action pair in the Q-Learning algorithm is updated, the updating process being
Q_{t+1}(s, a) = (1 - α)Q_t(s, a) + α[R_t + γ max_{b∈A} Q_t(s', b)],
wherein s represents the current state, s' represents the state after the state transition, Q_t and Q_{t+1} respectively represent the Q values before and after the update, α represents the learning rate of reinforcement learning (a larger α means the update depends more on the immediate reward), R_t represents the reward R obtained in the current iteration, and γ is a discount factor representing the importance of future rewards, with a value less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated by adopting a Softmax function.
Preferably, the S3 includes the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξ_e and ξ_d are determined according to the desired trade-off between energy consumption and energy balance, with ξ_e + ξ_d = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations.
Preferably, the S32 includes the following steps:
the probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data; after one communication is completed, the reward value is calculated from the energy consumption of that communication and the energy balance condition after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function; when the next communication task arrives and the destination node to which a message is to be sent is determined, the state of the model is updated.
Preferably, in said S4,
if the reinforcement learning model has not converged after the set maximum number of iterations is reached, the learning rate is reduced and the iterative training is carried out again;
in the iteration process, before each message is sent, the expected cumulative reward obtainable by selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
Compared with the prior art, the invention has the advantages that:
the invention provides a relay node selection method in M2M communication based on reinforcement learning, which brings the energy consumption condition of each communication and the energy balance condition of a battery for supplying energy to the relay node into the goal of reinforcement learning, so that the system is balanced in reducing energy consumption and realizing energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Meanwhile, the method of the invention can lead the system to autonomously finish the selection of the relay node, has strong self-adaptability, and does not need to reset the selection mode of the relay node even if new M2M equipment is added into the system midway, thereby providing convenience for the practical application of the method of the invention.
In addition, the method also provides reference for other related problems in the same field, has very wide application prospect, can be expanded and extended on the basis of the problem, and is applied to other technical schemes related to relay node selection in a wireless communication system.
The following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings to facilitate understanding of the technical solutions of the invention.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a system model according to the present invention.
Detailed Description
The invention discloses a method for selecting a relay node in M2M communication based on reinforcement learning; the scheme of the invention is further explained below in conjunction with an embodiment.
The overall idea of the invention is as follows: exploiting the low mobility of M2M communication devices, the energy consumed when each relay node is selected for a single communication is calculated in the Decode-and-Forward mode, based on the channel condition from each relay node to the M2M communication devices and the channel condition from the source node to each relay node. The selection of a relay node is then defined as the Action in reinforcement learning, the energy consumed by a single communication is combined with the balance of the relay nodes' battery energies to serve as the Reward, and the remaining battery energy of each relay node together with the destination node of the next communication is taken as the state. The Q-Learning algorithm with a Softmax action selection strategy is then iterated until convergence.
As shown in FIG. 1 and FIG. 2, the method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system comprising a source node, relay nodes and M2M communication devices, and includes the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the framework of the reinforcement learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
Specifically, the S1 includes the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning; that is, the number of communication devices in the M2M communication system is denoted as N, the number of relay nodes is denoted as M, and the agent action is denoted as a ∈ A = {1, 2, …, M}, i.e., the number of the selected relay node.
S12, setting the remaining battery energy of each relay node and the destination node of the next communication as the state of reinforcement learning; that is, the remaining energy of the battery of the relay node with the number i is P_i, the number of the communication destination device is n, and the reinforcement learning state S is
S = [P_1, P_2, …, P_M, n].
S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system;
that is, the energy consumed by each relay node in a single communication selection process is calculated in the Decode-and-Forward (DF) mode according to those channel conditions; the energy consumption and the balance condition of the relay node batteries are then combined into the reward of reinforcement learning, the multi-objective problem is converted into a single-objective problem, and the desired balance point between the two objectives is adjusted through the parameters.
In S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as D_max, which represents the difference between the maximum and minimum battery energies among the relay nodes. The reward R of reinforcement learning is
R = r_0 - ξ_e E - δ ξ_d D_max,
wherein ξ_e and ξ_d respectively represent the weights occupied by the energy consumption and the energy balance; δ maps the energy consumption and the energy balance to a similar range; r_0 is a positive number that avoids negative reward values and can be adjusted according to the desired requirements on energy consumption and energy balance, its value typically lying in the range
max(ξ_e E + δ ξ_d D_max) < r_0 < 2*max(ξ_e E + δ ξ_d D_max).
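A minimal sketch of this reward, assuming E and D_max have already been obtained for the current communication; the weights, δ and r_0 shown are placeholder values, and in practice r_0 must be chosen inside the range given above so that the reward stays positive.

```python
# Sketch of R = r0 - xi_e*E - delta*xi_d*D_max with placeholder parameter values.

def reinforcement_reward(E: float, D_max: float,
                         xi_e: float = 0.5, xi_d: float = 0.5,
                         delta: float = 1.0, r0: float = 10.0) -> float:
    """xi_e + xi_d = 1; r0 should lie between max(xi_e*E + delta*xi_d*D_max)
    and twice that value so that the reward remains positive."""
    return r0 - xi_e * E - delta * xi_d * D_max
```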
Specifically, the S2 includes the following steps:
s21, performing bucketing according to the energy consumed by each relay node; that is, the energy consumed by each relay node in the single communication selection process obtained in S1 is bucketed, so that the energy consumption is reduced from a continuous range to a finite number of levels.
S22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance condition of the relay nodes; namely, the minimum battery energy among the relay nodes is recorded as P_min, the updated remaining battery energy of the relay node with the number i is recorded as p_i, and the reinforcement learning state S is modified to
S = [p_1, p_2, …, p_M, n],
wherein p_i = P_i - P_min.
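The sketch below illustrates S21 and S22 together: mapping a continuous per-relay energy cost to a small number of buckets, and building the state from the battery energies measured relative to their minimum. The bucket edges and the rounding granularity are assumptions, not values taken from the patent.

```python
# Sketch of S21 (bucketing the energy cost) and S22 (state relative to the minimum
# battery energy). Bucket edges and rounding are illustrative assumptions.
import bisect

ENERGY_BUCKET_EDGES = [1e-4, 1e-3, 1e-2]   # assumed bucket boundaries

def bucketize_energy(energy: float) -> int:
    """Map a continuous energy value to a finite bucket index (S21)."""
    return bisect.bisect_left(ENERGY_BUCKET_EDGES, energy)

def build_state(battery_levels, destination_id: int) -> tuple:
    """State S = (p_1, ..., p_M, n) with p_i = P_i - P_min (S22)."""
    p_min = min(battery_levels)
    return tuple(round(p - p_min, 3) for p in battery_levels) + (destination_id,)
```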
S23, updating the action selection strategy; namely, the value function Q(s, a) of the state-action pair in the Q-Learning algorithm is updated, the updating process being
Q_{t+1}(s, a) = (1 - α)Q_t(s, a) + α[R_t + γ max_{b∈A} Q_t(s', b)],
wherein s represents the current state, s' represents the state after the state transition, Q_t and Q_{t+1} respectively represent the Q values before and after the update, α represents the learning rate of reinforcement learning (a larger α means the update depends more on the immediate reward), R_t represents the reward R obtained in the current iteration, and γ is a discount factor representing the importance of future rewards, with a value usually less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated by adopting a Softmax function.
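A tabular sketch of the update and the Softmax policy described above. The Q-table is kept as a dictionary keyed by (state, action); the number of relays M, the learning rate, the discount factor and the temperature tau are hypothetical placeholder values.

```python
# Sketch of the Q-Learning value update and the Softmax action-selection policy.
import math
from collections import defaultdict

M = 5                          # example number of relay nodes
ACTIONS = tuple(range(M))      # action a = index of the selected relay node
Q = defaultdict(float)         # Q(s, a), initialised to 0

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q_{t+1}(s,a) = (1-alpha)*Q_t(s,a) + alpha*[R_t + gamma*max_b Q_t(s',b)]."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def softmax_policy(s, tau=1.0):
    """pi(s, a): probability of selecting each relay node in state s."""
    prefs = [Q[(s, a)] / tau for a in ACTIONS]
    peak = max(prefs)                              # subtract the max for numerical stability
    weights = [math.exp(p - peak) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]
```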
Specifically, the S3 includes the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξ_e and ξ_d are determined according to the desired trade-off between energy consumption and energy balance, with ξ_e + ξ_d = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations. It should be noted that if α is too small the model learns slowly and performs poorly in the early stage, while if it is too large the model may eventually fail to converge.
The probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data. After one communication is completed, the reward value R is calculated from the energy consumption E of that communication and the energy balance D_max after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function. When the next communication task arrives and the destination node to which a message is to be sent is determined, the state of the model is updated.
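Putting the pieces together, the loop below follows the iteration just described, but against a toy stand-in environment: the transmission step returns random energy figures, whereas in the real system they would come from the channel and battery model above. It reuses reinforcement_reward, softmax_policy, q_update and ACTIONS from the earlier sketches; every other name here is an illustrative assumption.

```python
# Sketch of the iterative training loop of S3 against a toy stand-in environment.
import random

def toy_transmission(state, action):
    """Stand-in for one real communication: random cost, imbalance and next destination."""
    E = random.uniform(0.0, 1.0)                      # energy consumed by this communication
    D_max = random.uniform(0.0, 1.0)                  # battery imbalance after it
    next_state = state[:-1] + (random.randrange(3),)  # same battery part, new destination id
    return E, D_max, next_state

def train(initial_state, K=10_000, alpha=0.1, gamma=0.9):
    s = initial_state
    for _ in range(K):
        probs = softmax_policy(s)
        a = random.choices(ACTIONS, weights=probs, k=1)[0]   # sample a relay from pi(s, .)
        E, D_max, s_next = toy_transmission(s, a)
        r = reinforcement_reward(E, D_max)
        q_update(s, a, r, s_next, alpha=alpha, gamma=gamma)
        s = s_next
```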
Specifically, in step S4,
if the reinforcement learning model has not converged after the set maximum number of iterations is reached, the learning rate is reduced and the iterative training is carried out again;
in the iteration process, before each message is sent, the expected cumulative reward obtainable by selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
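Once training has converged, the selection rule of S4 reduces to choosing, in the current state, the relay whose expected cumulative reward (Q value) is largest; the sketch below assumes the Q-table and ACTIONS from the earlier sketches.

```python
# Sketch of the deployment-time rule in S4: forward the message via the relay whose
# Q value (expected cumulative reward) is largest in the current state.
def select_relay(state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```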
By applying this model to the selection of relay nodes, the battery energies of the relay nodes can be kept balanced while consuming less energy; the expectations for the two objectives can also be adjusted according to the results, so that the desired balance point between energy consumption and energy balance is found.
In summary, the method for selecting a relay node in M2M communication based on reinforcement learning provided by the invention incorporates the energy consumption of each communication and the energy balance of the batteries powering the relay nodes into the reinforcement learning objective, so that the system strikes a balance between reducing energy consumption and achieving energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Meanwhile, the method of the invention enables the system to complete the selection of the relay node autonomously and has strong adaptability: even if new M2M devices join the system midway, the relay node selection scheme does not need to be reconfigured, which facilitates the practical application of the method.
In addition, the method also provides a reference for other related problems in the same field and has a very broad application prospect; it can be expanded and extended on the basis of this problem and applied to other technical schemes related to relay node selection in wireless communication systems.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that can be understood by those skilled in the art.

Claims (9)

1. A method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system, and is characterized by comprising the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the framework of the reinforcement learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
2. The method for selecting a relay node in reinforcement learning based M2M communication according to claim 1, wherein the S1 comprises the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning;
s12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning;
and S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system.
3. The method of claim 2, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
the S11 includes the steps of,
recording the number of communication devices in an M2M communication system as N and the number of relay nodes as M, the agent action is denoted as a ∈ A = {1, 2, …, M}, namely the number of the selected relay node;
the S12 includes the steps of,
when the remaining energy of the battery of the relay node with the number i is P_i, the number of the communication destination device is n, and the reinforcement learning state S is
S = [P_1, P_2, …, P_M, n];
The S13 includes the steps of,
according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system, the energy consumed by each relay node in the single communication selection process is calculated in the decode-and-forward mode; the energy consumption and the balance condition of the relay node batteries are then combined into the reward of reinforcement learning, the multi-objective problem is converted into a single-objective problem, and the desired balance point between the two objectives is adjusted through the parameters.
4. The method of claim 3, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
in S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as D_max, which represents the difference between the maximum and minimum battery energies among the relay nodes; the reward R of the reinforcement learning is
R = r_0 - ξ_e E - δ ξ_d D_max,
wherein ξ_e and ξ_d respectively represent the weights occupied by the energy consumption and the energy balance, δ maps the energy consumption and the energy balance to a similar range, and r_0 is a positive number with a value range of
max(ξ_e E + δ ξ_d D_max) < r_0 < 2*max(ξ_e E + δ ξ_d D_max).
5. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 4, wherein the S2 comprises the following steps:
s21, performing bucketing according to the energy consumed by each relay node;
s22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance condition of the relay nodes;
and S23, updating the action selection strategy.
6. The method of claim 5, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
the S21 includes the steps of,
performing bucketing according to the energy consumed by each relay node in the single communication selection process obtained in S1, so that the energy consumption is reduced from a continuous range to a finite number of levels;
the S22 includes the steps of,
let the minimum battery energy among the relay nodes be P_min, and the updated remaining battery energy of the relay node with the number i be denoted as p_i; the reinforcement learning state S is modified to
S = [p_1, p_2, …, p_M, n],
wherein p_i = P_i - P_min;
The S23 includes the steps of,
the value function Q(s, a) of the state-action pair in the Q-Learning algorithm is updated, the updating process being
Q_{t+1}(s, a) = (1 - α)Q_t(s, a) + α[R_t + γ max_{b∈A} Q_t(s', b)],
wherein s represents the current state, s' represents the state after the state transition, Q_t and Q_{t+1} respectively represent the Q values before and after the update, α represents the learning rate of reinforcement learning (a larger α means the update depends more on the immediate reward), R_t represents the reward R obtained in the current iteration, and γ is a discount factor representing the importance of future rewards, with a value less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated by adopting a Softmax function.
7. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 6, wherein the S3 comprises the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξ_e and ξ_d are determined according to the desired trade-off between energy consumption and energy balance, with ξ_e + ξ_d = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations.
8. The method of claim 7, wherein the S32 comprises:
the probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data; after one communication is completed, the reward value is calculated from the energy consumption of that communication and the energy balance condition after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function; when the next communication task arrives and the destination node to which a message is to be sent is determined, the state of the model is updated.
9. The method of claim 8, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises: in the above-mentioned S4, the method,
if the reinforcement learning model has not converged after the set maximum number of iterations is reached, the learning rate is reduced and the iterative training is carried out again;
in the iteration process, before each message is sent, the expected cumulative reward obtainable by selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
CN202010040442.4A 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning Active CN111246438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040442.4A CN111246438B (en) 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010040442.4A CN111246438B (en) 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111246438A true CN111246438A (en) 2020-06-05
CN111246438B CN111246438B (en) 2023-03-24

Family

ID=70877794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040442.4A Active CN111246438B (en) 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111246438B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106304239A (en) * 2016-08-12 2017-01-04 梁广俊 The relay selection method of energy acquisition multi-relay cooperation communication system
CN108282238A (en) * 2018-01-29 2018-07-13 厦门大学 A kind of relay selection method and device based on network coding technique
CN108712760A (en) * 2018-03-29 2018-10-26 北京邮电大学 High-throughput relay selection method based on random Learning Automata and fuzzy algorithmic approach
CN108848540A (en) * 2018-06-20 2018-11-20 中国联合网络通信集团有限公司 Relay node selecting method and system
CN109195207A (en) * 2018-07-19 2019-01-11 浙江工业大学 A kind of energy-collecting type wireless relay network througput maximization approach based on deeply study
CN110636523A (en) * 2019-09-20 2019-12-31 中南大学 Millimeter wave mobile backhaul link energy efficiency stabilization scheme based on Q learning
CN110650480A (en) * 2019-09-24 2020-01-03 南京邮电大学 Wireless network relay and interference node selection method facing safe communication

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUHAN SU et al.: "Cooperative Communications With Relay Selection Based on Deep Reinforcement Learning in Wireless Sensor Networks", IEEE Sensors Journal *
吴子秋: "Relay selection algorithm based on deep reinforcement learning in M2M communication" (M2M通信中基于深度强化学习的中继选择算法), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series *
宋夫静: "Relay selection scheme for cooperative communication based on reinforcement learning" (基于增强学习的协作通信中继选择方案), Information Technology and Informatization *
郑姗姗: "Research on machine-learning-based radio resource management in M2M communication" (M2M通信中基于机器学习的无线资源管理的研究), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112039767A (en) * 2020-08-11 2020-12-04 山东大学 Multi-data center energy-saving routing method and system based on reinforcement learning
CN113726664A (en) * 2021-08-03 2021-11-30 哈尔滨工程大学 Internet of vehicles node forwarding utility learning method based on double updating strategies
CN113780482A (en) * 2021-11-12 2021-12-10 中国科学院理化技术研究所 Intelligent detection method for abnormity of energy router

Also Published As

Publication number Publication date
CN111246438B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN111246438B (en) Method for selecting relay node in M2M communication based on reinforcement learning
CN110488861B (en) Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
Iqbal et al. Double deep Q-network-based energy-efficient resource allocation in cloud radio access network
CN108075975B (en) Method and system for determining route transmission path in Internet of things environment
CN111491358B (en) Adaptive modulation and power control system based on energy acquisition and optimization method
CN109769280B (en) WIFI intelligent prediction switching method based on machine learning
Esmaeili et al. CMML: Combined metaheuristic-machine learning for adaptable routing in clustered wireless sensor networks
CN113473480B (en) Improved reinforcement learning network coverage optimization method facing cellular network
CN111865474B (en) Wireless communication anti-interference decision method and system based on edge calculation
CN115065678A (en) Multi-intelligent-device task unloading decision method based on deep reinforcement learning
Lu et al. Learning deterministic policy with target for power control in wireless networks
CN117641542B (en) Method for online optimization of power consumption strategy of terminal of Internet of things
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
CN116185523A (en) Task unloading and deployment method
Xu et al. Energy consumption minimization for data collection from wirelessly-powered IoT sensors: Session-specific optimal design with DRL
Song et al. Fast-DRD: Fast decentralized reinforcement distillation for deadline-aware edge computing
Yang et al. On sampling time maximization in wireless powered Internet of Things
CN102077526A (en) Method, apparatus and computer program product for distributed information management
Paterova et al. Data-driven self-learning controller design approach for power-aware iot devices based on double q-learning strategy
CN116976523A (en) Distributed economic dispatching method based on partially observable reinforcement learning
CN110705756A (en) Electric power energy consumption optimization control method based on input convex neural network
Zhao et al. MEDIA: An incremental DNN based computation offloading for collaborative cloud-edge computing
CN114611823B (en) Optimized dispatching method and system for electricity-cold-heat-gas multi-energy-demand typical park
CN113938978B (en) Heterogeneous wireless sensor path finding method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant