CN111246438A - Method for selecting relay node in M2M communication based on reinforcement learning

Method for selecting relay node in M2M communication based on reinforcement learning

Info

Publication number
CN111246438A
Authority
CN
China
Prior art keywords
relay node
reinforcement learning
communication
energy
selecting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010040442.4A
Other languages
Chinese (zh)
Other versions
CN111246438B (en)
Inventor
潘甦
吴子秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010040442.4A priority Critical patent/CN111246438B/en
Publication of CN111246438A publication Critical patent/CN111246438A/en
Application granted granted Critical
Publication of CN111246438B publication Critical patent/CN111246438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/70 Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • H04W 40/00 Communication routing or communication path finding
    • H04W 40/02 Communication route or path selection, e.g. power-based or shortest path routing
    • H04W 40/04 Communication route or path selection based on wireless node resources
    • H04W 40/10 Communication route or path selection based on available power or energy
    • H04W 40/12 Communication route or path selection based on transmission quality or channel quality
    • H04W 40/22 Communication route or path selection using selective relaying for reaching a BTS [Base Transceiver Station] or an access point
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a method for selecting a relay node in M2M communication based on reinforcement learning, which comprises the following steps: S1, building a reinforcement learning model framework and determining the action set, state set and reward value of reinforcement learning; S2, refining the framework of the reinforcement learning model; S3, iterating the reinforcement learning model according to the specific problem; and S4, adjusting the reinforcement learning model according to the training results, and applying the adjusted model to select the relay node. The invention incorporates the energy consumption of each communication and the energy balance of the batteries powering the relay nodes into the reinforcement learning objective, so that the system strikes a balance between reducing energy consumption and achieving energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.

Description

Method for selecting relay node in M2M communication based on reinforcement learning
Technical Field
The invention relates to a relay selection method, and in particular to a reinforcement-learning-based method for selecting a relay node in M2M communication where the relays are battery powered, belonging to the technical field of wireless communication.
Background
In recent years, with the continuous development of computer technology, machine learning has gradually become a core technology of artificial intelligence and has attracted extensive attention from researchers in the industry.
Specifically, machine learning can be roughly classified into supervised learning, unsupervised learning and reinforcement learning, according to whether the sample data in its database carry labels. Reinforcement learning is a learning method that takes environmental feedback as input and is guided by statistics and dynamic programming techniques. Its inspiration comes from behaviorism in psychology: under the stimuli of rewards or penalties given by the environment, an organism gradually forms expectations of the stimuli and develops habitual behaviors that yield the greatest benefit. Unlike the other two machine learning approaches described above, reinforcement learning focuses on the interaction of an agent with an uncertain environment, emphasizing how to act based on the environment so as to maximize the expected benefit. Currently, reinforcement learning has been widely applied in many related fields; in the communication field, it is typically applied to scenarios such as radio resource allocation and relay selection.
In the field of wireless communication, relay technology has been one of the research hotspots in recent years. It arranges a number of relay nodes in a wireless communication network, and these relay nodes can assist in transmitting messages to the destination node, thereby effectively overcoming wireless channel fading, improving the communication quality of the wireless system and expanding the coverage of the communication system.
Further, the processing modes of a relay node for the signal mainly comprise amplify-and-forward and decode-and-forward. Relay node selection generally follows one of four schemes, namely the maximum signal-to-noise ratio scheme, the nearest neighbor scheme, the best worst channel scheme and the minimum harmonic mean scheme; the basic reference for these algorithms is the channel conditions among the source node, the relay nodes and the destination node.
When the existing relay node selection methods are executed, technicians have found that the relay nodes with better channel conditions in the wireless communication network are selected for information transmission far more frequently than others, so the batteries powering those relay nodes are drained faster than the batteries of the other relay nodes; this causes energy imbalance in the system and shortens the service life of the whole system.
In summary, how to provide a new relay node selection method on the basis of the prior art, so as to overcome the above defects as far as possible, is a common research objective of those skilled in the art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a reinforcement-learning-based method for selecting a relay node in M2M communication, aimed at battery-powered relay selection, as follows.
A method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system and comprises the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the framework of the reinforcement learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
Preferably, the S1 includes the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning;
s12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning;
and S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system.
Preferably, the S11 includes the steps of,
recording the number of communication devices in the M2M communication system as N and the number of relay nodes as M, the agent action is denoted as a ∈ A = {1, 2, …, M}, namely the number of the selected relay node;
the S12 includes the steps of,
when the remaining energy of the battery of the relay node with the number i is P_i, the number of the communication destination device is n, and the reinforcement learning state S is
S = [P_1, P_2, …, P_M, n];
The S13 includes the steps of,
according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system, the energy consumed by each relay node in the single communication selection process is calculated in the decode-and-forward mode; the energy consumption and the balance condition of the relay node batteries are then combined into the reward of reinforcement learning, the multi-objective problem is converted into a single-objective problem, and the desired balance point between the two objectives is adjusted through the parameters.
Preferably, in S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as D_max, which represents the difference between the maximum and minimum battery energies among the relay nodes; the reward R of reinforcement learning is
R = r_0 - ξ_e E - δ ξ_d D_max,
wherein ξ_e and ξ_d respectively represent the weights occupied by the energy consumption and the energy balance, δ maps the energy consumption and the energy balance to a similar range, and r_0 is a positive number with a value range of
max(ξ_e E + δ ξ_d D_max) < r_0 < 2*max(ξ_e E + δ ξ_d D_max).
Preferably, the S2 includes the following steps:
s21, performing bucketing according to the energy consumed by each relay node;
s22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance condition of the relay nodes;
and S23, updating the action selection strategy.
Preferably, the S21 includes the steps of,
performing bucketing according to the energy consumed by each relay node in the single communication selection process obtained in S1, so that the energy consumption is reduced from a continuous range to a finite number of levels;
the S22 includes the steps of,
let the minimum battery energy among the relay nodes be P_min, and the updated remaining battery energy of the relay node with the number i be denoted as p_i; the reinforcement learning state S is modified to
S = [p_1, p_2, …, p_M, n],
wherein p_i = P_i - P_min;
The S23 includes the steps of,
the value function Q(s, a) of the state-action pair in the Q-Learning algorithm is updated, the updating process being
Q_{t+1}(s, a) = (1 - α)Q_t(s, a) + α[R_t + γ max_{b∈A} Q_t(s', b)],
wherein s represents the current state, s' represents the state after the state transition, Q_t and Q_{t+1} respectively represent the Q values before and after the update, α represents the learning rate of reinforcement learning (a larger α means the update depends more on the immediate reward), R_t represents the reward R obtained in the current iteration, and γ is a discount factor representing the importance of future rewards, with a value less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated by adopting a Softmax function.
Preferably, the S3 includes the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξ_e and ξ_d are determined according to the desired trade-off between energy consumption and energy balance, with ξ_e + ξ_d = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations.
Preferably, the S32 includes the following steps:
the probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data; after one communication is completed, the reward value is calculated from the energy consumption of that communication and the energy balance condition after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function; when the next communication task arrives and the destination node to which a message is to be sent is determined, the state of the model is updated.
Preferably, in said S4,
if the reinforcement learning model has not converged after the set maximum number of iterations is reached, the learning rate is reduced and the iterative training is carried out again;
in the iteration process, before each message is sent, the expected cumulative reward obtainable by selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
Compared with the prior art, the invention has the advantages that:
the invention provides a relay node selection method in M2M communication based on reinforcement learning, which brings the energy consumption condition of each communication and the energy balance condition of a battery for supplying energy to the relay node into the goal of reinforcement learning, so that the system is balanced in reducing energy consumption and realizing energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Meanwhile, the method of the invention can lead the system to autonomously finish the selection of the relay node, has strong self-adaptability, and does not need to reset the selection mode of the relay node even if new M2M equipment is added into the system midway, thereby providing convenience for the practical application of the method of the invention.
In addition, the method also provides reference for other related problems in the same field, has very wide application prospect, can be expanded and extended on the basis of the problem, and is applied to other technical schemes related to relay node selection in a wireless communication system.
The following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings to facilitate understanding of the technical solutions of the invention.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a system model according to the present invention.
Detailed Description
The invention discloses a method for selecting a relay node in M2M communication based on reinforcement learning; the scheme of the invention is further explained below in conjunction with an embodiment.
The overall idea of the invention is as follows: exploiting the low mobility of M2M communication devices, the energy consumed when each relay node is selected for a single communication is calculated in the Decode-and-Forward mode, based on the channel condition from each relay node to the M2M communication devices and the channel condition from the source node to each relay node. The selection of a relay node is then defined as the Action in reinforcement learning, the energy consumed by a single communication is combined with the balance of the relay nodes' battery energies to serve as the Reward, and the remaining battery energy of each relay node together with the destination node of the next communication is taken as the state. The Q-Learning algorithm with a Softmax action selection strategy is then iterated until convergence.
As shown in FIG. 1 and FIG. 2, the method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system comprising a source node, relay nodes and M2M communication devices, and includes the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the framework of the reinforcement learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
Specifically, the S1 includes the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning; that is, the number of communication devices in the M2M communication system is denoted as N, the number of relay nodes is denoted as M, and the agent action is denoted as a ∈ A = {1, 2, …, M}, i.e., the number of the selected relay node.
S12, setting the remaining battery energy of each relay node and the destination node of the next communication as the state of reinforcement learning; that is, the remaining energy of the battery of the relay node with the number i is P_i, the number of the communication destination device is n, and the reinforcement learning state S is
S = [P_1, P_2, …, P_M, n].
S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system;
that is, the energy consumed by each relay node in a single communication selection process is calculated in the Decode-and-Forward (DF) mode according to those channel conditions; the energy consumption and the balance condition of the relay node batteries are then combined into the reward of reinforcement learning, the multi-objective problem is converted into a single-objective problem, and the desired balance point between the two objectives is adjusted through the parameters.
In S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as D_max, which represents the difference between the maximum and minimum battery energies among the relay nodes. The reward R of reinforcement learning is
R = r_0 - ξ_e E - δ ξ_d D_max,
wherein ξ_e and ξ_d respectively represent the weights occupied by the energy consumption and the energy balance; δ maps the energy consumption and the energy balance to a similar range; r_0 is a positive number that avoids negative reward values and can be adjusted according to the desired requirements on energy consumption and energy balance, its value typically lying in the range
max(ξ_e E + δ ξ_d D_max) < r_0 < 2*max(ξ_e E + δ ξ_d D_max).
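A minimal sketch of this reward, assuming E and D_max have already been obtained for the current communication; the weights, δ and r_0 shown are placeholder values, and in practice r_0 must be chosen inside the range given above so that the reward stays positive.

```python
# Sketch of R = r0 - xi_e*E - delta*xi_d*D_max with placeholder parameter values.

def reinforcement_reward(E: float, D_max: float,
                         xi_e: float = 0.5, xi_d: float = 0.5,
                         delta: float = 1.0, r0: float = 10.0) -> float:
    """xi_e + xi_d = 1; r0 should lie between max(xi_e*E + delta*xi_d*D_max)
    and twice that value so that the reward remains positive."""
    return r0 - xi_e * E - delta * xi_d * D_max
```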
Specifically, the S2 includes the following steps:
s21, performing bucketing according to the energy consumed by each relay node; that is, the energy consumed by each relay node in the single communication selection process obtained in S1 is bucketed, so that the energy consumption is reduced from a continuous range to a finite number of levels.
S22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance condition of the relay nodes; namely, the minimum battery energy among the relay nodes is recorded as P_min, the updated remaining battery energy of the relay node with the number i is recorded as p_i, and the reinforcement learning state S is modified to
S = [p_1, p_2, …, p_M, n],
wherein p_i = P_i - P_min.
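The sketch below illustrates S21 and S22 together: mapping a continuous per-relay energy cost to a small number of buckets, and building the state from the battery energies measured relative to their minimum. The bucket edges and the rounding granularity are assumptions, not values taken from the patent.

```python
# Sketch of S21 (bucketing the energy cost) and S22 (state relative to the minimum
# battery energy). Bucket edges and rounding are illustrative assumptions.
import bisect

ENERGY_BUCKET_EDGES = [1e-4, 1e-3, 1e-2]   # assumed bucket boundaries

def bucketize_energy(energy: float) -> int:
    """Map a continuous energy value to a finite bucket index (S21)."""
    return bisect.bisect_left(ENERGY_BUCKET_EDGES, energy)

def build_state(battery_levels, destination_id: int) -> tuple:
    """State S = (p_1, ..., p_M, n) with p_i = P_i - P_min (S22)."""
    p_min = min(battery_levels)
    return tuple(round(p - p_min, 3) for p in battery_levels) + (destination_id,)
```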
S23, updating the action selection strategy; namely, the value function Q(s, a) of the state-action pair in the Q-Learning algorithm is updated, the updating process being
Q_{t+1}(s, a) = (1 - α)Q_t(s, a) + α[R_t + γ max_{b∈A} Q_t(s', b)],
wherein s represents the current state, s' represents the state after the state transition, Q_t and Q_{t+1} respectively represent the Q values before and after the update, α represents the learning rate of reinforcement learning (a larger α means the update depends more on the immediate reward), R_t represents the reward R obtained in the current iteration, and γ is a discount factor representing the importance of future rewards, with a value usually less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated by adopting a Softmax function.
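A tabular sketch of the update and the Softmax policy described above. The Q-table is kept as a dictionary keyed by (state, action); the number of relays M, the learning rate, the discount factor and the temperature tau are hypothetical placeholder values.

```python
# Sketch of the Q-Learning value update and the Softmax action-selection policy.
import math
from collections import defaultdict

M = 5                          # example number of relay nodes
ACTIONS = tuple(range(M))      # action a = index of the selected relay node
Q = defaultdict(float)         # Q(s, a), initialised to 0

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q_{t+1}(s,a) = (1-alpha)*Q_t(s,a) + alpha*[R_t + gamma*max_b Q_t(s',b)]."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def softmax_policy(s, tau=1.0):
    """pi(s, a): probability of selecting each relay node in state s."""
    prefs = [Q[(s, a)] / tau for a in ACTIONS]
    peak = max(prefs)                              # subtract the max for numerical stability
    weights = [math.exp(p - peak) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]
```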
Specifically, the S3 includes the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξ_e and ξ_d are determined according to the desired trade-off between energy consumption and energy balance, with ξ_e + ξ_d = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations. It should be noted that if α is too small the model learns slowly and performs poorly in the early stage, while if it is too large the model may eventually fail to converge.
The probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data. After one communication is completed, the reward value R is calculated from the energy consumption E of that communication and the energy balance D_max after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function. When the next communication task arrives and the destination node to which a message is to be sent is determined, the state of the model is updated.
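Putting the pieces together, the loop below follows the iteration just described, but against a toy stand-in environment: the transmission step returns random energy figures, whereas in the real system they would come from the channel and battery model above. It reuses reinforcement_reward, softmax_policy, q_update and ACTIONS from the earlier sketches; every other name here is an illustrative assumption.

```python
# Sketch of the iterative training loop of S3 against a toy stand-in environment.
import random

def toy_transmission(state, action):
    """Stand-in for one real communication: random cost, imbalance and next destination."""
    E = random.uniform(0.0, 1.0)                      # energy consumed by this communication
    D_max = random.uniform(0.0, 1.0)                  # battery imbalance after it
    next_state = state[:-1] + (random.randrange(3),)  # same battery part, new destination id
    return E, D_max, next_state

def train(initial_state, K=10_000, alpha=0.1, gamma=0.9):
    s = initial_state
    for _ in range(K):
        probs = softmax_policy(s)
        a = random.choices(ACTIONS, weights=probs, k=1)[0]   # sample a relay from pi(s, .)
        E, D_max, s_next = toy_transmission(s, a)
        r = reinforcement_reward(E, D_max)
        q_update(s, a, r, s_next, alpha=alpha, gamma=gamma)
        s = s_next
```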
Specifically, in step S4,
if the reinforcement learning model has not converged after the set maximum number of iterations is reached, the learning rate is reduced and the iterative training is carried out again;
in the iteration process, before each message is sent, the expected cumulative reward obtainable by selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
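Once training has converged, the selection rule of S4 reduces to choosing, in the current state, the relay whose expected cumulative reward (Q value) is largest; the sketch below assumes the Q-table and ACTIONS from the earlier sketches.

```python
# Sketch of the deployment-time rule in S4: forward the message via the relay whose
# Q value (expected cumulative reward) is largest in the current state.
def select_relay(state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```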
By applying this model to the selection of relay nodes, the battery energies of the relay nodes can be kept balanced while consuming less energy; the expectations for the two objectives can also be adjusted according to the results, so that the desired balance point between energy consumption and energy balance is found.
In summary, the method for selecting a relay node in M2M communication based on reinforcement learning provided by the invention incorporates the energy consumption of each communication and the energy balance of the batteries powering the relay nodes into the reinforcement learning objective, so that the system strikes a balance between reducing energy consumption and achieving energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Meanwhile, the method of the invention enables the system to complete the selection of the relay node autonomously and has strong adaptability: even if new M2M devices join the system midway, the relay node selection scheme does not need to be reconfigured, which facilitates the practical application of the method.
In addition, the method also provides a reference for other related problems in the same field and has a very broad application prospect; it can be expanded and extended on the basis of this problem and applied to other technical schemes related to relay node selection in wireless communication systems.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that can be understood by those skilled in the art.

Claims (9)

1. A method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system, and is characterized by comprising the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the framework of the reinforcement learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
2. The method for selecting a relay node in reinforcement learning based M2M communication according to claim 1, wherein the S1 comprises the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning;
s12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning;
and S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system.
3. The method of claim 2, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
the S11 includes the steps of,
recording the number of communication devices in an M2M communication system as N and the number of relay nodes as M, the agent action is denoted as a ∈ A = {1, 2, …, M}, namely the number of the selected relay node;
the S12 includes the steps of,
when the remaining energy of the battery of the relay node with the number i is P_i, the number of the communication destination device is n, and the reinforcement learning state S is
S = [P_1, P_2, …, P_M, n];
The S13 includes the steps of,
according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system, the energy consumed by each relay node in the single communication selection process is calculated in the decode-and-forward mode; the energy consumption and the balance condition of the relay node batteries are then combined into the reward of reinforcement learning, the multi-objective problem is converted into a single-objective problem, and the desired balance point between the two objectives is adjusted through the parameters.
4. The method of claim 3, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
in S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as D_max, which represents the difference between the maximum and minimum battery energies among the relay nodes; the reward R of the reinforcement learning is
R = r_0 - ξ_e E - δ ξ_d D_max,
wherein ξ_e and ξ_d respectively represent the weights occupied by the energy consumption and the energy balance, δ maps the energy consumption and the energy balance to a similar range, and r_0 is a positive number with a value range of
max(ξ_e E + δ ξ_d D_max) < r_0 < 2*max(ξ_e E + δ ξ_d D_max).
5. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 4, wherein the S2 comprises the following steps:
s21, performing bucketing according to the energy consumed by each relay node;
s22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance condition of the relay nodes;
and S23, updating the action selection strategy.
6. The method of claim 5, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
the S21 includes the steps of,
performing bucketing according to the energy consumed by each relay node in the single communication selection process obtained in S1, so that the energy consumption is reduced from a continuous range to a finite number of levels;
the S22 includes the steps of,
let the minimum battery energy among the relay nodes be P_min, and the updated remaining battery energy of the relay node with the number i be denoted as p_i; the reinforcement learning state S is modified to
S = [p_1, p_2, …, p_M, n],
wherein p_i = P_i - P_min;
The S23 includes the steps of,
the value function Q(s, a) of the state-action pair in the Q-Learning algorithm is updated, the updating process being
Q_{t+1}(s, a) = (1 - α)Q_t(s, a) + α[R_t + γ max_{b∈A} Q_t(s', b)],
wherein s represents the current state, s' represents the state after the state transition, Q_t and Q_{t+1} respectively represent the Q values before and after the update, α represents the learning rate of reinforcement learning (a larger α means the update depends more on the immediate reward), R_t represents the reward R obtained in the current iteration, and γ is a discount factor representing the importance of future rewards, with a value less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated by adopting a Softmax function.
7. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 6, wherein the S3 comprises the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξ_e and ξ_d are determined according to the desired trade-off between energy consumption and energy balance, with ξ_e + ξ_d = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations.
8. The method of claim 7, wherein the S32 comprises:
the probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data; after one communication is completed, the reward value is calculated from the energy consumption of that communication and the energy balance condition after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function; when the next communication task arrives and the destination node to which a message is to be sent is determined, the state of the model is updated.
9. The method of claim 8, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises: in the above-mentioned S4, the method,
if the reinforcement learning model has not converged after the set maximum number of iterations is reached, the learning rate is reduced and the iterative training is carried out again;
in the iteration process, before each message is sent, the expected cumulative reward obtainable by selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
CN202010040442.4A 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning Active CN111246438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040442.4A CN111246438B (en) 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010040442.4A CN111246438B (en) 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111246438A true CN111246438A (en) 2020-06-05
CN111246438B CN111246438B (en) 2023-03-24

Family

ID=70877794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040442.4A Active CN111246438B (en) 2020-01-15 2020-01-15 Method for selecting relay node in M2M communication based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111246438B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106304239A (en) * 2016-08-12 2017-01-04 梁广俊 The relay selection method of energy acquisition multi-relay cooperation communication system
CN108282238A (en) * 2018-01-29 2018-07-13 厦门大学 A kind of relay selection method and device based on network coding technique
CN108712760A (en) * 2018-03-29 2018-10-26 北京邮电大学 High-throughput relay selection method based on random Learning Automata and fuzzy algorithmic approach
CN108848540A (en) * 2018-06-20 2018-11-20 中国联合网络通信集团有限公司 Relay node selecting method and system
CN109195207A (en) * 2018-07-19 2019-01-11 浙江工业大学 A kind of energy-collecting type wireless relay network througput maximization approach based on deeply study
CN110636523A (en) * 2019-09-20 2019-12-31 中南大学 Millimeter wave mobile backhaul link energy efficiency stabilization scheme based on Q learning
CN110650480A (en) * 2019-09-24 2020-01-03 南京邮电大学 Wireless network relay and interference node selection method facing safe communication

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUHAN SU et al.: "Cooperative Communications With Relay Selection Based on Deep Reinforcement Learning in Wireless Sensor Networks", IEEE Sensors Journal *
吴子秋: "Relay selection algorithm based on deep reinforcement learning in M2M communication" (M2M通信中基于深度强化学习的中继选择算法), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series *
宋夫静: "Relay selection scheme for cooperative communication based on reinforcement learning" (基于增强学习的协作通信中继选择方案), Information Technology and Informatization *
郑姗姗: "Research on machine-learning-based radio resource management in M2M communication" (M2M通信中基于机器学习的无线资源管理的研究), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112039767A (en) * 2020-08-11 2020-12-04 山东大学 Multi-data center energy-saving routing method and system based on reinforcement learning
CN113726664A (en) * 2021-08-03 2021-11-30 哈尔滨工程大学 Internet of vehicles node forwarding utility learning method based on double updating strategies
CN113780482A (en) * 2021-11-12 2021-12-10 中国科学院理化技术研究所 Intelligent detection method for abnormity of energy router

Also Published As

Publication number Publication date
CN111246438B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN111246438B (en) Method for selecting relay node in M2M communication based on reinforcement learning
CN110488861B (en) Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
Iqbal et al. Double deep Q-network-based energy-efficient resource allocation in cloud radio access network
CN108075975B (en) Method and system for determining route transmission path in Internet of things environment
CN111491358B (en) Adaptive modulation and power control system based on energy acquisition and optimization method
CN109769280B (en) WIFI intelligent prediction switching method based on machine learning
Esmaeili et al. CMML: Combined metaheuristic-machine learning for adaptable routing in clustered wireless sensor networks
CN113473480B (en) Improved reinforcement learning network coverage optimization method facing cellular network
CN111865474B (en) Wireless communication anti-interference decision method and system based on edge calculation
CN115065678A (en) Multi-intelligent-device task unloading decision method based on deep reinforcement learning
Lu et al. Learning deterministic policy with target for power control in wireless networks
CN117641542B (en) Method for online optimization of power consumption strategy of terminal of Internet of things
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
CN116185523A (en) Task unloading and deployment method
Xu et al. Energy consumption minimization for data collection from wirelessly-powered IoT sensors: Session-specific optimal design with DRL
Song et al. Fast-DRD: Fast decentralized reinforcement distillation for deadline-aware edge computing
Yang et al. On sampling time maximization in wireless powered Internet of Things
CN102077526A (en) Method, apparatus and computer program product for distributed information management
Paterova et al. Data-driven self-learning controller design approach for power-aware iot devices based on double q-learning strategy
CN116976523A (en) Distributed economic dispatching method based on partially observable reinforcement learning
CN110705756A (en) Electric power energy consumption optimization control method based on input convex neural network
Zhao et al. MEDIA: An incremental DNN based computation offloading for collaborative cloud-edge computing
CN114611823B (en) Optimized dispatching method and system for electricity-cold-heat-gas multi-energy-demand typical park
CN113938978B (en) Heterogeneous wireless sensor path finding method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant