CN111246438A - Method for selecting relay node in M2M communication based on reinforcement learning - Google Patents
- Publication number
- CN111246438A (application number CN202010040442.4A)
- Authority
- CN
- China
- Prior art keywords
- relay node
- reinforcement learning
- communication
- energy
- selecting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/70—Services for machine-to-machine communication [M2M] or machine type communication [MTC]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/04—Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
- H04W40/10—Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources based on available power or energy
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/12—Communication route or path selection, e.g. power-based or shortest path routing based on transmission quality or channel quality
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/22—Communication route or path selection, e.g. power-based or shortest path routing using selective relaying for reaching a BTS [Base Transceiver Station] or an access point
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a method for selecting a relay node in M2M communication based on reinforcement learning, which comprises the following steps: S1, building a reinforcement learning model framework and determining the action set, state set, and reward value of reinforcement learning; S2, refining the framework of the reinforcement learning model; S3, iterating the reinforcement learning model on the specific problem; and S4, adjusting the reinforcement learning model according to the training results and applying the adjusted model to select relay nodes. The invention incorporates the energy consumption of each communication and the energy balance of the batteries powering the relay nodes into the reinforcement learning objective, so that the system strikes a balance between reducing energy consumption and maintaining energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Description
Technical Field
The invention relates to a relay selection method, in particular to a relay node selection method in M2M communication based on reinforcement learning aiming at battery-powered relay selection, and belongs to the technical field of wireless communication.
Background
In recent years, with the continuous development of computer technology, Machine Learning (Machine Learning) has gradually become the core technology of artificial intelligence, and has attracted extensive attention from numerous researchers in the industry.
Specifically, machine learning can be roughly classified into supervised learning, unsupervised learning, and reinforcement learning according to whether the sample data in its database carry labels. Reinforcement learning is a learning method that takes environmental feedback as input and is guided by statistical and dynamic programming techniques. Its inspiration comes from behaviorist theory in psychology, namely how an organism, under the reward or penalty stimuli given by the environment, gradually forms expectations of those stimuli and develops habitual behaviors that yield the most benefit. Unlike the other two machine learning approaches described above, reinforcement learning focuses on the interaction of an agent with an uncertain environment, emphasizing how to act based on the environment so as to maximize the expected benefit. Reinforcement learning has been widely applied in many related fields; in the communication field it is typically applied to scenarios such as wireless resource allocation and relay selection.
In the field of wireless communication, relay technology has been one of the research hotspots in recent years. It arranges a number of relay nodes in a wireless communication network to assist in transmitting messages to the destination node, which effectively overcomes wireless channel fading, improves the communication quality of the wireless system, and expands the coverage of the communication system. Furthermore, the relay node mainly processes signals in one of two modes: amplify-and-forward and decode-and-forward. Four schemes are generally used to select the relay node, namely the maximum signal-to-noise-ratio scheme, the nearest-neighbor scheme, the best-worst-channel scheme, and the minimum-harmonic-mean scheme, and the basic reference for all of these algorithms is the channel condition among the source node, the relay nodes, and the destination node.
When the existing relay node selection methods are executed, technicians have found that the relay nodes with better channel conditions in the wireless communication network are selected frequently for information transmission, so the batteries powering those relay nodes are drained faster than the others, causing energy imbalance in the system and shortening the service life of the whole system.
In summary, how to provide a new relay node selection method based on the prior art to overcome the defects in the prior art as much as possible is a common research objective of those skilled in the art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a method for selecting a relay node in M2M communication based on reinforcement learning for battery-powered relay selection, which is as follows.
A method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system and comprises the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the frame of the reinforcement Learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
Preferably, the S1 includes the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning;
s12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning;
and S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system.
Preferably, the S11 includes the steps of,
recording the number of communication devices in the M2M communication system as N, the number of relay nodes as M, and the agent action as a ∈ A = {1, 2, …, M}, i.e. the number of the selected relay node;
the S12 includes the steps of,
when the remaining energy of the battery of the relay node with the number i is Pi, the number of the communication destination device is n, and the reinforcement learning state S is
S=[P1,P2,…,PM,n];
The S13 includes the steps of,
the energy consumed by each relay node in a single communication is calculated in the decode-and-forward mode according to the channel condition from each relay node to the communication device in the M2M communication system and the channel condition from the source node to each relay node; the energy consumption and the balance of the relay-node batteries are then combined as the reward of reinforcement learning, which converts the multi-objective problem into a single-objective problem, and the desired balance point between the two objectives is adjusted through parameter tuning.
Preferably, in S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as Dmax, the difference between the maximum and the minimum battery energy among the relay nodes; the reinforcement learning reward R is
R = r0 - ξe·E - δ·ξd·Dmax,
where ξe and ξd represent the weights given to energy consumption and energy balance respectively, δ maps the energy consumption and the energy balance to a similar range, and r0 is a positive number with the value range
max(ξe·E + δ·ξd·Dmax) < r0 < 2·max(ξe·E + δ·ξd·Dmax).
Preferably, the S2 includes the following steps:
s21, bucketing the energy consumed by each relay node;
s22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance of the relay nodes;
and S23, updating the action selection strategy.
Preferably, the S21 includes the steps of,
bucketing the energy consumed by each relay node in a single communication as obtained in S1, so that the energy consumption values are reduced from an infinite (continuous) range to a finite set;
the S22 includes the steps of,
denoting the minimum battery energy among the relay nodes as Pmin and the updated remaining battery energy of the relay node numbered i as pi, the reinforcement learning state S is modified to
S = [p1, p2, …, pM, n],
where pi = Pi - Pmin;
The S23 includes the steps of,
the value function Q(s, a) of each state-action pair in the Q-Learning algorithm is updated as
Qt+1(s, a) = (1 - α)·Qt(s, a) + α·[Rt + γ·max_{b∈A} Qt(s', b)],
where s denotes the current state, s' the state after the transition, and Qt and Qt+1 the Q values before and after the update; α is the learning rate of reinforcement learning, and a larger α makes the update depend more on the immediate reward; Rt is the reward R obtained in the current iteration; and γ is the discount factor, which represents the importance of future rewards and takes a value less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated using a Softmax function.
Preferably, the S3 includes the following steps:
s31, initializing the reinforcement learning model: the initial reward value R is 0, the initial action selection policy π(s, a) is 1/M, and the weights ξe and ξd are determined according to the desired trade-off between energy consumption and energy balance, with ξe + ξd = 1;
s32, determining the learning rate α of reinforcement learning, determining the maximum iteration number K, and performing multiple iterations.
Preferably, the S32 includes the following steps:
obtaining the probability of selecting each action from the action selection policy π(s, a) and selecting a relay node according to these probabilities to transmit the data; after one communication is completed, calculating the reward value from the energy consumed by that communication and the energy balance after it, and updating the value function Q(s, a); then computing a new action selection policy from the new value function using the Softmax function, and updating the state of the model when the next communication task arrives and the destination node of the message is determined.
Preferably, in said S4,
if the reinforcement learning model has not converged after the set maximum number of iterations, the learning rate is reduced and the iterative training is run again;
in each iteration, before a message is sent, the expected cumulative reward of selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
Compared with the prior art, the invention has the advantages that:
the invention provides a relay node selection method in M2M communication based on reinforcement learning, which brings the energy consumption condition of each communication and the energy balance condition of a battery for supplying energy to the relay node into the goal of reinforcement learning, so that the system is balanced in reducing energy consumption and realizing energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Meanwhile, the method of the invention enables the system to select relay nodes autonomously and is highly adaptive: even if new M2M devices join the system midway, the relay node selection scheme does not need to be reset, which facilitates the practical application of the method.
In addition, the method also provides reference for other related problems in the same field, has very wide application prospect, can be expanded and extended on the basis of the problem, and is applied to other technical schemes related to relay node selection in a wireless communication system.
The following detailed description of embodiments of the invention, given with reference to the accompanying drawings, is provided to facilitate understanding of the technical solutions of the invention.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a system model according to the present invention.
Detailed Description
The invention discloses a method for selecting a relay node in M2M communication based on reinforcement learning; the scheme of the invention is further explained below with reference to an embodiment.
The overall idea of the invention is as follows: exploiting the low mobility of M2M communication devices, the energy consumed by selecting each relay node for a single communication is calculated in the decode-and-forward mode from the channel condition between each relay node and the M2M communication device and the channel condition between the source node and each relay node. The relay node choices are then defined as the actions in reinforcement learning; the energy consumed by a single communication, combined with the balance of the relay-node battery energies, serves as the reward; and the remaining battery energy of each relay node together with the destination node of the next communication forms the state. The Q-Learning algorithm with a Softmax action selection policy is then iterated until convergence.
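As a rough illustration of how the per-communication energy under decode-and-forward relaying might be computed from channel conditions, consider the following minimal Python sketch. It is not taken from the patent: the SNR-threshold path-loss model, the parameter values, and the function name `df_energy` are all assumptions made for the example.

```python
def df_energy(g_sr: float, g_rd: float, snr_req: float = 10.0,
              noise: float = 1e-9, duration: float = 1e-3) -> float:
    """Estimate energy for one decode-and-forward relay transmission.

    Assumed model: each hop must reach a target SNR, so the transmit
    power needed on a hop with channel gain g is snr_req * noise / g.
    The total energy covers the source->relay hop and the relay's own
    forward hop (relay->destination), each lasting `duration` seconds.
    """
    p_src = snr_req * noise / g_sr   # power the source needs
    p_rly = snr_req * noise / g_rd   # power the relay needs to forward
    return (p_src + p_rly) * duration
```

Under this toy model, better channel gains directly translate into lower per-communication energy, which is the quantity the method later penalizes in the reward.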
As shown in fig. 1 to fig. 2, a method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system, where the system includes a source node, a relay node, and an M2M communication device, and the method includes the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the frame of the reinforcement Learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
Specifically, the S1 includes the following steps:
s11, selecting and defining the different relay nodes in the M2M communication system as the actions in reinforcement learning; that is, the number of communication devices in the M2M communication system is denoted as N, the number of relay nodes as M, and the agent action as a ∈ A = {1, 2, …, M}, i.e. the number of the selected relay node.
S12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning; that is, the remaining energy of the battery of the relay node with the number i is Pi, the number of the communication destination device is n, and the reinforcement learning state S is
S=[P1,P2,…,PM,n]。
S13, calculating energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system;
the method comprises the steps of calculating energy consumed by each relay node in a single communication selection process by adopting a Decode and Forward (DF) forwarding mode according to the channel condition from each relay node to a communication device in an M2M communication system and the channel condition from a source node to each relay node in the M2M communication system, then integrating the energy consumption and the balance condition of a battery of the relay node as a reward for reinforcement learning, converting a multi-objective problem into a single-objective problem, and completing adjustment of an intentional balance point of the multi-objective problem and the single-objective problem through parameter adjustment.
In S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as Dmax, the difference between the maximum and the minimum battery energy among the relay nodes; the reinforcement learning reward R is
R = r0 - ξe·E - δ·ξd·Dmax,
where ξe and ξd represent the weights given to energy consumption and energy balance respectively; δ maps the energy consumption and the energy balance to a similar range; and r0 is a positive number that keeps the reward from becoming negative and can be adjusted according to the desired requirements for energy consumption and energy balance, typically over the range
max(ξe·E + δ·ξd·Dmax) < r0 < 2·max(ξe·E + δ·ξd·Dmax).
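The reward above can be sketched in a few lines of Python. This is an illustration only: the function name, the default weights, and the placeholder value of r0 are assumptions, not part of the patent.

```python
def reward(energy_e: float, d_max: float, xi_e: float = 0.5,
           xi_d: float = 0.5, delta: float = 1.0, r0: float = 4.5) -> float:
    """R = r0 - xi_e*E - delta*xi_d*Dmax, per the description.

    xi_e + xi_d = 1 by assumption; r0 should satisfy
    max(penalty) < r0 < 2*max(penalty) over the scenario. The default
    4.5 is an illustrative placeholder, not a prescribed value.
    """
    penalty = xi_e * energy_e + delta * xi_d * d_max
    return r0 - penalty
```

Raising xi_e steers the agent toward low-energy relays, while raising xi_d steers it toward keeping the battery levels even, which is exactly the trade-off the parameter tuning in S13 adjusts.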
Specifically, the S2 includes the following steps:
s21, performing barrel distribution according to the energy consumed by each relay node; that is, the energy consumed by each relay node in the single communication selection process obtained in S1 is bucketized, so that the energy consumption is reduced from an infinite dimension to a finite dimension.
S22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance of the relay nodes; that is, the minimum battery energy among the relay nodes is denoted as Pmin and the updated remaining battery energy of the relay node numbered i is denoted as pi, and the reinforcement learning state S is modified to
S = [p1, p2, …, pM, n],
where pi = Pi - Pmin.
S23, updating the action selection policy; that is, the value function Q(s, a) of each state-action pair in the Q-Learning algorithm is updated as
Qt+1(s, a) = (1 - α)·Qt(s, a) + α·[Rt + γ·max_{b∈A} Qt(s', b)],
where s denotes the current state, s' the state after the transition, and Qt and Qt+1 the Q values before and after the update; α is the learning rate of reinforcement learning, and a larger α makes the update depend more on the immediate reward; Rt is the reward R obtained in the current iteration; and γ is the discount factor, which represents the importance of future rewards and usually takes a value less than 1;
the policy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection policy is updated using a Softmax function.
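The Q-value update and the Softmax policy described in S23 can be sketched as follows. This is a generic Q-Learning/Softmax implementation consistent with the formulas above; the dictionary representation of Q and the temperature parameter `tau` are assumptions for the example.

```python
import math

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-Learning step:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_b Q(s',b))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def softmax_policy(Q, s, actions, tau=1.0):
    """Action probabilities pi(s, .) from a Softmax over Q(s, .);
    tau is an assumed temperature parameter."""
    prefs = [Q.get((s, a), 0.0) / tau for a in actions]
    m = max(prefs)                        # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]
```

The Softmax keeps every relay selectable with nonzero probability during training, so the agent continues to explore relays whose Q values are currently low.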
Specifically, the S3 includes the following steps:
S31, initializing the reinforcement learning model: the initial reward value R is set to 0, the initial action selection policy π(s, a) is set to 1/M, and the weights ξe and ξd are determined according to the desired trade-off between energy consumption and energy balance, with ξe + ξd = 1;
S32, determining the learning rate α of reinforcement learning and the maximum number of iterations K, and performing multiple iterations. It should be noted that if α is too small, the model learns slowly and performs poorly early on, while if it is too large, the model may fail to converge.
The probability of selecting each action is obtained from the action selection policy π(s, a), and a relay node is selected according to these probabilities to transmit the data. After one communication is completed, the reward value R is calculated from the energy consumption E of that communication and the energy balance Dmax after it, and the value function Q(s, a) is updated; a new action selection policy is then computed from the new value function using the Softmax function, and when the next communication task arrives and the destination node of the message is determined, the state of the model is updated.
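The iteration loop described above can be sketched end to end in Python. Everything scenario-specific here is an assumption made for the example: the number of relays, the hyper-parameters, the random stand-in for the per-communication energy cost, the reward constants, and the simplified state that omits the destination-node index.

```python
import math
import random

random.seed(0)

M = 3                               # number of relay nodes (assumed)
actions = list(range(M))
Q = {}                              # value function Q(s, a)
alpha, gamma, K = 0.1, 0.9, 200     # illustrative hyper-parameters
battery = [1.0] * M                 # remaining battery energy per relay

def state(battery):
    # State: battery levels minus the minimum (the energy-balance
    # feature from S22), rounded so the state set stays finite.
    # The destination-node index is omitted here for brevity.
    p_min = min(battery)
    return tuple(round(p - p_min, 2) for p in battery)

for _ in range(K):
    s = state(battery)
    # Softmax action selection over the current Q values.
    prefs = [Q.get((s, a), 0.0) for a in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]
    a = random.choices(actions, weights=weights)[0]
    # Stand-in energy model: a random per-communication cost.
    cost = random.uniform(0.001, 0.003)
    battery[a] -= cost
    d_max = max(battery) - min(battery)
    r = 0.01 - 0.5 * cost - 0.5 * d_max    # R = r0 - xi_e*E - delta*xi_d*Dmax
    s2 = state(battery)
    best = max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best)
```

Because the reward penalizes the spread between the fullest and emptiest battery, selections tend to rotate among the relays rather than repeatedly draining the one with the best channel.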
Specifically, in the step S4,
if the reinforcement learning model has not converged after the set maximum number of iterations, the learning rate is reduced and the iterative training is run again;
in each iteration, before a message is sent, the expected cumulative reward of selecting each relay node is calculated from the current state, and the relay node with the largest expected reward is selected to forward the message.
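After training, the greedy selection step just described reduces to an argmax over the learned values. A minimal sketch, assuming the same dictionary representation of Q as above:

```python
def select_relay(Q, s, actions):
    """Greedy deployment rule: pick the relay whose learned value
    Q(s, a) (the expected cumulative reward) is largest; ties go to
    the first action in the list."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```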
Applying the model to select relay nodes ensures that the battery energy of the relay nodes stays balanced while consuming relatively little energy; the weights of the two objectives can be adjusted according to the results so as to find the desired balance point between energy consumption and energy balance.
In summary, according to the selection method of the relay node in M2M communication based on reinforcement learning provided by the present invention, the energy consumption of each communication and the energy balance of the battery supplying energy to the relay node are all included in the reinforcement learning target, so that the system is balanced in reducing energy consumption and realizing energy balance, energy imbalance is avoided, and the service life of the whole system is prolonged.
Meanwhile, the method of the invention enables the system to select relay nodes autonomously and is highly adaptive: even if new M2M devices join the system midway, the relay node selection scheme does not need to be reset, which facilitates the practical application of the method.
In addition, the method also provides reference for other related problems in the same field, has very wide application prospect, can be expanded and extended on the basis of the problem, and is applied to other technical schemes related to relay node selection in a wireless communication system.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (9)
1. A method for selecting a relay node in M2M communication based on reinforcement learning is applied to an M2M communication system, and is characterized by comprising the following steps:
s1, building a reinforcement learning model framework, and determining an action set, a state set and a reward value of reinforcement learning;
s2, refining the frame of the reinforcement Learning model by adopting a Q-Learning algorithm;
s3, determining hyper-parameters in the Q-Learning algorithm according to specific problems, and performing iterative loop on the reinforcement Learning model;
and S4, adjusting the reinforcement learning model according to the training result, and selecting the relay node by applying the adjusted reinforcement learning model.
2. The method for selecting a relay node in reinforcement learning based M2M communication according to claim 1, wherein the S1 comprises the following steps:
s11, selecting and defining different relay nodes in the M2M communication system as actions in reinforcement learning;
s12, setting the residual energy of the battery of each relay node and the target node of the next communication as the state of reinforcement learning;
and S13, calculating the energy consumed by each relay node in the single communication selection process according to the channel condition from each relay node to the communication equipment in the M2M communication system and the channel condition from the source node to each relay node in the M2M communication system.
3. The method of claim 2, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
the S11 includes the steps of,
recording the number of communication devices in the M2M communication system as N, the number of relay nodes as M, and the agent action as a ∈ A = {1, 2, …, M}, i.e. the number of the selected relay node;
the S12 includes the steps of,
when the remaining energy of the battery of the relay node with the number i is Pi, the number of the communication destination device is n, and the reinforcement learning state S is
S=[P1,P2,...,PM,n];
The S13 includes the steps of,
the energy consumed by each relay node in a single communication is calculated in the decode-and-forward mode according to the channel condition from each relay node to the communication device in the M2M communication system and the channel condition from the source node to each relay node; the energy consumption and the balance of the relay-node batteries are then combined as the reward of reinforcement learning, which converts the multi-objective problem into a single-objective problem, and the desired balance point between the two objectives is adjusted through parameter tuning.
4. The method of claim 3, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
in S13, the energy consumption of the relay node is denoted as E, and the energy balance of the relay nodes is denoted as Dmax, the difference between the maximum and the minimum battery energy among the relay nodes; the reinforcement learning reward R is
R = r0 - ξe·E - δ·ξd·Dmax,
where ξe and ξd represent the weights given to energy consumption and energy balance respectively, δ maps the energy consumption and the energy balance to a similar range, and r0 is a positive number with the value range
max(ξe·E + δ·ξd·Dmax) < r0 < 2·max(ξe·E + δ·ξd·Dmax).
5. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 4, wherein the S2 comprises the following steps:
s21, bucketing the energy consumed by each relay node;
s22, taking the minimum battery energy among the relay nodes and subtracting it from the energy features in the reinforcement learning state, so that the result reflects the energy balance of the relay nodes;
and S23, updating the action selection strategy.
6. The method of claim 5, wherein the method for selecting the relay node in the reinforcement learning based M2M communication comprises:
the S21 includes the steps of:
bucketing the energy consumed by each relay node in the single communication selection process obtained in S1, so that the energy consumption is reduced from a continuous (infinite) range to a finite number of discrete levels;
the S22 includes the steps of,
let the minimum energy of the battery of each relay node be PminThe updated remaining energy of the battery of the relay node with the number i is denoted as piModifying the reinforcement learning state S to
S=[p1,p2,...,pM,n],
Wherein p isi=Pi-Pmin;
The S23 includes the steps of:
updating the value function Q(s, a) of the state-action pair with the Q-Learning algorithm, the update rule being

Qt+1(s, a) = (1 - α)Qt(s, a) + α[Rt + γ·max(b∈A) Qt(s', b)],

where s denotes the current state, s' the state after the transition, Qt and Qt+1 the Q values before and after the update, α the learning rate of the reinforcement learning (a larger α makes the update depend more on the immediate reward), Rt the reward obtained in the current iteration, and γ < 1 the discount factor, representing the importance of future rewards;
the strategy π(s, a) in the Q-Learning algorithm represents the probability of selecting action a in state s, and the action selection strategy is updated with a Softmax function.
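A minimal tabular sketch of the update rule and Softmax policy in claim 6; the dictionary state encoding and the temperature tau are illustrative assumptions:

```python
import math

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Claim 6 update: Q(s,a) <- (1-a)Q(s,a) + a[R + g*max_b Q(s',b)].
    Q is a dict keyed by (state, action); missing entries default to 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def softmax_policy(Q, s, actions, tau=1.0):
    """pi(s,a): probability of each action, from a Softmax over Q(s, .)."""
    prefs = [Q.get((s, a), 0.0) / tau for a in actions]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the Softmax keeps every probability strictly positive, relays with lower current Q values are still occasionally explored rather than starved.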
7. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 6, wherein the S3 comprises the following steps:
S31, initializing the reinforcement learning model: setting the initial reward value R to 0, setting the initial action selection policy π(s, a) to 1/M, and determining the weights ξe and ξd according to the desired trade-off between energy consumption and energy balance, with ξe + ξd = 1;
S32, determining the learning rate α of the reinforcement learning, determining the maximum number of iterations K, and performing multiple iterations.
8. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 7, wherein the S32 comprises:
obtaining the probability of selecting each action according to the action selection strategy π(s, a), and selecting a relay node according to these probabilities to transmit data; after one communication is completed, calculating the reward value from the energy consumption of that communication and the energy balance condition after it, and updating the value function Q(s, a); then computing a new action selection strategy from the new value function with the Softmax function; and, once the target node of the next message is determined for the next communication task, updating the state of the model.
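Putting claims 7 and 8 together, a self-contained toy training loop might look like the following. The per-relay energy costs are synthetic stand-ins for the channel-dependent decode-and-forward costs of S1, and the single shared state, temperature, and reward constant are all simplifying assumptions:

```python
import math
import random

def train(num_relays=3, iters=1000, alpha=0.1, gamma=0.9, tau=20.0, seed=0):
    """S32 sketch: repeatedly choose a relay via the Softmax policy,
    observe an energy-based reward, and update its Q value."""
    rng = random.Random(seed)
    # Hypothetical fixed energy cost of relaying through each node.
    cost = [rng.uniform(1.0, 5.0) for _ in range(num_relays)]
    Q = [0.0] * num_relays  # a single state is kept for brevity
    for _ in range(iters):
        m = max(Q)  # Softmax probabilities pi(s, a) over the current Q
        exps = [math.exp((q - m) / tau) for q in Q]
        total = sum(exps)
        probs = [e / total for e in exps]
        a = rng.choices(range(num_relays), weights=probs)[0]
        r = 10.0 - cost[a]  # reward: constant minus energy spent
        Q[a] = (1 - alpha) * Q[a] + alpha * (r + gamma * max(Q))
    return Q, cost
```

With enough iterations and a temperature high enough to keep exploring, the cheapest relay ends up with the highest value estimate.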
9. The method for selecting a relay node in M2M communication based on reinforcement learning of claim 8, wherein the S4 comprises:
if the reinforcement learning model has not converged after the set maximum number of iterations, reducing the learning rate and performing the iterative training again;
during iteration, before each message is sent, computing from the current state the expected cumulative reward of selecting each relay node, and selecting the relay node with the largest expected reward to forward the message.
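The final selection step of S4 then reduces to a greedy lookup over the learned value function; a sketch, assuming a Q table keyed by (state, action) pairs as above:

```python
def select_relay(Q, state, num_relays):
    """S4: before sending a message, pick the relay whose expected
    cumulative reward Q(state, a) is largest; unseen pairs count as 0."""
    return max(range(num_relays), key=lambda a: Q.get((state, a), 0.0))
```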
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010040442.4A CN111246438B (en) | 2020-01-15 | 2020-01-15 | Method for selecting relay node in M2M communication based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111246438A true CN111246438A (en) | 2020-06-05 |
CN111246438B CN111246438B (en) | 2023-03-24 |
Family
ID=70877794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010040442.4A Active CN111246438B (en) | 2020-01-15 | 2020-01-15 | Method for selecting relay node in M2M communication based on reinforcement learning |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112039767A (en) * | 2020-08-11 | 2020-12-04 | 山东大学 | Multi-data center energy-saving routing method and system based on reinforcement learning |
CN113726664A (en) * | 2021-08-03 | 2021-11-30 | 哈尔滨工程大学 | Internet of vehicles node forwarding utility learning method based on double updating strategies |
CN113780482A (en) * | 2021-11-12 | 2021-12-10 | 中国科学院理化技术研究所 | Intelligent detection method for abnormity of energy router |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106304239A (en) * | 2016-08-12 | 2017-01-04 | LIANG Guangjun | The relay selection method of energy acquisition multi-relay cooperation communication system
CN108282238A (en) * | 2018-01-29 | 2018-07-13 | 厦门大学 | A kind of relay selection method and device based on network coding technique |
CN108712760A (en) * | 2018-03-29 | 2018-10-26 | 北京邮电大学 | High-throughput relay selection method based on random Learning Automata and fuzzy algorithmic approach |
CN108848540A (en) * | 2018-06-20 | 2018-11-20 | 中国联合网络通信集团有限公司 | Relay node selecting method and system |
CN109195207A (en) * | 2018-07-19 | 2019-01-11 | 浙江工业大学 | A kind of energy-collecting type wireless relay network throughput maximization approach based on deep reinforcement learning
CN110636523A (en) * | 2019-09-20 | 2019-12-31 | 中南大学 | Millimeter wave mobile backhaul link energy efficiency stabilization scheme based on Q learning |
CN110650480A (en) * | 2019-09-24 | 2020-01-03 | 南京邮电大学 | Wireless network relay and interference node selection method facing safe communication |
Non-Patent Citations (4)
Title |
---|
YUHAN SU et al.: "Cooperative Communications With Relay Selection Based on Deep Reinforcement Learning in Wireless Sensor Networks", IEEE Sensors Journal * |
WU Ziqiu: "Relay selection algorithm based on deep reinforcement learning in M2M communication", China Master's Theses Full-text Database (Electronic Journals), Information Science and Technology * |
SONG Fujing: "Relay selection scheme for cooperative communication based on reinforcement learning", Information Technology and Informatization * |
ZHENG Shanshan: "Research on machine-learning-based radio resource management in M2M communication", China Master's Theses Full-text Database (Electronic Journals), Information Science and Technology * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111246438B (en) | Method for selecting relay node in M2M communication based on reinforcement learning | |
CN110488861B (en) | Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle | |
CN110809306B (en) | Terminal access selection method based on deep reinforcement learning | |
Iqbal et al. | Double deep Q-network-based energy-efficient resource allocation in cloud radio access network | |
CN108075975B (en) | Method and system for determining route transmission path in Internet of things environment | |
CN111491358B (en) | Adaptive modulation and power control system based on energy acquisition and optimization method | |
CN109769280B (en) | WIFI intelligent prediction switching method based on machine learning | |
Esmaeili et al. | CMML: Combined metaheuristic-machine learning for adaptable routing in clustered wireless sensor networks | |
CN113473480B (en) | Improved reinforcement learning network coverage optimization method facing cellular network | |
CN111865474B (en) | Wireless communication anti-interference decision method and system based on edge calculation | |
CN115065678A (en) | Multi-intelligent-device task unloading decision method based on deep reinforcement learning | |
Lu et al. | Learning deterministic policy with target for power control in wireless networks | |
CN117641542B (en) | Method for online optimization of power consumption strategy of terminal of Internet of things | |
CN114710439B (en) | Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning | |
CN116185523A (en) | Task unloading and deployment method | |
Xu et al. | Energy consumption minimization for data collection from wirelessly-powered IoT sensors: Session-specific optimal design with DRL | |
Song et al. | Fast-DRD: Fast decentralized reinforcement distillation for deadline-aware edge computing | |
Yang et al. | On sampling time maximization in wireless powered Internet of Things | |
CN102077526A (en) | Method, apparatus and computer program product for distributed information management | |
Paterova et al. | Data-driven self-learning controller design approach for power-aware iot devices based on double q-learning strategy | |
CN116976523A (en) | Distributed economic dispatching method based on partially observable reinforcement learning | |
CN110705756A (en) | Electric power energy consumption optimization control method based on input convex neural network | |
Zhao et al. | MEDIA: An incremental DNN based computation offloading for collaborative cloud-edge computing | |
CN114611823B (en) | Optimized dispatching method and system for electricity-cold-heat-gas multi-energy-demand typical park | |
CN113938978B (en) | Heterogeneous wireless sensor path finding method based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||