WO2023179010A1 - User grouping and resource allocation method and apparatus in a NOMA-MEC system - Google Patents

User grouping and resource allocation method and apparatus in a NOMA-MEC system

Info

Publication number
WO2023179010A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
reinforcement learning
deep reinforcement
network
channel
Prior art date
Application number
PCT/CN2022/127173
Other languages
English (en)
French (fr)
Inventor
赵莎莎
秦立丹
张登银
孙晨辉
文清
陈瑞杰
刘雨凡
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学
Publication of WO2023179010A1 publication Critical patent/WO2023179010A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/53Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453Resources in frequency domain, e.g. a carrier in FDMA
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • the invention belongs to the field of mobile communications and deep reinforcement learning, and specifically relates to a computing offloading method and device in a NOMA-MEC system based on hybrid deep reinforcement learning.
  • NOMA: Non-Orthogonal Multiple Access
  • the purpose of the present invention is to propose a user grouping and resource allocation method in the NOMA-MEC system based on hybrid deep reinforcement learning, which can schedule resources in real time in the dynamic NOMA-MEC system to maximize system energy efficiency.
  • the present invention provides a user grouping and resource allocation method in the NOMA-MEC system based on hybrid deep reinforcement learning, which includes the following steps:
  • Step 1 Describe the NOMA-MEC system.
  • Step 2 Define the energy efficiency of the system
  • Step 3 Describe the optimization problem
  • Step 4 Define the state space of deep reinforcement learning and the action space of deep reinforcement learning
  • Step 5 Construct a hybrid deep reinforcement learning network; the input of the network is the state and the output is the action;
  • Step 6 Input each time slot status into the hybrid deep reinforcement learning network to generate actions
  • Step 7 Train the hybrid deep reinforcement learning network
  • Step 8 Repeat steps 6 and 7 until the number of repetitions reaches the specified number of time slots T, and output the actions generated at this time, that is, the decisions to be optimized: user grouping, calculation offloading, and bandwidth allocation ratio.
  • methods to describe the NOMA-MEC system include:
  • the NOMA-MEC system consists of K user equipment and a single-antenna base station connected to the edge server, and all users have only a single transmitting antenna to establish a communication link with the base station.
  • the total system bandwidth B is divided into N orthogonal sub-channels, and the proportion of the bandwidth of sub-channel n to the total bandwidth is τ_n; the user set {1, 2, ..., K} and the orthogonal sub-channel set {1, 2, ..., N} are defined, with K ≤ 2N;
  • the method for defining the energy efficiency of the system in Step 2 includes:
  • Step 2.1) The energy efficiency Y of the system is defined as the sum of the ratios of all users' computation rates to their computation power, where:
  • R_i,off represents the computation rate at which user i offloads its computing task to the edge server;
  • p_i is the transmission power of user i, which does not change over time and is the same for all users;
  • R_i,local represents the computation rate at which user i executes its task locally, and p_i,local represents the power of local execution at user i.
  • the optimization problem in step 3 is described as maximizing the system energy efficiency subject to constraints on the user grouping, the offloading decisions, and the bandwidth allocation ratios;
  • the method of defining the state space and the action space of deep reinforcement learning in step 4 includes: the state space consists of the channel gains of all users on all sub-channels, and the action space consists of a continuous part a_c (the bandwidth allocation ratios) and a discrete part a_d (the sub-channel allocation scheme);
  • step 5 the method of building a hybrid deep reinforcement learning network in step 5 includes:
  • the hybrid deep reinforcement network includes a continuous layer deep reinforcement learning network and a discrete layer deep reinforcement learning network; the continuous layer deep reinforcement learning network is DDPG, and the discrete layer deep reinforcement learning network is DQN.
  • step 6 the method of inputting each time slot state into the hybrid deep reinforcement learning network to generate actions includes:
  • Step 6.1) Input the system state into the hybrid deep reinforcement learning network; the continuous action a_c (bandwidth allocation ratios) is generated by the Actor network of DDPG, and the discrete action a_d (user grouping) is generated by the DQN network;
  • the method of training the hybrid deep reinforcement learning network in step 7 includes:
  • the DDPG network and the DQN network share status and reward values;
  • the DDPG network and the DQN network sample D samples from the experience pool for training and update their own parameters.
  • the present invention provides a device for user grouping and resource allocation in a NOMA-MEC system based on hybrid deep reinforcement learning, which includes the following modules:
  • System description module used to describe the NOMA-MEC system
  • Efficiency definition module used to define the energy efficiency of the system
  • Problem description module used to describe optimization problems
  • Space definition module used to define the state space of deep reinforcement learning and the action space of deep reinforcement learning
  • Network building module used to build a hybrid deep reinforcement learning network; the input of the network is the state and the output is the action;
  • Action generation module used to input each time slot state into the hybrid deep reinforcement learning network to generate actions
  • Network training module used to train hybrid deep reinforcement learning networks
  • Output module: after the number of training repetitions reaches the specified number of time slots T, outputs the actions generated at this time, that is, the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratio.
  • the present invention provides a user grouping and resource allocation device in a NOMA-MEC system based on hybrid deep reinforcement learning, including a processor and a storage medium; the storage medium is used to store instructions;
  • the processor is configured to operate according to the instructions to perform the steps of the method described in the first aspect.
  • the present invention proposes a novel hybrid deep reinforcement learning algorithm.
  • This algorithm can solve problems that have both a discrete action space and a continuous action space, and dynamically determines the sub-channel allocation, computation offloading decisions, and bandwidth allocation scheme in real time according to the system state, so as to maximize the long-term energy efficiency of the system.
  • the main problem solved is that the algorithm determines bandwidth allocation proportions, user grouping conditions, and task offloading decisions based on time-varying channel conditions;
  • the present invention uses the proposed method to determine user grouping conditions, calculation offloading decisions, and bandwidth allocation ratios to maximize the ratio of the system's calculation rate to the power consumed.
  • the method of the present invention can make optimal decisions in a dynamic environment, and the proposed hybrid deep reinforcement learning method can overcome the shortcomings of a single deep reinforcement learning method that cannot handle tasks with both continuous action space and discrete action space.
  • Figure 1 is a schematic diagram of the system network of the present invention
  • Figure 2 is a flow chart of the hybrid deep reinforcement learning algorithm.
  • this embodiment shows the user grouping and resource allocation method in the NOMA-MEC system based on hybrid deep reinforcement learning.
  • the method includes the following steps:
  • Step 1 Describe the NOMA-MEC system.
  • Step 2 Define the energy efficiency of the system.
  • Step 3 Describe the optimization problem.
  • Step 4 Define the state space of deep reinforcement learning and define the action space of deep reinforcement learning.
  • Step 5 Build a hybrid deep reinforcement learning network.
  • Step 6 Input each time slot state into the hybrid deep reinforcement learning network to generate actions.
  • Step 7 Train the hybrid deep reinforcement learning network
  • Step 8 Repeat steps 6 and 7 until the number of repetitions reaches the specified number of time slots T; the algorithm then terminates and outputs the action at this time according to the constructed algorithm model; the action is the decision to be optimized by the present invention: user grouping, computation offloading, and bandwidth allocation ratio.
  • the method of describing the NOMA-MEC system in step 1 includes:
  • the NOMA-MEC system consists of K user equipment and a single-antenna base station connected to the edge server, and all users have only a single transmitting antenna to establish a communication link with the base station.
  • the total system bandwidth B is divided into N orthogonal sub-channels, and the proportion of the bandwidth of sub-channel n to the total bandwidth is τ_n; the user set {1, 2, ..., K} and the orthogonal sub-channel set {1, 2, ..., N} are defined, with K ≤ 2N.
  • the channel gain remains constant within a time slot and changes between different time slots.
  • the present invention restricts each channel to carry the signals of at most two users simultaneously, and a user sends signals on only one channel within a time slot; m_nk = 1 indicates that channel n is allocated to user k to send signals, and m_nk = 0 indicates that channel n is not allocated to user k.
  • the method for defining the energy efficiency of the system in Step 2 includes:
  • Step 2.1) The energy efficiency Y of the system is defined as the sum of the ratios of all users' computation rates to their computation power, where:
  • R_i,off represents the computation rate at which user i offloads its computing task to the edge server;
  • p_i is the transmission power of user i, which does not change over time and is the same for all users;
  • R_i,local represents the computation rate at which user i executes its task locally, and p_i,local represents the power of local execution at user i;
  • according to successive interference cancellation, the base station decodes in descending order of the users' channel gains, which gives the offloading rate of user i and the offloading rate of user j, where N_0 is the noise power spectral density.
  • the optimization problem in step 3 is described as maximizing the system energy efficiency subject to constraints on the user grouping, the offloading decisions, and the bandwidth allocation ratios.
  • the method of defining the state space and action space of deep reinforcement learning in step 4 includes:
  • the method of building a hybrid deep reinforcement learning network in step 5 includes:
  • Step 5.1) Construct a hybrid deep reinforcement learning network.
  • the hybrid deep reinforcement network consists of two layers.
  • the continuous layer deep reinforcement learning network is DDPG.
  • the discrete layer deep reinforcement learning network is DQN.
  • the DDPG network consists of the Actor current network, Actor target network, Critic current network and Critic target network.
  • the four network parameters are θ_DDPG, θ'_DDPG, ω_DDPG and ω'_DDPG, in that order.
  • the function of the Actor network is to output action decisions based on the input state.
  • the function of the Critic network is to estimate the value of an action taken by the Actor network in a certain state - the Q value, and guide the action selection in the next state.
  • the DQN network consists of the DQN current network and the DQN target network.
  • the parameters of the two networks are ω_DQN and ω'_DQN respectively; the neural networks are constructed, and the DDPG network parameters, the DQN network parameters, and the experience pool capacities E_DQN and E_TD3 are initialized.
  • step 6 the method of inputting each time slot state into the hybrid deep reinforcement learning network to generate actions includes:
  • the system status is input into the hybrid deep reinforcement learning network, the a_c bandwidth allocation ratio is generated by the DDPG Actor network, and the a_d user grouping situation is generated by the DQN network.
  • once the channel allocation plan, that is, the user grouping m_nk, and the bandwidth allocation ratio τ_n are determined, maximizing the system computation efficiency is decomposed into maximizing the computation efficiency Y_n of each channel;
  • the matrix X is initialized to a zero matrix at each time step.
  • (x_{n,i}, x_{n,j}) has 4 possible values, namely (0,0), (1,0), (0,1) and (1,1); the four combinations are substituted into the above expression, the combination that maximizes Y_n is selected, and the value at the corresponding position of X is reset.
  • the method of training hybrid deep reinforcement learning network in step 7 includes:
  • the DDPG network and the DQN network share status and reward values.
  • the cost function is as follows:
  • D is the sample size taken from the experience pool.
  • the weights of the DQN current network are copied to the DQN target network.
  • the update of the Actor's current network weight depends on the Q value of the Critic's current network.
  • the Actor's current network updates its network parameters in the direction of obtaining greater cumulative rewards.
  • the update formula of the Actor's current network parameters is as follows:
  • DDPG uses soft update method to update the target network parameters.
  • the soft update formula is as follows:
  • the soft-update coefficient τ is generally taken as 0.001.
  • Step 8 Repeat steps 6 and 7 until the number of repetitions reaches the total number of timeslots T, thus stopping the algorithm.
  • the present invention establishes a NOMA-MEC system and proposes a new sub-channel allocation, calculation offloading decision, and bandwidth allocation scheme based on hybrid deep reinforcement learning to maximize the long-term energy efficiency of the system.
  • This embodiment provides a device for user grouping and resource allocation in the NOMA-MEC system based on hybrid deep reinforcement learning, which includes the following modules:
  • System description module used to describe the NOMA-MEC system
  • Efficiency definition module used to define the energy efficiency of the system
  • Problem description module used to describe optimization problems
  • Space definition module used to define the state space of deep reinforcement learning and the action space of deep reinforcement learning
  • Network building module used to build a hybrid deep reinforcement learning network; the input of the network is the state and the output is the action;
  • Action generation module used to input each time slot state into the hybrid deep reinforcement learning network to generate actions
  • Network training module used to train hybrid deep reinforcement learning networks
  • Output module: after the number of training repetitions reaches the specified number of time slots T, outputs the actions generated at this time, that is, the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratio.
  • the device of this embodiment can be used to implement the method described in Embodiment 1.
  • This embodiment provides a user grouping and resource allocation device in the NOMA-MEC system based on hybrid deep reinforcement learning, including a processor and a storage medium;
  • the storage medium is used to store instructions
  • the processor is configured to operate according to the instructions to execute the steps of the method described in Embodiment 1.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The present invention discloses a user grouping and resource allocation method and apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning. The proposed hybrid deep reinforcement learning algorithm uses DDPG to optimize continuous actions and DQN to optimize discrete actions, thereby solving the difficulty that deep reinforcement learning cannot easily handle hybrid problems with both discrete and continuous action spaces. Specifically, the algorithm determines the bandwidth allocation, offloading decisions, and sub-channel allocation (user grouping) of the user devices according to the users' channel states, so as to maximize the ratio of the system's computation rate to the consumed power. The algorithm adapts well to the dynamic characteristics of the environment and effectively improves the energy efficiency and spectrum resource utilization of the system.

Description

User grouping and resource allocation method and apparatus in a NOMA-MEC system
Technical Field
The present invention belongs to the field of mobile communications and deep reinforcement learning, and specifically relates to a computation offloading method and apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning.
Background Art
With the remarkable increase in the number of smart devices, a large number of user devices generate massive amounts of data that need to be processed. However, due to their size limitations, smart devices have scarce computing and energy resources, which poses a great challenge to meeting service demands. To improve task-processing efficiency and satisfy service demands, Mobile Edge Computing (MEC) has therefore emerged. In addition, the explosive growth of data traffic has created an urgent need for massive access and a severe shortage of spectrum resources; Non-Orthogonal Multiple Access (NOMA) in fifth-generation (5G) communications is an effective solution to these problems. Consequently, research on NOMA-MEC technology has attracted wide attention in recent years.
Most existing studies on user grouping and resource allocation strategies in NOMA-MEC systems rely on traditional optimization methods, for example obtaining the optimal solution through the convergence of an iterative algorithm or a sub-optimal solution through a heuristic algorithm. These methods either have excessively high computational complexity or can only obtain sub-optimal solutions and, more importantly, lack the ability to adapt to dynamic systems.
Summary of the Invention
The purpose of the present invention is to propose a user grouping and resource allocation method in a NOMA-MEC system based on hybrid deep reinforcement learning, which can schedule resources in real time in a dynamic NOMA-MEC system to maximize the system energy efficiency.
To achieve the above purpose, the present invention adopts the following technical solution:
The present invention provides a user grouping and resource allocation method in a NOMA-MEC system based on hybrid deep reinforcement learning, comprising the following steps:
Step 1: describe the NOMA-MEC system; the system operates in a time-slotted manner, with the set of time slots denoted Γ = {1, 2, ..., T};
Step 2: define the energy efficiency of the system;
Step 3: describe the optimization problem;
Step 4: define the state space and the action space of deep reinforcement learning;
Step 5: construct a hybrid deep reinforcement learning network, the input of the network being the state and the output being the action;
Step 6: input the state of each time slot into the hybrid deep reinforcement learning network to generate actions;
Step 7: train the hybrid deep reinforcement learning network;
Step 8: repeat steps 6 and 7 until the number of repetitions reaches the specified number of time slots T, then output the actions generated at this time, i.e. the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratios.
Further, the method of describing the NOMA-MEC system includes:
The NOMA-MEC system consists of K user devices and a single-antenna base station connected to an edge server, and every user has only a single transmitting antenna to establish a communication link with the base station; the system operates in a time-slotted manner, with the set of time slots denoted Γ = {1, 2, ..., T};
The total system bandwidth B is divided into N orthogonal sub-channels, the proportion of the bandwidth of sub-channel n to the total bandwidth being τ_n; the user set {1, 2, ..., K} and the orthogonal sub-channel set {1, 2, ..., N} are defined, with K ≤ 2N;
The whole process is divided into time slots, Γ = {1, 2, ..., T}; the channel gain remains constant within one time slot and varies between time slots; h_nk denotes the channel gain from user k to the base station on channel n, and it is assumed that h_n1 < h_n2 < ... < h_nK for n ∈ [1, N];
Each channel is restricted to carry the signals of at most two users simultaneously, and a user sends signals on only one channel within a time slot; m_nk = 1 indicates that channel n is allocated to user k to send signals, and m_nk = 0 indicates that channel n is not allocated to user k.
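For illustration only, the grouping constraints just described can be checked with a minimal Python sketch; the sizes, variable names and the example pairing below are illustrative assumptions rather than values from the filing.

```python
import numpy as np

N_CHANNELS, K_USERS = 3, 6   # illustrative sizes with K = 2N

# m[n, k] = 1 if sub-channel n is allocated to user k in the current slot, else 0.
m = np.zeros((N_CHANNELS, K_USERS), dtype=int)
for n, pair in enumerate([(0, 1), (2, 3), (4, 5)]):   # example pairing: two users per channel
    for k in pair:
        m[n, k] = 1

def grouping_is_valid(m: np.ndarray) -> bool:
    """Each channel carries at most two users; each user transmits on at most one channel."""
    return bool((m.sum(axis=1) <= 2).all() and (m.sum(axis=0) <= 1).all())

print(grouping_is_valid(m))  # True for the example pairing above
```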
Further, the method of defining the energy efficiency of the system in step 2 includes:
Step 2.1) The energy efficiency Y of the system is defined as the sum of the ratios of all users' computation rates to their computation power, where R_i,off denotes the computation rate at which user i offloads its computing task to the edge server, p_i is the transmission power of user i, which does not change over time and is the same for all users; R_i,local denotes the computation rate at which user i executes its task locally, and p_i,local denotes the power of local execution at user i; x_ni = 1 means that user i offloads its task to the edge server through channel n, and x_ni = 0 means that user i does not offload its task through a channel;
Step 2.2) Because the channel gain h_ni of user i on channel n is greater than the channel gain h_nj of user j, the base station decodes in descending order of the users' channel gains according to successive interference cancellation, which gives the offloading rate of user i and the offloading rate of user j, where N_0 is the noise power spectral density;
Step 2.3) The local computation rates of user i and user j are determined by their CPU processing capabilities f_i and f_j and by the number of CPU cycles required to process 1 bit of task data; the local computation powers of user i and user j are p_i,local = νf_i^3 and p_j,local = νf_j^3 respectively, where ν is the effective capacitance coefficient of the chip architecture of the user device.
Further, the optimization problem in step 3 is described as maximizing the system energy efficiency Y over the user grouping m_nk, the offloading decisions x_ni, and the bandwidth allocation ratios τ_n, subject to the corresponding grouping, offloading and bandwidth constraints.
Further, the method of defining the state space and the action space of deep reinforcement learning in step 4 includes:
Step 4.1) The state space s is s = {h_11, h_12, ..., h_1K, h_21, h_22, ..., h_2K, ..., h_N1, ..., h_NK};
Step 4.2) The action space a consists of two stages, a = {a_c, a_d}, where a_c = {τ_1, τ_2, ..., τ_N} is the continuous action representing the system bandwidth allocation ratios, and a_d = {m_11, m_12, ..., m_1K, ..., m_N1, m_N2, ..., m_NK, x_11, x_12, ..., x_1K, ..., x_N1, x_N2, ..., x_NK} is the discrete action representing the sub-channel allocation scheme.
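A small sketch of how the state s and the two-part action a = {a_c, a_d} defined above could be laid out in code; the dimensions, the random channel gains, and the assumption that the bandwidth ratios sum to one are illustrative and not taken from the filing.

```python
import numpy as np

N, K = 3, 6                                  # illustrative numbers of sub-channels and users
rng = np.random.default_rng(0)

h = rng.rayleigh(scale=1e-3, size=(N, K))    # placeholder channel gains h_nk for one slot
s = h.flatten()                              # state s = {h_11, ..., h_NK}

a_c = rng.dirichlet(np.ones(N))              # continuous action: tau_1..tau_N (assumed to sum to 1)
m = np.zeros((N, K), dtype=int)              # discrete action part: grouping m_nk
x = np.zeros((N, K), dtype=int)              # discrete action part: offload decisions x_nk
a_d = np.concatenate([m.ravel(), x.ravel()])
a = {"a_c": a_c, "a_d": a_d}
print(s.shape, a["a_c"].shape, a["a_d"].shape)
```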
Further, the method of constructing the hybrid deep reinforcement learning network in step 5 includes:
The hybrid deep reinforcement network includes a continuous-layer deep reinforcement learning network and a discrete-layer deep reinforcement learning network; the continuous-layer network is DDPG and the discrete-layer network is DQN.
Further, the method of inputting the state of each time slot into the hybrid deep reinforcement learning network to generate actions in step 6 includes:
Step 6.1) The system state is input into the hybrid deep reinforcement learning network; the Actor network of DDPG generates the continuous action a_c (the bandwidth allocation ratios) and the DQN network generates the discrete action a_d (the user grouping);
Step 6.2) After the user grouping and the bandwidth allocation ratios are determined, maximizing the system energy efficiency is decomposed into maximizing the energy efficiency Y_n of each channel; the matrix X is initialized to a zero matrix at each time step; (x_{n,i}, x_{n,j}) has four possible values, namely (0,0), (1,0), (0,1) and (1,1), where the value of x determines the offloading decision: 0 means the computing task of the user device is not offloaded to the edge server and 1 means it is offloaded to the edge server for execution; the four combinations are substituted in turn into the decomposed expression for Y_n, the combination that maximizes Y_n is selected, and the value at the corresponding position of X is reset accordingly.
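A minimal sketch of the per-channel search in step 6.2; the efficiency function below assumes the per-user rate/power terms of Y described in step 2, gives both paired users the same local-computing parameters for brevity, and uses placeholder numbers.

```python
from itertools import product

def channel_efficiency(x_i, x_j, r_i_off, r_j_off, r_loc, p_loc, p):
    # Y_n taken as the sum of the two paired users' rate/power ratios on channel n.
    term_i = x_i * r_i_off / p + (1 - x_i) * r_loc / p_loc
    term_j = x_j * r_j_off / p + (1 - x_j) * r_loc / p_loc
    return term_i + term_j

def best_offload_pair(r_i_off, r_j_off, r_loc, p_loc, p=0.1):
    """Try the four (x_ni, x_nj) combinations and keep the one that maximises Y_n."""
    candidates = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)
    return max(candidates,
               key=lambda xy: channel_efficiency(*xy, r_i_off, r_j_off, r_loc, p_loc, p))

print(best_offload_pair(r_i_off=2e6, r_j_off=1e6, r_loc=1e6, p_loc=1e-3))
```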
Further, the method of training the hybrid deep reinforcement learning network in step 7 includes:
In state s, after executing the action a = (a_c, a_d), the base station obtains the immediate reward r fed back by the environment and obtains the state s' of the next time slot;
The sample (s, a_c, r, s') is stored in the DDPG experience pool and the sample (s, a_d, r, s') is stored in the DQN experience pool; the DDPG network and the DQN network share the state and the reward value;
The DDPG network and the DQN network each sample D transitions from their experience pools for training and update their own parameters.
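A small sketch of the shared-experience storage described above, where the same state, reward and next state are written to both pools, paired with the continuous action for DDPG and the discrete action for DQN; the pool capacities and the minibatch size D are illustrative (the DDPG pool capacity is written E_TD3 in the text).

```python
import random
from collections import deque

ddpg_pool = deque(maxlen=10_000)   # capacity E_TD3 (used here for the DDPG pool)
dqn_pool = deque(maxlen=10_000)    # capacity E_DQN
D = 64                             # minibatch size sampled by each learner

def store_transition(s, a_c, a_d, r, s_next):
    """Both learners see the same (s, r, s'); only the stored action part differs."""
    ddpg_pool.append((s, a_c, r, s_next))
    dqn_pool.append((s, a_d, r, s_next))

def sample(pool, d=D):
    return random.sample(pool, min(d, len(pool)))
```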
In a second aspect, the present invention provides a user grouping and resource allocation apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning, comprising the following modules:
System description module: used to describe the NOMA-MEC system;
Efficiency definition module: used to define the energy efficiency of the system;
Problem description module: used to describe the optimization problem;
Space definition module: used to define the state space and the action space of deep reinforcement learning;
Network construction module: used to construct the hybrid deep reinforcement learning network; the input of the network is the state and the output is the action;
Action generation module: used to input the state of each time slot into the hybrid deep reinforcement learning network to generate actions;
Network training module: used to train the hybrid deep reinforcement learning network;
Output module: after the number of training repetitions reaches the specified number of time slots T, outputs the actions generated at this time, i.e. the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratios.
In a third aspect, the present invention provides a user grouping and resource allocation apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning, comprising a processor and a storage medium; the storage medium is used to store instructions;
The processor is configured to operate according to the instructions to perform the steps of the method of the first aspect.
Compared with the prior art, the beneficial effects achieved by the present invention are:
1. Based on the NOMA-MEC system, the present invention proposes a novel hybrid deep reinforcement learning algorithm that can solve problems with both a discrete action space and a continuous action space, and dynamically determines, in real time and according to the system state, the sub-channel allocation, computation offloading decisions, and bandwidth allocation scheme so as to maximize the long-term energy efficiency of the system. The main problem solved is that the algorithm determines the bandwidth allocation ratios, user grouping, and task offloading decisions according to the time-varying channel conditions;
2. In the NOMA-MEC scenario, the present invention uses the proposed method to determine the user grouping, computation offloading decisions, and bandwidth allocation ratios so as to maximize the ratio of the system's computation rate to the consumed power;
3. The method of the present invention can make optimal decisions in a dynamic environment, and the proposed hybrid deep reinforcement learning method overcomes the shortcoming of a single deep reinforcement learning method, which cannot handle tasks with both a continuous action space and a discrete action space.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the system network of the present invention;
Figure 2 is a flow chart of the hybrid deep reinforcement learning algorithm.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawings. The following embodiments are only intended to explain the technical solution of the present invention more clearly and are not to be construed as limiting the scope of protection of the present invention.
Embodiment 1:
With reference to Figure 1, this embodiment describes the user grouping and resource allocation method in a NOMA-MEC system based on hybrid deep reinforcement learning. The method includes the following steps:
Step 1: describe the NOMA-MEC system; the system operates in a time-slotted manner, with the set of time slots denoted Γ = {1, 2, ..., T};
Step 2: define the energy efficiency of the system;
Step 3: describe the optimization problem;
Step 4: define the state space and the action space of deep reinforcement learning;
Step 5: construct the hybrid deep reinforcement learning network;
Step 6: input the state of each time slot into the hybrid deep reinforcement learning network to generate actions;
Step 7: train the hybrid deep reinforcement learning network;
Step 8: repeat steps 6 and 7 until the number of repetitions reaches the specified number of time slots T; the algorithm then terminates and the action at this time is output according to the constructed algorithm model; the action is the decision to be optimized by the present invention: user grouping, computation offloading, and bandwidth allocation ratios.
Specifically, the method of describing the NOMA-MEC system in step 1 includes:
Step 1.1) The NOMA-MEC system consists of K user devices and a single-antenna base station connected to an edge server, and every user has only a single transmitting antenna to establish a communication link with the base station. The total system bandwidth B is divided into N orthogonal sub-channels, the proportion of the bandwidth of sub-channel n to the total bandwidth being τ_n; the user set {1, 2, ..., K} and the orthogonal sub-channel set {1, 2, ..., N} are defined, with K ≤ 2N. The present invention divides the whole process into time slots, Γ = {1, 2, ..., T}. The channel gain remains constant within one time slot and varies between time slots; h_nk denotes the channel gain from user k to the base station on channel n, and it is assumed that h_n1 < h_n2 < ... < h_nK.
In the power-domain NOMA scenario, multiple users can transmit signals on the same sub-channel simultaneously. To avoid excessive interference between the users on a sub-channel, the present invention restricts each channel to carry the signals of at most two users simultaneously, and a user sends signals on only one channel within a time slot; m_nk = 1 indicates that channel n is allocated to user k to send signals, and m_nk = 0 indicates that channel n is not allocated to user k.
Specifically, the method of defining the energy efficiency of the system in step 2 includes:
Step 2.1) The energy efficiency Y of the system is defined as the sum of the ratios of all users' computation rates to their computation power (for ease of notation, the time slot index t is omitted), where R_i,off denotes the computation rate at which user i offloads its computing task to the edge server, p_i is the transmission power of user i, which does not change over time and is the same for all users; R_i,local denotes the computation rate at which user i executes its task locally, and p_i,local denotes the power of local execution at user i; x_ni = 1 means that user i offloads its task to the edge server through channel n, and x_ni = 0 means that user i does not offload its task through a channel.
Step 2.2) Because the channel gain h_ni of user i on channel n is greater than the channel gain h_nj of user j, the base station decodes in descending order of the users' channel gains according to successive interference cancellation, which gives the offloading rate of user i and the offloading rate of user j, where N_0 is the noise power spectral density.
Step 2.3) The local computation rates of user i and user j are determined by their CPU processing capabilities f_i and f_j and by the number of CPU cycles required to process 1 bit of task data; the local computation powers of user i and user j are p_i,local = νf_i^3 and p_j,local = νf_j^3 respectively, where ν is the effective capacitance coefficient of the chip architecture of the user device.
Specifically, the optimization problem in step 3 is described as maximizing the system energy efficiency Y over the user grouping m_nk, the offloading decisions x_ni, and the bandwidth allocation ratios τ_n, subject to the corresponding grouping, offloading and bandwidth constraints.
Specifically, the method of defining the state space and the action space of deep reinforcement learning in step 4 includes:
Step 4.1) The state space s is s = {h_11, h_12, ..., h_1K, h_21, h_22, ..., h_2K, ..., h_N1, ..., h_NK}.
Step 4.2) The action space a consists of two stages, a = {a_c, a_d}, where a_c = {τ_1, τ_2, ..., τ_N} is the continuous action representing the system bandwidth allocation ratios, and a_d = {m_11, m_12, ..., m_1K, ..., m_N1, m_N2, ..., m_NK, x_11, x_12, ..., x_1K, ..., x_N1, x_N2, ..., x_NK} is the discrete action representing the sub-channel allocation scheme.
Specifically, the method of constructing the hybrid deep reinforcement learning network in step 5 includes:
Step 5.1) Construct the hybrid deep reinforcement learning network; the hybrid deep reinforcement network consists of two layers: the continuous-layer deep reinforcement learning network is DDPG and the discrete-layer deep reinforcement learning network is DQN.
Step 5.2) The DDPG network consists of the current Actor network, the target Actor network, the current Critic network and the target Critic network, whose parameters are θ_DDPG, θ'_DDPG, ω_DDPG and ω'_DDPG, in that order. The role of the Actor network is to output action decisions according to the input state; the role of the Critic network is to estimate the value (the Q value) of an action taken by the Actor network in a given state and to guide the action selection in the next state. The DQN network consists of the current DQN network and the target DQN network, whose parameters are ω_DQN and ω'_DQN respectively. The neural networks are constructed, and the DDPG network parameters, the DQN network parameters, and the experience pool capacities E_DQN and E_TD3 are initialized.
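A minimal PyTorch sketch of the six networks of step 5.2 (current and target Actor, Critic and DQN). The layer sizes, activations and the number of discrete actions are not specified in the filing and are assumptions here, as is the softmax output used so that the bandwidth ratios lie in (0, 1).

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, N, K = 18, 3, 6           # s = N*K channel gains; illustrative sizes
A_C_DIM = N                           # continuous action: bandwidth ratios tau_1..tau_N
N_DISCRETE_ACTIONS = 16               # illustrative size of the discrete grouping action set

class Actor(nn.Module):               # outputs bandwidth ratios in (0, 1)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, A_C_DIM), nn.Softmax(dim=-1))
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):              # estimates Q(s, a_c)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + A_C_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, s, a_c):
        return self.net(torch.cat([s, a_c], dim=-1))

class DQN(nn.Module):                 # one Q value per discrete grouping action
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, N_DISCRETE_ACTIONS))
    def forward(self, s):
        return self.net(s)

actor, critic, dqn = Actor(), Critic(), DQN()
actor_t, critic_t, dqn_t = copy.deepcopy(actor), copy.deepcopy(critic), copy.deepcopy(dqn)
```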
Specifically, the method of inputting the state of each time slot into the hybrid deep reinforcement learning network to generate actions in step 6 includes:
The system state is input into the hybrid deep reinforcement learning network; the Actor network of DDPG generates the continuous action a_c (the bandwidth allocation ratios) and the DQN network generates the discrete action a_d (the user grouping). Given the channel allocation scheme, i.e. the user grouping m_nk, and the bandwidth allocation ratios τ_n, maximizing the system computation efficiency is decomposed into maximizing the computation efficiency Y_n of each channel.
The matrix X is initialized to a zero matrix at each time step. (x_{n,i}, x_{n,j}) has four possible values, namely (0,0), (1,0), (0,1) and (1,1). The four combinations are substituted in turn into the decomposed expression, the combination that maximizes Y_n is selected, and the value at the corresponding position of X is reset.
Specifically, the method of training the hybrid deep reinforcement learning network in step 7 includes:
In state s, after executing the action a = {a_c, a_d}, the base station obtains the immediate reward r fed back by the environment and obtains the state s' of the next time slot. The sample (s, a_c, r, s') is stored in the DDPG experience pool and the sample (s, a_d, r, s') is stored in the DQN experience pool; the DDPG network and the DQN network share the state and the reward value.
During the operation of the hybrid deep reinforcement learning network, the networks need to be trained continuously, adjusting the network parameters to improve their function-fitting ability, so that the algorithm can always output reasonable action decisions in a dynamically changing state environment.
The parameters of the current DQN network are adjusted by minimizing the cost function of the neural network over the D samples drawn from the experience pool.
After a certain number of updates of the current DQN network, the weights of the current DQN network are copied to the target DQN network.
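A sketch of the DQN update just described; the cost function of the filing is given as a formula image, so the sketch assumes the usual temporal-difference target with a discount factor, and all hyper-parameters and sizes are illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 18, 16         # illustrative sizes
dqn = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))
dqn_t = copy.deepcopy(dqn)            # target DQN network
optimizer = torch.optim.Adam(dqn.parameters(), lr=1e-3)
gamma, copy_every = 0.9, 100          # discount factor and target-copy period (placeholders)

def dqn_update(s, a_d, r, s_next, step):
    """One gradient step on a minibatch of D transitions sampled from the DQN pool."""
    q = dqn(s).gather(1, a_d.unsqueeze(1)).squeeze(1)          # Q(s, a_d) of the taken actions
    with torch.no_grad():
        target = r + gamma * dqn_t(s_next).max(dim=1).values   # TD target from the target net
    loss = F.mse_loss(q, target)                                # cost averaged over the D samples
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % copy_every == 0:                                  # periodic hard copy to the target net
        dqn_t.load_state_dict(dqn.state_dict())
    return loss.item()

# usage with random placeholder data
D = 64
dqn_update(torch.randn(D, STATE_DIM), torch.randint(0, N_ACTIONS, (D,)),
           torch.randn(D), torch.randn(D, STATE_DIM), step=100)
```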
The parameters of the DDPG networks are updated as shown in equations (0.2), (0.3), (0.4) and (0.5). The current Critic network parameters are updated first. The update of the current Actor network weights depends on the Q value of the current Critic network: the current Actor network updates its parameters in the direction of obtaining a larger cumulative reward.
Unlike the DQN algorithm, which directly copies the current DQN network parameters to the target DQN network, DDPG uses a soft update to update the target network parameters, where the soft-update coefficient τ is generally taken as 0.001.
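A sketch of the soft update: each target parameter moves a small step of size τ toward the corresponding current parameter, with τ = 0.001 as stated above; the tiny network below only stands in for the Actor/Critic networks.

```python
import copy
import torch
import torch.nn as nn

net = nn.Linear(4, 2)                 # stands in for a current Actor/Critic network
target = copy.deepcopy(net)           # its target copy
TAU = 0.001

def soft_update(current: nn.Module, tgt: nn.Module, tau: float = TAU) -> None:
    with torch.no_grad():
        for p, p_t in zip(current.parameters(), tgt.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)   # theta' <- tau*theta + (1-tau)*theta'

soft_update(net, target)
```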
Step 8: repeat steps 6 and 7 until the number of repetitions equals the total number of time slots T, at which point the algorithm stops.
In summary, the present invention establishes a NOMA-MEC system and, based on hybrid deep reinforcement learning, proposes a new sub-channel allocation, computation offloading decision, and bandwidth allocation scheme to maximize the long-term energy efficiency of the system.
It should be noted that the embodiment described above is only a specific implementation of the present invention, and the scope of protection of the present invention is not limited thereto; any substitution, improvement, etc. based on the present invention shall fall within the claims of the present invention.
Embodiment 2:
This embodiment provides a user grouping and resource allocation apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning, comprising the following modules:
System description module: used to describe the NOMA-MEC system;
Efficiency definition module: used to define the energy efficiency of the system;
Problem description module: used to describe the optimization problem;
Space definition module: used to define the state space and the action space of deep reinforcement learning;
Network construction module: used to construct the hybrid deep reinforcement learning network; the input of the network is the state and the output is the action;
Action generation module: used to input the state of each time slot into the hybrid deep reinforcement learning network to generate actions;
Network training module: used to train the hybrid deep reinforcement learning network;
Output module: after the number of training repetitions reaches the specified number of time slots T, outputs the actions generated at this time, i.e. the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratios.
The apparatus of this embodiment can be used to implement the method described in Embodiment 1.
Embodiment 3:
This embodiment provides a user grouping and resource allocation apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning, comprising a processor and a storage medium;
The storage medium is used to store instructions;
The processor is configured to operate according to the instructions to perform the steps of the method described in Embodiment 1.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

  1. A user grouping and resource allocation method in a NOMA-MEC system based on hybrid deep reinforcement learning, characterized by comprising the following steps:
    Step 1: describe the NOMA-MEC system; the system operates in a time-slotted manner, with the set of time slots denoted Γ = {1, 2, ..., T};
    Step 2: define the energy efficiency of the system;
    Step 3: describe the optimization problem;
    Step 4: define the state space and the action space of deep reinforcement learning;
    Step 5: construct a hybrid deep reinforcement learning network, the input of the network being the state and the output being the action;
    Step 6: input the state of each time slot into the hybrid deep reinforcement learning network to generate actions;
    Step 7: train the hybrid deep reinforcement learning network;
    Step 8: repeat steps 6 and 7 until the number of repetitions reaches the specified number of time slots T, then output the actions generated at this time, i.e. the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratios.
  2. The user grouping and resource allocation method according to claim 1, characterized in that the method of describing the NOMA-MEC system includes:
    The NOMA-MEC system consists of K user devices and a single-antenna base station connected to an edge server, and every user has only a single transmitting antenna to establish a communication link with the base station; the system operates in a time-slotted manner, with the set of time slots denoted Γ = {1, 2, ..., T};
    The total system bandwidth B is divided into N orthogonal sub-channels, the proportion of the bandwidth of sub-channel n to the total bandwidth being τ_n;
    K = {1, 2, ..., K} and N = {1, 2, ..., N} denote the user set and the orthogonal sub-channel set respectively, with K ≤ 2N;
    The whole process is divided into time slots, Γ = {1, 2, ..., T}; the channel gain remains constant within one time slot and varies between time slots; h_nk, n ∈ N, k ∈ K denotes the channel gain from user k to the base station on channel n, and it is assumed that h_n1 < h_n2 < ... < h_nK, n ∈ [1, N];
    Each channel is restricted to carry the signals of at most two users simultaneously, and a user sends signals on only one channel within a time slot; m_nk = 1 indicates that channel n is allocated to user k to send signals, and m_nk = 0 indicates that channel n is not allocated to user k.
  3. The user grouping and resource allocation method according to claim 2, characterized in that the method of defining the energy efficiency of the system includes:
    Step 2.1) The energy efficiency Y of the system is defined as the sum of the ratios of all users' computation rates to their computation power, where R_i,off denotes the computation rate at which user i offloads its computing task to the edge server, p_i is the transmission power of user i, which does not change over time and is the same for all users; R_i,local denotes the computation rate at which user i executes its task locally, and p_i,local denotes the power of local execution at user i; x_ni = 1 means that user i offloads its task to the edge server through channel n, and x_ni = 0 means that user i does not offload its task through a channel;
    Step 2.2) Because the channel gain h_ni of user i on channel n is greater than the channel gain h_nj of user j, the base station decodes in descending order of the users' channel gains according to successive interference cancellation, which gives the offloading rate of user i and the offloading rate of user j, where N_0 is the noise power spectral density;
    Step 2.3) The local computation rates of user i and user j are determined by their CPU processing capabilities f_i and f_j and by the number of CPU cycles required to process 1 bit of task data; the local computation powers of user i and user j are p_i,local = νf_i^3 and p_j,local = νf_j^3 respectively, where ν is the effective capacitance coefficient of the chip architecture of the user device.
  4. The user grouping and resource allocation method according to claim 1, characterized in that the optimization problem is described as maximizing the system energy efficiency over the user grouping, the offloading decisions, and the bandwidth allocation ratios, subject to the corresponding grouping, offloading and bandwidth constraints.
  5. The user grouping and resource allocation method according to claim 4, characterized in that the method of defining the state space and the action space of deep reinforcement learning includes:
    Step 4.1) The state space s is s = {h_11, h_12, ..., h_1K, h_21, h_22, ..., h_2K, ..., h_N1, ..., h_NK};
    Step 4.2) The action space a consists of two stages, a = {a_c, a_d}, where a_c = {τ_1, τ_2, ..., τ_N} is the continuous action representing the system bandwidth allocation ratios, and a_d = {m_11, m_12, ..., m_1K, ..., m_N1, m_N2, ..., m_NK, x_11, x_12, ..., x_1K, ..., x_N1, x_N2, ..., x_NK} is the discrete action representing the sub-channel allocation scheme.
  6. The user grouping and resource allocation method according to claim 1, characterized in that the method of constructing the hybrid deep reinforcement learning network includes:
    The hybrid deep reinforcement network includes a continuous-layer deep reinforcement learning network and a discrete-layer deep reinforcement learning network; the continuous-layer network is DDPG and the discrete-layer network is DQN.
  7. The user grouping and resource allocation method according to claim 6, characterized in that the method of inputting the state of each time slot into the hybrid deep reinforcement learning network to generate actions includes:
    Step 6.1) The system state is input into the hybrid deep reinforcement learning network; the Actor network of DDPG generates the continuous action a_c (the bandwidth allocation ratios) and the DQN network generates the discrete action a_d (the user grouping);
    Step 6.2) After the user grouping m_nk and the bandwidth allocation ratios τ_n are determined, maximizing the system energy efficiency is decomposed into maximizing the energy efficiency Y_n of each channel;
    The matrix X is initialized to a zero matrix at each time step; (x_{n,i}, x_{n,j}) has four possible values, namely (0,0), (1,0), (0,1) and (1,1), where the value of x determines the offloading decision: 0 means the computing task of the user device is not offloaded to the edge server and 1 means it is offloaded to the edge server for execution; the four combinations are substituted in turn into the decomposed expression for Y_n, the combination that maximizes Y_n is selected, and the value at the corresponding position of X is reset.
  8. The user grouping and resource allocation method according to claim 7, characterized in that the method of training the hybrid deep reinforcement learning network includes:
    In state s, after executing the action a = {a_c, a_d}, the base station obtains the immediate reward r fed back by the environment and obtains the state s' of the next time slot;
    The sample (s, a_c, r, s') is stored in the DDPG experience pool and the sample (s, a_d, r, s') is stored in the DQN experience pool; the DDPG network and the DQN network share the state and the reward value;
    The DDPG network and the DQN network each sample D transitions from their experience pools for training and update their own parameters.
  9. A user grouping and resource allocation apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning, characterized by comprising the following modules:
    System description module: used to describe the NOMA-MEC system;
    Efficiency definition module: used to define the energy efficiency of the system;
    Problem description module: used to describe the optimization problem;
    Space definition module: used to define the state space and the action space of deep reinforcement learning;
    Network construction module: used to construct the hybrid deep reinforcement learning network; the input of the network is the state and the output is the action;
    Action generation module: used to input the state of each time slot into the hybrid deep reinforcement learning network to generate actions;
    Network training module: used to train the hybrid deep reinforcement learning network;
    Output module: after the number of training repetitions reaches the specified number of time slots T, outputs the actions generated at this time, i.e. the decisions to be optimized: user grouping, computation offloading, and bandwidth allocation ratios.
  10. A user grouping and resource allocation apparatus in a NOMA-MEC system based on hybrid deep reinforcement learning, characterized by comprising a processor and a storage medium;
    The storage medium is used to store instructions;
    The processor is configured to operate according to the instructions to perform the steps of the method of any one of claims 1 to 8.
PCT/CN2022/127173 2022-03-22 2022-10-25 User grouping and resource allocation method and apparatus in a NOMA-MEC system WO2023179010A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210282489.0A CN114885420A (zh) 2022-03-22 2022-03-22 User grouping and resource allocation method and apparatus in a NOMA-MEC system
CN202210282489.0 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023179010A1 true WO2023179010A1 (zh) 2023-09-28

Family

ID=82668092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127173 WO2023179010A1 (zh) 2022-03-22 2022-10-25 一种noma-mec***中的用户分组和资源分配方法及装置

Country Status (2)

Country Link
CN (1) CN114885420A (zh)
WO (1) WO2023179010A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114885420A (zh) * 2022-03-22 2022-08-09 南京邮电大学 一种noma-mec***中的用户分组和资源分配方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113543342A (zh) * 2021-07-05 2021-10-22 南京信息工程大学滨江学院 基于noma-mec强化学习资源分配与任务卸载方法
CN113778691A (zh) * 2021-11-10 2021-12-10 清华大学 一种任务迁移决策的方法、装置及***
CN114885420A (zh) * 2022-03-22 2022-08-09 南京邮电大学 一种noma-mec***中的用户分组和资源分配方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO CHEN;XIAODONG WANG: "Decentralized computation offloading for multi-user mobile edge computing: a deep reinforcement learning approach", EURASIP JOURNAL ON WIRELESS COMMUNICATIONS AND NETWORKING, BIOMED CENTRAL LTD, LONDON, UK, vol. 2020, no. 1, 29 September 2020 (2020-09-29), London, UK , pages 1 - 21, XP021282292, DOI: 10.1186/s13638-020-01801-6 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117177275A (zh) * 2023-11-03 2023-12-05 中国人民解放军国防科技大学 基于scma-mec的物联网设备计算速率优化方法
CN117177275B (zh) * 2023-11-03 2024-01-30 中国人民解放军国防科技大学 基于scma-mec的物联网设备计算速率优化方法
CN117676896A (zh) * 2023-12-05 2024-03-08 重庆理工大学 基于强化学习的6G支持mIoT资源分配方法及***
CN117540938A (zh) * 2024-01-10 2024-02-09 杭州经纬信息技术股份有限公司 基于td3强化学习优化的集成式建筑能耗预测方法及***
CN117540938B (zh) * 2024-01-10 2024-05-03 杭州经纬信息技术股份有限公司 基于td3强化学习优化的集成式建筑能耗预测方法及***
CN117833997A (zh) * 2024-03-01 2024-04-05 南京控维通信科技有限公司 一种基于强化学习的noma多波束卫星通信***多维资源分配方法
CN117833997B (zh) * 2024-03-01 2024-05-31 南京控维通信科技有限公司 一种基于强化学习的noma多波束卫星通信***多维资源分配方法
CN117956523A (zh) * 2024-03-22 2024-04-30 北京新源恒远科技发展有限公司 一种面向车联网边缘计算的任务处理方法

Also Published As

Publication number Publication date
CN114885420A (zh) 2022-08-09

Similar Documents

Publication Publication Date Title
WO2023179010A1 (zh) 一种noma-mec***中的用户分组和资源分配方法及装置
Tang et al. Computational intelligence and deep learning for next-generation edge-enabled industrial IoT
CN109729528B (zh) 一种基于多智能体深度强化学习的d2d资源分配方法
Wang et al. Joint resource management for MC-NOMA: A deep reinforcement learning approach
CN111629380B (zh) 面向高并发多业务工业5g网络的动态资源分配方法
CN111556572B (zh) 一种基于强化学习的频谱资源和计算资源联合分配方法
CN110233755B (zh) 一种物联网中雾计算的计算资源和频谱资源分配方法
CN109831808B (zh) 一种基于机器学习的混合供电c-ran的资源分配方法
CN107343268B (zh) 非正交多播和单播传输波束赋型方法及***
CN113596785A (zh) 基于深度q网络的d2d-noma通信***资源分配方法
Wang et al. Multi-agent reinforcement learning-based user pairing in multi-carrier NOMA systems
Cheng et al. Efficient resource allocation for NOMA-MEC system in ultra-dense network: A mean field game approach
Tan et al. Resource allocation of fog radio access network based on deep reinforcement learning
Sun et al. Time-correlated sparsification for efficient over-the-air model aggregation in wireless federated learning
CN114219354A (zh) 一种基于联邦学习资源分配优化方法及***
Mahmoud et al. Federated learning resource optimization and client selection for total energy minimization under outage, latency, and bandwidth constraints with partial or no CSI
Kuang et al. Client selection with bandwidth allocation in federated learning
Luo et al. Communication-aware path design for indoor robots exploiting federated deep reinforcement learning
Yang et al. Training efficiency of federated learning: A wireless communication perspective
CN108449737B (zh) 分布式天线***中基于d2d的下行链路高能效功率分配方法
CN109831759B (zh) 一种基于软件定义无线网络的三维d2d匹配算法
El Haber et al. Latency and reliability aware edge computation offloading in IRS-aided networks
CN115633402A (zh) 一种面向混合业务吞吐量优化的资源调度方法
Sun et al. Energy-efficient resource allocation in c-ran with fronthaul rate constraints
Zhang et al. A convolutional neural network based resource management algorithm for NOMA enhanced D2D and cellular hybrid networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22933061

Country of ref document: EP

Kind code of ref document: A1